Analysis of Variables Affecting Mortality

The purpose of this project is to analyze how six variables affect mortality. The data come from a study of 60 cities in the United States in the 1960s. The predictor variables used in this analysis are PRECIP (mean precipitation in inches), EDUC (mean number of school years completed by people age 25 and over), NONWHITE (percentage of the population that is nonwhite), POOR (percentage of the population with an annual income under $3,000 in the 1960s), NOX (relative pollution potential of nitrogen oxides), and SO2 (relative pollution potential of sulfur dioxide). The response variable is MORTALITY (deaths from all causes per 100,000 people). Four of the predictor variables, NONWHITE, POOR, NOX, and SO2, are skewed, so in this project SO2 and NOX have been transformed by a natural logarithm and NONWHITE and POOR have been transformed by a cube root.

The correlation matrix of the transformed data is as follows:

The matrix plot of the transformed data is:

Fitting the regression function results in the following function:

Y = 980.475 + 2.375X1 - 19.100X2 + 10.104X3 + 8.031X4 + 49.905X5 - 31.098X6

where X1 is PRECIP, X2 is EDUC, X3 is NOX, X4 is SO2, X5 is NONWHITE, and X6 is POOR (NOX and SO2 on the natural-log scale, NONWHITE and POOR on the cube-root scale). The coefficient estimates are B0 = 980.475, B1 = 2.375, B2 = -19.100, B3 = 10.104, B4 = 8.031, B5 = 49.905, and B6 = -31.098.
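To make the fitted function concrete, the following Python sketch applies the stated transformations to raw predictor values and evaluates the equation above. The original analysis was done in R; the function names and the example city here are hypothetical and are not taken from the data set.

```python
import math

# Coefficient estimates of the full model, copied from the text.
COEF = {
    "Intercept": 980.475,
    "PRECIP": 2.375,
    "EDUC": -19.100,
    "NOX": 10.104,
    "SO2": 8.031,
    "NONWHITE": 49.905,
    "POOR": -31.098,
}


def transform(city):
    """Apply the transformations described in the text to raw values."""
    return {
        "PRECIP": city["PRECIP"],
        "EDUC": city["EDUC"],
        "NOX": math.log(city["NOX"]),            # natural logarithm
        "SO2": math.log(city["SO2"]),            # natural logarithm
        "NONWHITE": city["NONWHITE"] ** (1 / 3),  # cube root
        "POOR": city["POOR"] ** (1 / 3),          # cube root
    }


def predict_mortality(city):
    """Evaluate the fitted regression function for one city."""
    x = transform(city)
    return COEF["Intercept"] + sum(COEF[name] * value for name, value in x.items())


# Hypothetical raw values, chosen only for illustration.
hypothetical_city = {
    "PRECIP": 40, "EDUC": 11, "NOX": 10, "SO2": 30, "NONWHITE": 8, "POOR": 27,
}
print(round(predict_mortality(hypothetical_city), 2))
```

Because the coefficients apply to the transformed predictors, raw NOX, SO2, NONWHITE, and POOR values must be transformed before they are plugged into the equation.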

The summary of the regression function is as follows:

The standard errors of the coefficient estimates, taken from the summary table, are as follows:

SE(B0) = 141.9266

SE(B1) = 0.6709

SE(B2) = 7.6787

SE(B3) = 7.1973

SE(B4) = 5.6263

SE(B5) = 11.3256

SE(B6) = 34.5908
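As a rough check on which coefficients are well determined, each estimate can be divided by its standard error to give a t statistic. The Python sketch below does this with the values quoted above; the dictionary layout and function name are illustrative assumptions. With 60 cities and 7 estimated coefficients, these statistics would be compared against a t distribution with 53 degrees of freedom.

```python
# Coefficient estimates and standard errors copied from the text.
estimates = {
    "Intercept": 980.475, "PRECIP": 2.375, "EDUC": -19.100, "NOX": 10.104,
    "SO2": 8.031, "NONWHITE": 49.905, "POOR": -31.098,
}
std_errors = {
    "Intercept": 141.9266, "PRECIP": 0.6709, "EDUC": 7.6787, "NOX": 7.1973,
    "SO2": 5.6263, "NONWHITE": 11.3256, "POOR": 34.5908,
}


def t_statistics(est, se):
    """Return estimate / standard error for every coefficient."""
    return {name: est[name] / se[name] for name in est}


t_stats = t_statistics(estimates, std_errors)
for name, t in t_stats.items():
    print(f"{name:10s} t = {t:7.3f}")
```

By this calculation POOR has the smallest absolute t statistic and NOX the next smallest.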

The ANOVA table for the model is as follows:

Plot of Observed Y vs Fitted Y:

Plot of Residuals vs X1:

Plot of Residuals vs X2:

Plot of Residuals vs X3:

Plot of Residuals vs X4:

Plot of Residuals vs X5:

Plot of Residuals vs X6:

Histogram of Residuals:

Normal Probability Plot of Residuals:

Based on the plot of observed Y vs fitted Y, which follows a roughly straight line, and the plots of residuals against each predictor variable, which show no systematic curved pattern, I do not believe that there is nonlinearity in the data.

After using the all-subsets method and the stepwise regression method, I believe that two variables, POOR and NOX, can be dropped from the model. With the all-subsets method, I obtained the adjusted R squared, Mallows' Cp, and BIC values for the best candidate models. The highest adjusted R squared value and the lowest Mallows' Cp and BIC values all point to the same model: the one in which the predictor variables POOR and NOX are dropped.
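For reference, the three criteria are commonly defined as follows for a candidate model with p coefficients (including the intercept) fitted to n observations. The notation (SSE_p for the candidate model's error sum of squares, MSE_full for the mean squared error of the full model, SSTO for the total sum of squares) is an assumption borrowed from standard regression texts rather than from the original output, and software may report BIC shifted by a constant.

```latex
\begin{align*}
R^2_{\mathrm{adj}} &= 1 - \frac{n-1}{n-p}\cdot\frac{SSE_p}{SSTO} \\
C_p &= \frac{SSE_p}{MSE_{\mathrm{full}}} - (n - 2p) \\
\mathrm{BIC}_p &= n \ln\!\left(\frac{SSE_p}{n}\right) + p \ln n
\end{align*}
```

Here n = 60, the full model has p = 7, and the model without POOR and NOX has p = 5. A larger adjusted R squared and smaller Cp and BIC values indicate a preferable model.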

Using backward stepwise regression in R, I first dropped the predictor variable POOR, because removing it gave the lowest AIC among the candidate removals, lower than the AIC of keeping all variables (the <none> row in the R output). In the second step I dropped NOX for the same reason. In the third step, removing any of the remaining predictor variables did not give an AIC lower than that of the <none> row, so I stopped with a model of 4 predictor variables, having dropped POOR and NOX. Stepwise regression therefore gave the same result as the all-subsets method. The new model is Y = 883.03 + 1.90X1 - 15.22X2 + 14.95X4 + 49.40X5; the predictors X3 (NOX) and X6 (POOR) are no longer in the model.
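The backward procedure described above can be sketched in a few lines of Python. This is not R's step() function and does not use the mortality data: the least-squares helper, the variable names, and the simulated data are all hypothetical. The AIC is computed up to an additive constant as n*log(SSE/n) + 2k, which is assumed here to match the form R uses for linear models.

```python
import math
import random


def ols_sse(rows, y):
    """Error sum of squares of a least-squares fit of y on rows (intercept added)."""
    n = len(y)
    A = [[1.0] + list(r) for r in rows]
    p = len(A[0])
    # Augmented normal equations: (A^T A | A^T y).
    M = [[sum(A[i][r] * A[i][c] for i in range(n)) for c in range(p)]
         + [sum(A[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[r][p] / M[r][r] for r in range(p)]
    return sum((y[i] - sum(b * a for b, a in zip(beta, A[i]))) ** 2
               for i in range(n))


def model_aic(data, y, names):
    """AIC (up to a constant) of the model using the predictors in names."""
    n = len(y)
    rows = [[data[v][i] for v in names] for i in range(n)]
    k = len(names) + 1  # number of coefficients, including the intercept
    return n * math.log(ols_sse(rows, y) / n) + 2 * k


def backward_eliminate(data, y):
    """Drop one predictor at a time while doing so lowers the AIC."""
    current = list(data)
    while current:
        keep_all = model_aic(data, y, current)  # the "<none>" row
        trial = {v: model_aic(data, y, [u for u in current if u != v])
                 for v in current}
        best = min(trial, key=trial.get)
        if trial[best] >= keep_all:
            break  # no removal improves the AIC
        current.remove(best)
    return current


# Simulated data: y depends on x1 and x2 only; "noise" is unrelated to y.
random.seed(0)
n_obs = 40
data = {name: [random.uniform(0, 10) for _ in range(n_obs)]
        for name in ("x1", "x2", "noise")}
y = [3 + 2 * a - 1.5 * b + random.gauss(0, 0.5)
     for a, b in zip(data["x1"], data["x2"])]
kept = backward_eliminate(data, y)
print(kept)
```

At each step the AIC of the current model plays the role of the <none> row: a predictor is removed only if the model without it has a strictly lower AIC.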

In summary, based on the R squared value, about 69.85% of the variation in mortality can be explained by the full regression model, without dropping any variables. Based on the correlation matrix, mortality is positively correlated with precipitation, the pollution potentials of NOX and SO2, the nonwhite percentage of the population, and the percentage of the population that is poor, and negatively correlated with the amount of schooling the population received. Of the predictor variables, NONWHITE has the highest positive correlation with mortality.

Some predictor variables, however, are rather strongly correlated with each other: the correlation is 0.73 for NOX and SO2 and 0.6 for POOR and NONWHITE. Correlation between predictors may affect how the fitted coefficients can be interpreted, so the higher R squared of the full model may not be a reliable indication of how much each individual variable explains.

After dropping the predictor variables NOX and POOR, about 68.47% of the variation in mortality can be explained by the new model. Dropping them reduces R squared slightly, but none of the remaining predictor variables are strongly correlated with each other. More analysis of the data could be necessary, since the strong correlations among some of the original predictor variables may keep the full model from giving accurate results.