**How Do I Calculate Life Expectancy?**

Humans are products of their genetics, but also of their environments. This realization prompts the question of which factors, in areas such as work and health, affect a person's ultimate outcome: their life expectancy. By exploring variables within these two realms, we can observe their effects on human lifespan. To that end, I have formulated two separate hypotheses, one for each area, work and health.

**Geographic areas with less access to health insurance and healthcare, as well as higher populations of unhealthy people, such as smokers and obese people, have decreased average life expectancies.**

**Geographic areas with lower labor force participation rates, declining economies, and higher populations working in manufacturing have decreased average life expectancies.**

These hypotheses carry weight, and it is important to use multiple tools to test them. In this project, I use simple linear regression to begin analyzing the variables I draw from Opportunity Insights (OI). OI is a non-profit dedicated to studying and finding solutions to economic inequality in America through data exploration. The OI website hosts hundreds of datasets that “allow you to analyze social mobility and a variety of other outcomes from life expectancy to patent rates by neighborhood, college, parental income level, and racial background” (Opportunity Insights, 2018). To explore these questions, I used two datasets from OI that group life expectancy averages by commuting zone. To learn more about this approach, see the Data section.

To begin to use this data to answer these questions, I performed simple linear regressions to visualize and understand more basic relationships between various variables and life expectancy, which would help me later on when creating a multiple regression model.

## Data

All data used in this project comes from Opportunity Insights. To explore this question, I used two datasets: one containing commuting zone level life expectancy estimates, and one reporting characteristics of commuting zones, such as the fraction of smokers or the unemployment rate. By definition, commuting zones are “geographic units of analysis intended to more closely reflect the local economy where people live and work” (USDA, 2019). They were developed as an alternative to zip codes, which are harder to collect data for because there are so many of them. To explore this question to the fullest extent, I used the R programming environment to create models and derive statistics from the data.

The first dataset is a report of commuting zone level life expectancy estimates sorted by income quartile. Because remaining life expectancy depends on the age at which it is measured, these values were estimated for men and women at age 40. The data is sorted by income quartile, meaning the first income quartile consists of the 1st to 25th percentile, the second consists of the 26th to 50th percentile, and so on. OI is focused on exploring social mobility, which is the reasoning behind sorting the data by quartile.

In the simple linear regression analysis, all income quartiles are explored. However, the final multiple regression models use only the first income quartile. (Combining the quartile-level life expectancy values into one overall value for each commuting zone would introduce statistical disparities, and the first income quartile consistently had the most statistically significant data, which is the reasoning behind this decision.)

The next dataset lists characteristics of commuting zones regarding factors in the realms of education, healthcare, and the labor force. To collect these statistics, OI used data from multiple sources, such as Census records and IRS income statistics. Within the commuting zone characteristics dataset, most statistics are reported as percentages, except Medicare expenditures per enrollee and mean household income. Each statistic is the average within that commuting zone. Further, three statistics, the smoking, exercise, and obesity rates, were collected as commuting zone averages but additionally sorted into income quartiles.

Within the regression models, each data point is an average over the people living in a commuting zone. The data covers 596 of the 740 commuting zones in America (Martin et al. 1), about 80%, because Opportunity Insights only included statistics from zones with populations over 25,000.

# Simple Regression Exploration

Before beginning to create a multiple regression model, it is important to understand the relationships that exist between the predictor and response variables within linear models, and to do this, we can use simple linear regression.
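As a sketch of this workflow, a simple regression in R amounts to a single `lm()` call. The data frame and column names below (`cz`, `obese_q1`, `le_q1`) are hypothetical stand-ins for the OI data, and the response is simulated rather than taken from the real estimates:

```r
# Hypothetical stand-in for the OI commuting-zone data: column names
# are illustrative, and le_q1 is simulated, not the real OI estimates
set.seed(1)
cz <- data.frame(obese_q1 = runif(200, 0.15, 0.45))
cz$le_q1 <- 79.92 - 2.38 * cz$obese_q1 + rnorm(200, sd = 0.5)

# Fit life expectancy on the obesity rate and inspect the output
fit <- lm(le_q1 ~ obese_q1, data = cz)
summary(fit)   # intercept, slope, p-values, R-squared
```

The `summary()` output supplies everything quoted in the sections that follow: the fitted intercept and slope, their p-values, and the R-squared.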

## Health Variables

**Obesity**

I decided to first explore the effect of rates of obesity within commuting zones on life expectancy. The OI data groups both life expectancy and obesity rates by income quartile, so I chose to explore each of the four income quartiles through the usage of a scatterplot and linear regression in R. Figure 0.1 shows scatterplots of life expectancy plotted on the fractional average rate of obesity within each respective income quartile.

##### Figure 0.1

```
# Arrange the four quartile scatterplots into one figure (ggpubr::ggarrange)
figure <- ggarrange(q1, q2, q3, q4,
                    labels = c("Q1", "Q2", "Q3", "Q4"))
plot(figure)
```

By exploring each quartile separately, we can observe how rates of obesity affect each income tier differently. The regression outputs for each quartile are as follows (where X is the fractional rate of obesity):

```
Quartile 1: (LE) = 79.9237 - 2.3808X
Quartile 2: (LE) = 83.3790 - 3.2225X
Quartile 3: (LE) = 85.0321 - 1.9639X
Quartile 4: (LE) = 86.9885 - 3.1784X
```

Here, it is interesting to observe the Y-intercepts of each income quartile: as income level goes up, so does the intercept. The Y-intercept tells us the expected value of Y when X is zero, so the expected life expectancy at an obesity rate of zero increases with income.

Here, all four income quartiles have a negative slope: as the fraction of the population that is obese within a commuting zone increases, predicted life expectancy decreases, regardless of income quartile. Within the first income quartile, the slope tells us that for every 0.01 (one percentage point) increase in the obesity rate, life expectancy is predicted to decrease by 0.0238 years, or about 8.7 days.

To interpret the slope, let's work through an example. In the South Boston commuting zone, the fractional rate of obesity within the first quartile is 0.285. Let's substitute this value into the linear regression equation for the first income quartile:

`79.9237 - 2.3808(0.285) = 79.245`

Within the first income quartile, if a commuting zone has an average obesity rate of 0.285, the predicted life expectancy of that commuting zone is 79.245 years. Now, let’s use this same value of 0.285 in the regression equation for the fourth income quartile:

`86.9885 - 3.1784(0.285) = 86.0826`

The difference between these two outputs is 6.8376 years, telling us that even if the rate of obesity within two different income quartiles is the same, the person living in the fourth income quartile has a longer predicted life expectancy.
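The two predictions above can be checked with a few lines of R arithmetic, using the coefficients from the Q1 and Q4 equations:

```r
# Predicted life expectancy at an obesity rate of 0.285, using the
# fitted Q1 and Q4 equations from the regression outputs above
x  <- 0.285
q1 <- 79.9237 - 2.3808 * x   # first-quartile prediction
q4 <- 86.9885 - 3.1784 * x   # fourth-quartile prediction
round(c(q1 = q1, q4 = q4, gap = q4 - q1), 3)   # gap is the ~6.84-year difference
```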

When comparing the slopes of the different income quartiles, the second quartile's slope (-3.2225) has the largest magnitude, closely followed by the fourth (-3.1784), while the third quartile's slope (-1.9639) has the smallest. This means an increased population of obese people is associated with the largest predicted drop in life expectancy in the second and fourth quartiles.

These results are statistically significant (p = 0.0001238, p = 2.593e-08, p = 2.07e-05, p = 5.669e-09, in order of quartile), so we can reasonably conclude that there is a negative linear relationship: life expectancy decreases as the fractional average of obese people increases.

**Smoking**

Smoking is understood to be detrimental to one's health, and since the data was separated by income quartiles like the obesity data, I completed a linear regression exploration of this variable by income quartile. Figure 0.2 shows the scatterplots and regression outputs of life expectancy plotted on the average fractional rate of current smokers within each respective income quartile.

##### Figure 0.2

```
# Arrange the four quartile scatterplots into one figure (ggpubr::ggarrange)
figure <- ggarrange(sq1, sq2, sq3, sq4,
                    labels = c("Q1", "Q2", "Q3", "Q4"))
plot(figure)
```

The regression outputs for each quartile are as follows (where x is the fractional rate of current smokers):

```
Quartile 1: (LE) = 80.689 - 5.352X
Quartile 2: (LE) = 83.1997 - 3.1559X
Quartile 3: (LE) = 84.8161 - 1.7895X
Quartile 4: (LE) = 86.7688 - 4.1519X
```

Again, we see that as the income quartile increases, so does the Y-intercept, the expected life expectancy when the rate of smokers is zero. However, predicted life expectancy decreases as the rate of smoking increases regardless of income quartile, as shown by the negative coefficients in each equation.

It is interesting to note the coefficients themselves, as there is no clear relationship between their values and income quartile, unlike the Y-intercepts. The first income quartile's coefficient of -5.352 tells us that, within this dataset, smoking has a greater effect on life expectancy in the first income quartile than in any other. In terms of years, the -5.352 coefficient shows that for every 0.01 (one percentage point) increase in the fraction of smokers, life expectancy is expected to decrease by 0.0535 years, or about 19.5 days.

To again interpret the coefficients, we can use an example. In the South Boston commuting zone, the current percentage of smokers within the first quartile is 0.252. Let’s substitute this value into the linear regression equation for the first income quartile:

`80.689 - 5.352(0.252) = 79.3402`

The predicted life expectancy falls about 1.35 years below the intercept of 80.689 when the fraction of smokers is 25.2%. This reflects the negative relationship: as the percentage of smokers within a commuting zone increases, predicted life expectancy decreases, leading us to the conclusion that smoking decreases life expectancy. Because the p-values are statistically significant (p = 2.238e-15, p = 3.022e-05, p = 0.002072, p = 8.884e-08), it is reasonable to draw conclusions from the data.
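The days-per-percentage-point figures used here and in the obesity section all come from one conversion, slope × 0.01 × 365. A small helper function makes that explicit:

```r
# Convert a regression slope on a fractional predictor into days of
# life expectancy per one-percentage-point (0.01) change
days_per_point <- function(slope) slope * 0.01 * 365

days_per_point(-5.352)    # Q1 smoking slope: about -19.5 days per point
days_per_point(-2.3808)   # Q1 obesity slope: about -8.7 days per point
```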

#### Percentage of Uninsured People

The next health-related variable I chose to explore was the percentage of uninsured people within a commuting zone. I chose this variable because the decision to have health insurance is not always one a person is free to make. A person can decide to start smoking, which is detrimental to their health, but a person cannot always choose to have medical insurance, especially if they are unemployed and not receiving benefits from a job, or simply cannot afford its skyrocketing costs. Health insurance is indicative of health in that it makes essential services, such as emergency room visits, laboratory tests, substance-abuse treatment, and prescription drugs, accessible to those who could not otherwise afford them.

Although the health insurance variable is not grouped by income quartile like the obesity and smoking variables, I again plotted it against the life expectancy estimates of the four income quartiles to observe how the presence of health insurance affects different income levels. The scatterplots and regression outputs are shown in Figure 0.3:

##### Figure 0.3

The regression outputs are as follows (where x is the percentage of uninsured people within a commuting zone):

```
Quartile 1: (LE) = 79.6307 - 2.3431X
Quartile 2: (LE) = 83.8513 - 7.4084X
Quartile 3: (LE) = 86.1536 - 9.0602X
Quartile 4: (LE) = 87.7835 - 8.4030X
```

The trend of the Y-intercept increasing with income is again present. However, the pattern of the coefficients changes: the Q1 coefficient is the smallest in magnitude, while Q3's is the largest, meaning that being uninsured has a greater effect on life expectancies within the third and fourth quartiles, which have the largest coefficients of the four.

Within all four income quartiles, the negative coefficient indicates a negative relationship between the rate of uninsured people and life expectancy. Within the first income quartile, the slope of -2.3431 tells us that for every one-percentage-point increase in the percentage of uninsured people, life expectancy is expected to decrease by 0.0234 years, or about 8.55 days.

To interpret the coefficient, we can use an example. In the South Boston commuting zone, the percentage of uninsured people was 0.0501, or 5.01%. Let’s substitute this value into the linear regression equation for the first income quartile as well as for the fourth quartile:

```
Quartile 1: 79.6307 - 2.3431(0.0501) = 79.5133   (difference from intercept: -0.1174)
Quartile 4: 87.7835 - 8.4030(0.0501) = 87.3625   (difference from intercept: -0.4210)
```

Despite the percentage of uninsured people being the same, a 5% uninsured rate has a greater effect on predicted life expectancy within the fourth income quartile. Here, the p-values are statistically significant (p = 0.01022, p < 2.2e-16, p < 2.2e-16, p < 2.2e-16), so we can draw conclusions from the results. One possible explanation is that most people in the third and fourth income quartiles can afford health insurance, so being uninsured is rarer in those quartiles and more strongly associated with differences in life expectancy.
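The quartile comparison above can be vectorized in R, evaluating all four uninsured-rate equations at once (intercepts and slopes copied from the outputs above):

```r
# Intercepts and slopes of the four uninsured-rate regressions
intercepts <- c(q1 = 79.6307, q2 = 83.8513, q3 = 86.1536, q4 = 87.7835)
slopes     <- c(q1 = -2.3431, q2 = -7.4084, q3 = -9.0602, q4 = -8.4030)

# Predicted life expectancy in each quartile at South Boston's
# uninsured rate of 0.0501
round(intercepts + slopes * 0.0501, 2)
```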

## Work/Labor Variables

#### Fraction of People Working In Manufacturing

Working in manufacturing means being surrounded by heavy machinery and moving parts for hours at a time, making it a dangerous job. Workplace injuries can leave a worker unemployed for months at a time, and that period of unemployment can mean the difference between being able to pay for a doctor's visit or not. For that reason, among others, I chose to explore this variable within each income quartile. The scatterplot and regression output are shown in Figure 0.4, and the linear regression output is as follows (where X is equal to the fraction of people in a commuting zone working in manufacturing):

##### Figure 0.4

```
# Only the Q1 panel is shown, since only Q1 was statistically significant
figure <- ggarrange(mq1, labels = c("Q1"))
plot(figure)
```

**Quartile 1: (LE) = 79.5143 – 1.9435X**

The only statistically significant results came from the first income quartile (p = 0.0009991), so I chose to explore only these results. This raises the question of why this is the only quartile with statistically significant data. People working in manufacturing and other blue-collar industries are often paid less than those with office jobs, suggesting that the Q1 data is significant because that quartile has the highest share of people working in manufacturing.

Figure 0.4 shows that as the average fraction of people working in manufacturing increases, life expectancy decreases. To interpret the negative coefficient, we can again use the value from the South Boston commuting zone to see the impact of working in manufacturing on life expectancy. In the South Boston commuting zone, 0.1203, or 12.03%, of the population works in manufacturing. Substituting this value into the Q1 regression output, we see:

`**79.5143 - 1.9435(0.1203) = 79.2804**`

This value tells us that when 12.03% of the population within a commuting zone works in manufacturing, the predicted first-quartile life expectancy is 79.2804 years, which is 0.2339 years below the intercept of 79.5143. Overall, the trend shows that as the average percentage of people working in manufacturing increases, life expectancy decreases. For every one-percentage-point increase in the share working in manufacturing, life expectancy is expected to decrease by 0.0194 years, or about 7.1 days.

#### Labor Force Participation Rate

The next variable I chose to explore with linear regression is the labor force participation rate: the percentage of the civilian, non-institutionalized population aged 16 and older that is working or actively looking for work. I chose this variable because labor force participation is indicative of many things, such as the amount of labor resources available within a given market, or the health of an economy. The scatterplots and regression outputs are shown in Figure 0.5:

##### Figure 0.5

```
# Arrange the four quartile scatterplots into one figure (ggpubr::ggarrange)
figure <- ggarrange(q1, q2, q3, q4,
                    labels = c("Q1", "Q2", "Q3", "Q4"))
plot(figure)
```

The regression equations are as follows: (where X is the labor force participation rate percentage in a commuting zone):

```
Quartile 1: (LE) = 77.9314 + 2.0669X
Quartile 2: (LE) = 79.4858 + 4.9025X
Quartile 3: (LE) = 81.0273 + 5.6511X
Quartile 4: (LE) = 81.8176 + 7.2043X
```

Here, the Y-intercept again increases with income quartile, and so does the slope coefficient. All slopes are positive, showing a positive relationship between life expectancy and the share of people in the labor force. Within the first income quartile, a one-percentage-point increase in the participation rate is predicted to increase life expectancy by 0.0207 years, or about 7.5 days. Because the results are statistically significant (p = 0.008658, p = 1.553e-11, p = 8.98e-16, p < 2.2e-16), it is reasonable to draw conclusions from the data.

To make sense of the slopes, we can again use the South Boston commuting zone as an example for two income quartile examples, in this case the first and the fourth. In the South Boston commuting zone, the labor force participation rate is 0.665.

```
Quartile 1: 77.9314 + 2.0669(0.665) = 79.306   (difference from intercept: +1.3746)
Quartile 4: 81.8176 + 7.2043(0.665) = 86.608   (difference from intercept: +4.7904)
```

Within the same commuting zone, the same labor force participation rate produces a greater increase over the intercept in the fourth income quartile than in the first; the gap between the two gains is 3.4158 years. This tells us that the labor force participation rate has a greater impact within the fourth quartile than in any other quartile.
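Because both equations are linear, the Q4-minus-Q1 prediction gap at any participation rate x decomposes into an intercept gap plus a slope gap scaled by x; the roughly 3.42-year figure above is the slope-gap term:

```r
# Decomposing the Q4 vs Q1 predicted gap at a participation rate of 0.665,
# using the intercepts and slopes from the regression outputs above
x <- 0.665
intercept_gap <- 81.8176 - 77.9314        # gap at x = 0
slope_gap     <- (7.2043 - 2.0669) * x    # about 3.42 years, the relative gain
intercept_gap + slope_gap                 # total predicted Q4 - Q1 gap at x
```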

#### Fraction of People with Commutes Less Than 15 Minutes

The next variable I chose to explore was the fractional average of people in a commuting zone who have commute times that are less than 15 minutes. I chose to explore this variable because of the noted benefits of having a shorter commute. A shorter commute means less time spent in traffic, waiting for a train, or walking to work. A shorter, less stressful commute can allow for an employee to dedicate time to activities other than work, which can be beneficial for one’s health. Scatterplots graphing the fractional average of people with commutes less than 15 minutes on average life expectancy are shown in Figure 0.6:

##### Figure 0.6

```
# Arrange the four quartile scatterplots into one figure (ggpubr::ggarrange)
figure <- ggarrange(q1, q2, q3, q4,
                    labels = c("Q1", "Q2", "Q3", "Q4"))
plot(figure)
```

The regression outputs are as follows, where X is the fraction of people with a commute less than 15 minutes in a commuting zone:

```
Quartile 1: (LE) = 78.7829 + 1.0295X
Quartile 2: (LE) = 81.3830 + 2.7391X
Quartile 3: (LE) = 83.3716 + 2.7752X
Quartile 4: (LE) = 85.8372 + 1.0346X
```

Here, all slopes are positive regardless of income, showing a positive relationship between life expectancy and the percentage of people with shorter commutes. Within the first income quartile, a one-percentage-point increase in the share of people with commutes under 15 minutes is predicted to increase life expectancy by 0.0103 years, or about 3.76 days. Because our results are statistically significant (p = 0.01281, p = 6.394e-13, p = 6.823e-14, p = 0.01024), it is reasonable to draw conclusions from the data.

However, it is interesting to observe that the income quartiles with the largest p-values, Q1 and Q4, also have the smallest slope coefficients. These results support my hypothesis, showing a positive relationship between shorter commute times and increased life expectancy.

#### Percent Change in Labor Force 1980-2000

The last work-related variable I chose to explore using simple linear regression was the percentage change in the labor force from 1980 to 2000. I chose this variable for its ability to indicate the growth or decay of a commuting zone's economy through how many people are employed in it. Figure 0.7 shows scatterplots for the four income quartiles, with the percentage change in the labor force plotted against life expectancy.

##### Figure 0.7

```
# Arrange the four quartile scatterplots into one figure (ggpubr::ggarrange)
figure <- ggarrange(q1, q2, q3, q4,
                    labels = c("Q1", "Q2", "Q3", "Q4"))
plot(figure)
```

The regression outputs are as follows, where X is equal to the percentage change in labor force in a commuting zone:

```
Quartile 1: (LE) = 78.91462 + 1.04686X
Quartile 2: (LE) = 82.39775 + 0.40574X
Quartile 3: (LE) = 84.47788 + 0.13098X
```

The slopes of all three quartiles are positive, indicating a positive relationship between growth in the labor force and life expectancy. Within the first income quartile, a one-percentage-point change in the labor force is predicted to increase life expectancy by 0.0105 years, or about 3.82 days. Because the results from the fourth income quartile are not statistically significant (p = 0.3909), I explore only the results from Q1, Q2, and Q3, which are statistically significant (p = 1.797e-10, p = 0.009416, p = 0.0007074).

To better interpret what these regression equations mean, I am going to compare life expectancy in the first income quartile in two different commuting zones, a zone where there has been negative labor force participation change, and a zone where there has been positive labor force participation change. To do this, I will use the Welch commuting zone in West Virginia with a negative percentage change of -0.3743832, and the Spartanburg commuting zone in South Carolina with a positive percentage change of 0.2851822. Inserting these values into the Q1 regression equation gives us:

```
Welch:       78.91462 + 1.04686(-0.3743832) = 78.5227   (difference from intercept: -0.3919)
Spartanburg: 78.91462 + 1.04686(0.2851822)  = 79.2132   (difference from intercept: +0.2985)
```

These results show that a negative change in the labor force, that is, a decline in the number of labor force participants in a commuting zone, is associated with a decrease in predicted first-quartile life expectancy, while a positive change is associated with an increase. Because the results are statistically significant, it is reasonable to draw conclusions from the data. An increase in labor force participation can indicate a growing economy, for example, more families moving to a commuting zone and joining the labor force. More people working means more household income and wealth, which can increase access to healthcare, education, and food.
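The Welch and Spartanburg comparison can be reproduced by wrapping the Q1 equation in a function:

```r
# Q1 regression equation for percent change in the labor force, 1980-2000,
# using the intercept and slope from the output above
predict_le_q1 <- function(change) 78.91462 + 1.04686 * change

predict_le_q1(-0.3743832)   # Welch, WV: shrinking labor force
predict_le_q1(0.2851822)    # Spartanburg, SC: growing labor force
```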

# Multiple Regression Exploration

The exploratory data analysis consistently showed statistically significant correlations between work and health variables and life expectancy in the first income quartile (the 1st to 25th percentile). To further explore these connections, I am going to create a multiple regression model that uses additional explanatory variables alongside those shown above to more accurately predict life expectancy within the first income quartile.

## Multiple Regression Health Exploration

Broadly, data shows that greater expenditures on healthcare are associated with longer life expectancies, making it clear that health and healthcare play vital roles in living a long life. While we have explored how these variables individually affect life expectancy, I decided to create a multiple regression model to combine variables of different types. After analyzing the data through simple linear regression models and correlation matrices, I chose a model that uses the fraction of current smokers in Q1, the fraction of obese people in Q1, and the percentage of uninsured people in 2010 to predict life expectancy in the first income quartile.

I chose these variables not because of their strong correlation coefficients, but because of their ability to represent different aspects of healthcare. For example, it is a personal decision to smoke, as one is aware of the health consequences. However, it is not always a personal decision to be uninsured: one may have lost a job, and with it health benefits, or simply be unable to afford health insurance. Exploring both types of variables allows for greater connections between life expectancy and healthcare to be drawn.

#### Correlation Matrix

To investigate variables in the dataset that have not yet been explored but may have strong associations with life expectancy, I created a correlation matrix in R. Looking at the results, the variables with the strongest associations with life expectancy, positive or negative, are (see Figure 2.1):

- **-0.317**: Fraction of smokers in the first income quartile
- **-0.459**: Average Medicare expenditure per enrollee within a commuting zone
- **-0.233**: 30-day mortality rate for heart attack patients

Out of the ten health-related variables in the table, there is only one positive r-value: **0.0786**, the average 30-day mortality rate for heart failure patients. This is a surprising find, suggesting that as the 30-day mortality rate for heart failure patients increases, life expectancy increases. These values tell us that within our set of health variables, the strongest correlations are negative: average Medicare expenditure per enrollee, the fraction of smokers, and the 30-day mortality rate for heart attack patients. As each of these variables increases (disregarding the heart failure value), life expectancy decreases.

Collectively looking at the data, the strongest correlations between predictor variables exist between:

- **0.3958**, between the percentage of uninsured people in 2010 and average Medicare expenditure per enrollee: as the percentage of uninsured people increases, the average Medicare expenditure per enrollee increases
- **0.7873**, between the 30-day hospital mortality rate index and the 30-day mortality rate for heart attacks: as the index increases, the heart attack mortality rate increases as well
- **0.2698**, between the 30-day hospital mortality rate for pneumonia and the percentage of uninsured people in 2010: as the pneumonia mortality rate increases, the percentage of uninsured people increases
- **0.3037**, between the fraction of obese people in the first income quartile and average Medicare expenditure per enrollee: as the fraction of obese people increases, the average Medicare expenditure per enrollee increases

These r-values are not in relation to the response variable, but rather, in relation to other predictor variables. However, most of these values are low, which is beneficial for the multiple regression model. Stronger associations between explanatory variables can indicate multicollinearity, which can mean a less accurate model. There are a few cases of multicollinearity within the table, which we should investigate further.

The strongest r-value in the table is **0.7873**. This value indicates that as the 30-day hospital mortality rate index increases, the 30-day mortality rate for heart attacks increases as well. This is a case of multicollinearity, as the two variables are strongly correlated. The 30-day hospital mortality rate accounts for the other 30-day mortality rate values within the table for heart attacks, heart failure, and pneumonia. To rid the model of cases of multicollinearity, the final model will only use the overall 30-day mortality rate index, as it accounts for these three variables.
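The screening step can be sketched in R: compute the pairwise correlations among candidate predictors and flag any pair above a cutoff. The data here is simulated, with hypothetical names; only the technique mirrors the analysis above.

```r
set.seed(2)
n <- 200
# Simulated predictors: the mortality index is built from the heart
# attack rate, so this pair is deliberately collinear
heart_attack_30d <- rnorm(n)
mortality_index  <- 0.8 * heart_attack_30d + rnorm(n, sd = 0.5)
uninsured_2010   <- runif(n, 0.05, 0.25)

cmat <- cor(cbind(heart_attack_30d, mortality_index, uninsured_2010))
round(cmat, 2)

# Flag predictor pairs with |r| above 0.7 as multicollinearity risks
which(abs(cmat) > 0.7 & upper.tri(cmat), arr.ind = TRUE)
```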

#### Multiple Regression Model

To create a multiple regression model, I chose variables that showed no multicollinearity within the correlation matrix, along with variables that showed strong correlations with life expectancy. Model building involves trial and error, so I used backward stepwise regression to arrive at the most accurate model. The outputs of these models are in Figures 2.2, 2.3, and 2.4.

```
Model 1: (LE) = 83.45 - 3.524(Fraction of Smokers) + 0.4922(Fraction of Obese People)
               + 2.69(Percentage Uninsured) - 0.0003658(Avg Medicare $ Per Enrollee)
               - 0.229(30-Day Hospital Mortality Rate Index)
               - 0.5114(Percent of Medicare Enrollees with at least 1 Primary Care Visit)

Model 2: (LE) = 83.09 - 3.568(Fraction of Smokers) + 0.4259(Fraction of Obese People)
               + 2.727(Percentage Uninsured) - 0.0003682(Avg Medicare $ Per Enrollee)
               - 0.2303(30-Day Hospital Mortality Rate Index)

Model 3: (LE) = 83.16 - 3.56(Fraction of Smokers) + 2.659(Percentage Uninsured)
               - 0.0003607(Avg Medicare $ Per Enrollee)
               - 0.2236(30-Day Hospital Mortality Rate Index)
```

In order to create these models, I used backward elimination: start with a model containing all explanatory variables, then remove variables one at a time whenever a p-value exceeds the alpha level. I stopped at Model 3, where every p-value is statistically significant and the adjusted R-squared is the highest of the three models. The R-squared value of 0.2873 means that Model 3 explains 28.73% of the variance in life expectancy. The model itself is statistically significant (p < 2.2e-16), indicating a significant association between the final four predictor variables and life expectancy.
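Backward elimination can be automated in R with `step()`, which drops terms by AIC rather than by p-value, a common automated analogue of the procedure described here. The data below is simulated with hypothetical names, and `obese` is given no true effect so that it is the natural candidate for removal:

```r
set.seed(3)
n <- 300
d <- data.frame(
  smokers   = runif(n, 0.10, 0.40),
  obese     = runif(n, 0.20, 0.40),   # no true effect in this simulation
  uninsured = runif(n, 0.05, 0.25),
  medicare  = rnorm(n, 9800, 900)
)
d$le <- 83 - 3.5 * d$smokers + 2.7 * d$uninsured -
        0.00036 * d$medicare + rnorm(n, sd = 0.9)

full    <- lm(le ~ smokers + obese + uninsured + medicare, data = d)
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)   # the surviving predictors
```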

What kind of relationship does this model show between the variables? To analyze it, we can look at the Y-intercept and the coefficients. The Y-intercept of **83.16** tells us that when all of the explanatory variables are equal to zero, the predicted life expectancy is 83.16 years. To interpret the equation as a whole, it helps to work through an example commuting zone; here we substitute the South Boston commuting zone data into the final regression model:

`83.16 - 3.56(0.2524504) + 2.659(0.05015712) - 0.0003607(9840.8985) - 0.2236(-1.130316) = 79.098`

Based on our final multiple regression model, the predicted life expectancy of a person in the first income quartile living in the South Boston commuting zone is 79.1 years.

Looking at the model as a whole, we observe that the residual standard error is 0.9408 on 590 degrees of freedom. This means that this model predicts the average life expectancy of a person in the first income quartile with an average error of 0.9408 years. The importance of this value comes from its ability to compare the accuracy of different regression models. For example, when comparing the residual standard error of Model 3 with Model 1, we see that Model 1 predicts the average life expectancy of a person in the first income quartile with an average error of 0.9416 years. This is a greater margin of error than Model 3, telling us that Model 3 is a more accurate model, and therefore, more effective.
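In R, the residual standard error of a fitted model is available through `sigma()`, which makes this kind of model comparison a one-liner. The built-in `mtcars` data stands in for the OI data here:

```r
# Compare two nested models by residual standard error (smaller = tighter fit)
full    <- lm(mpg ~ wt + hp + disp, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)
c(full = sigma(full), reduced = sigma(reduced))
```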

#### Residuals

Analyzing residuals is a good way to assess whether the estimates and predictions made by the model are biased.

Figure 2.5 shows a histogram of the residuals from Model 3. The residuals are centered around zero and approximately normally distributed, with no apparent skew or outliers in the model.
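The work-model residual code later in this report suggests Figure 2.5 was produced along these lines; this is a sketch, again assuming the `final` dataset and the `hlthmr3` model fit in Figure 2.4.

```r
# Refit the final health model and plot a histogram of its residuals
hlthmr3 = lm(le_q1_both ~ cur_smoke_q1 + puninsured2010 +
    reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
healthResid = resid(hlthmr3)
hist(healthResid, main = "Histogram of Model 3 Residuals")
```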

The first graph in Figure 2.6 shows a Residuals vs. Fitted plot. The points are distributed fairly evenly above and below the zero line, though they band near the center of the fitted values, around 79.5. The red trend line shows a slight curve but mostly stays near the zero line: the residuals grow as the fitted values move away from 79.5, and shrink toward zero as the fitted values approach it.

The second graph in Figure 2.6 shows a Normal Quantile-Quantile plot. The residuals are centered around the line, with a small amount of right skew. There is a low outlier at the left end of the plot, but otherwise the data follows the conditions for normality.

## Multiple Regression Work/Labor Exploration

#### Correlation Matrix

To investigate relationships between other variables and our response variable of life expectancy, I created a correlation matrix in R. Within the results, the variables that hold the strongest associations, negative or positive, with life expectancy are as follows (see Figure 3.1):

**0.2576** Percent change in the labor force from 1980-2000

**0.1575** Mean household income

**-0.1346** Share working in manufacturing

Within the correlation matrix, I employed 10 explanatory variables in total, 6 of which relate directly to work (similar to the variables used in the linear regression analysis). The remaining 4 relate to income segregation and inequality, such as the Gini index measure; I chose to include these because of the direct relationship between labor and income. Of the 6 work-related variables, only one, the share working in manufacturing, has a negative correlation with the response variable. Notably, within the linear regression exploration, the share working in manufacturing was also the single variable with a negative slope.

Collectively looking at the data, the most notable correlations between predictor variables are:

**-0.559** The average unemployment rate in 2000 and the average labor force participation rate: as the unemployment rate increases, the labor force participation rate decreases.

**-0.554** The average fraction of people with commuting times under fifteen minutes and income segregation: as the fraction of people with commutes under fifteen minutes increases, income segregation decreases.

**0.0998** The average unemployment rate in 2000 and the share of a population working in manufacturing: as the unemployment rate increases, the share of the population working in manufacturing increases.

**-0.408** The average top 1% income share and the average fraction of people with commuting times under fifteen minutes: as the top 1% income share increases, the fraction of people with commutes under 15 minutes decreases.

**-0.7313** The labor force participation rate and the poverty rate: as the labor force participation rate increases, the poverty rate decreases.

While observing the r-values of explanatory variables in relation to other explanatory variables, the values for the fraction with commutes under 15 minutes stood out to me. Compared to any other variable, this variable had the greatest number of negative r-values. Its strongest correlation overall was with the income segregation variable, at -0.5548: the data shows a clear decrease in income segregation as the population with shorter commutes increases. The scatterplot below models this relationship:

```
ggplot(final, aes(x = frac_traveltime_lt15, y = cs00_seg_inc)) +
  geom_point() +
  geom_smooth(method = lm)
```

While observing the matrix, I noticed strong associations between variables related to income inequality. For example, the correlation between the average poverty rate and the Gini index within the bottom 99% (the Gini index is a measure of wealth distribution) was 0.66, one of the strongest associations within the data. Strong associations such as these can indicate multicollinearity, which can result in a less accurate model, even when the variables appear to be independent. To check numerically for multicollinearity, we can find the variance inflation factor (VIF) of each predictor in the model.

The graph below visualizes the resulting VIF values.

```
modelCheck = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man + hhinc00 +
    cs00_seg_inc + gini99 + poor_share + inc_share_1perc + lf_d_2000_1980 +
    cs_labforce + frac_traveltime_lt15, data = final)
vifVal = vif(modelCheck)
View(vifVal)
summary(vifVal)
barplot(vifVal, main = "VIF Values", horiz = TRUE, col = "chartreuse4")
```

The above output shows that there are two variables that are above or near the “threshold” value of 5. A value greater than 5 is indicative of a severe relationship between a predictor variable and other variables within the model, which can decrease the accuracy of the overall model. The variables with significant VIF values are the mean household income in 2000 and the poverty rate. To ensure an accurate model, these two variables will not be used in future calculations. The variable with the next highest VIF is the labor force participation rate, but its value of 3.389 means it is still viable for accurate multiple regression analysis.
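The follow-up check can be sketched as below (`modelCheck2` is a hypothetical name; this assumes the `final` dataset and the `vif()` function used above): refit without the two high-VIF predictors and confirm the remaining values fall below the threshold.

```r
# Drop hhinc00 (mean household income) and poor_share (poverty rate),
# the two predictors flagged by VIF, then recheck the remaining values.
modelCheck2 = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man + cs00_seg_inc +
    gini99 + inc_share_1perc + lf_d_2000_1980 + cs_labforce +
    frac_traveltime_lt15, data = final)
vif(modelCheck2)  # all values should now fall below the threshold of 5
```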

#### Multiple Regression Model

With a dataset clear of multicollinearity, we can now create an accurate model from the remaining data. Here, I again chose to use the backward elimination method in order to create the most statistically significant model. To view the outputs of these models, see Figures 3.3, 3.4, and 3.5. The resulting model equations are as follows:

**Model 1:** (LE) = 75.413 + 11.103(unemploymentRate) – 2.710(manufacturingShare) – 13.15(incomeSegregation) + 0.032(giniIndex) + 5.161(1%IncomeShare) + 0.937(laborForceChange) + 5.398(laborForceParticipation) + 0.22662(shortCommute)

**Model 2:** (LE) = 75.4357 + 11.107(unemploymentRate) – 2.712(manufacturingShare) – 13.134(incomeSegregation) + 5.162(1%IncomeShare) + 0.937(laborForceChange) + 5.38(laborForceParticipation) + 0.227(shortCommute)

**Model 3:** (LE) = 75.48 + 11.124(unemploymentRate) – 2.841(manufacturingShare) – 13.66(incomeSegregation) + 5.103(1%IncomeShare) + 0.9138(laborForceChange) + 5.55(laborForceParticipation)

Model 3, the final model, has a residual standard error of 1.012, meaning that it predicts the average life expectancy of a person in the first income quartile with an average error of 1.012 years. This value is important when comparing the accuracy of regression models: Models 1 and 2 both had residual standard errors of 1.013.

To create an accurate, statistically significant model, I began with all variables and, at each step, eliminated the single variable whose p-value exceeded 0.05. In all, I eliminated the Gini index and short-commute variables, leaving the model with a total of 6 predictor variables. In the third and final model, Model 3, all p-values were statistically significant, as was the overall model (p < 2.2e-16). The adjusted R-squared value of 0.1756 means that Model 3 explains 17.56% of the variance in life expectancy.

What kind of relationships does the model suggest exist between the variables? Within the model, there are negative slopes for the share working in manufacturing and the income segregation variables, and positive slopes for the percent change in labor force, labor force participation rate, top 1% income share, and unemployment rate variables.

Now, let’s use the model with a commuting zone example to better understand the equation overall. For this example, we are going to use the Boston, Massachusetts commuting zone in the Model 3 equation:

**Boston Example:** 75.48 + 11.124(0.036) – 2.841(0.12) – 13.66(0.092) + 5.103(0.214) + 0.9138(0.178) + 5.55(0.66)

**= 79.2 years**

Based on the final regression model, the predicted life expectancy of a person in the first income quartile living in the Boston commuting zone is 79.2 years.
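As with the health model, the dot product can be checked directly in R; the coefficients come from Figure 3.5, and the Boston predictor values are the ones quoted above.

```r
# Model 3 coefficients: intercept, unemployment rate, manufacturing share,
# income segregation, top 1% income share, labor force change, participation
coefs  = c(75.48, 11.124, -2.841, -13.66, 5.103, 0.9138, 5.55)
boston = c(1, 0.036, 0.12, 0.092, 0.214, 0.178, 0.66)
sum(coefs * boston)
## [1] 79.20052
```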

#### Residuals

To assess for bias within the model, we can use residuals. The below figure is a histogram of residuals from Model 3.

```
model3 = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man +
    cs00_seg_inc + inc_share_1perc + lf_d_2000_1980 +
    cs_labforce, data = final)
workResid = resid(model3)
hist(workResid, main = "Histogram of Model 3 Residuals")
```

The residuals are centered at zero and approximately normally distributed, with no major skew. However, it is important to further evaluate the residuals for a full picture of the model. The Residuals vs. Fitted plot in Figure 3.6 shows that the data is banded near 79 but evenly distributed above and below the zero line. This apparent grouping between 78 and 80 may result from three outliers that had not previously surfaced in the analysis. The Normal Quantile-Quantile plot shows residuals centered around the line, with the same outliers causing left skew. Despite the three outliers, the residuals still meet the assumptions needed for multiple regression.

# Conclusion

To conclude the results found through multiple regression, we must go back to our initial hypotheses:

**Geographic areas with less access to health insurance and healthcare as well as higher populations of unhealthy people, such as smokers and obese people, have decreased average life expectancies**

To evaluate this hypothesis, we look to our final health multiple regression model, which explains 28.73% of the variance in life expectancy, with an average error of 0.94 years, or about 343 days.

When testing the null hypothesis that factors such as the prevalence of health insurance, smoking, and obesity have no association with life expectancy, we reject the null hypothesis (F = 60.86, p < 2.2e-16): these factors account for 28.73% of the variance in life expectancy, consistent with the hypothesis.

**Geographic areas with lower labor force participation rates, declining economies, and higher populations working in manufacturing have decreased average life expectancies.**

Looking to our final labor multiple regression model, we see that it accounts for 17.56% of the variance in life expectancy, with a residual standard error of 1.012 years, or about 369 days.

When testing the null hypothesis that measures of economic strength and job quality have no association with life expectancy, we again reject the null hypothesis (F = 22.09, p < 2.2e-16): these factors account for 17.56% of the variance in average life expectancy, consistent with the hypothesis.

Humans are products of genetics, but also products of their environments. The models built here bear that out: measurable characteristics of a community's health care access and labor market explain a meaningful share of the variation in average life expectancy across commuting zones, supporting both of the hypotheses formulated at the outset.

# Appendix

```
# import libraries and data
## NEW DATASET
library(ggplot2)
library(Hmisc)
library(corrplot)
library(ggpubr)
library(leaps)
library(olsrr)
final = read.csv("~/desktop/lastSet.csv")
```

## Figure 2.1

```
## le_q1_both cur_smoke_q1 bmi_obese_q1 puninsured2010
## le_q1_both 1.00000000 -0.31722614 -0.15672168 -0.105220919
## cur_smoke_q1 -0.31722614 1.00000000 0.09334105 -0.033256781
## bmi_obese_q1 -0.15672168 0.09334105 1.00000000 0.081624526
## puninsured2010 -0.10522092 -0.03325678 0.08162453 1.000000000
## mort_30day_hosp_z -0.17842438 -0.01965585 0.19193549 0.296073941
## primcarevis_10 -0.14294904 0.15613247 0.23262032 0.004962686
## adjmortmeas_amiall30day -0.23354605 0.02288609 0.23348424 0.299229203
## reimb_penroll_adj10 -0.45946478 0.23855328 0.30372024 0.395783327
## adjmortmeas_pnall30day -0.20703622 0.02743093 0.24202417 0.269819905
## adjmortmeas_chfall30day 0.07861943 -0.11364151 -0.08627204 0.070133719
## mort_30day_hosp_z primcarevis_10
## le_q1_both -0.17842438 -0.142949039
## cur_smoke_q1 -0.01965585 0.156132474
## bmi_obese_q1 0.19193549 0.232620319
## puninsured2010 0.29607394 0.004962686
## mort_30day_hosp_z 1.00000000 0.064954040
## primcarevis_10 0.06495404 1.000000000
## adjmortmeas_amiall30day 0.78733012 0.069136537
## reimb_penroll_adj10 0.06539899 0.181809340
## adjmortmeas_pnall30day 0.76503413 0.107289716
## adjmortmeas_chfall30day 0.69470222 -0.048158934
## adjmortmeas_amiall30day reimb_penroll_adj10
## le_q1_both -0.23354605 -0.45946478
## cur_smoke_q1 0.02288609 0.23855328
## bmi_obese_q1 0.23348424 0.30372024
## puninsured2010 0.29922920 0.39578333
## mort_30day_hosp_z 0.78733012 0.06539899
## primcarevis_10 0.06913654 0.18180934
## adjmortmeas_amiall30day 1.00000000 0.23113923
## reimb_penroll_adj10 0.23113923 1.00000000
## adjmortmeas_pnall30day 0.37405466 0.17740663
## adjmortmeas_chfall30day 0.35523656 -0.33478321
## adjmortmeas_pnall30day adjmortmeas_chfall30day
## le_q1_both -0.20703622 0.07861943
## cur_smoke_q1 0.02743093 -0.11364151
## bmi_obese_q1 0.24202417 -0.08627204
## puninsured2010 0.26981990 0.07013372
## mort_30day_hosp_z 0.76503413 0.69470222
## primcarevis_10 0.10728972 -0.04815893
## adjmortmeas_amiall30day 0.37405466 0.35523656
## reimb_penroll_adj10 0.17740663 -0.33478321
## adjmortmeas_pnall30day 1.00000000 0.30486138
## adjmortmeas_chfall30day 0.30486138 1.00000000
```

## Figure 2.2 – Health MR Model 1

```
## Multiple Regression Model 1
hlth1 = lm(le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 +
    reimb_penroll_adj10 + mort_30day_hosp_z + primcarevis_10, data = final)
summary(hlth1)
```

```
##
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 +
## reimb_penroll_adj10 + mort_30day_hosp_z + primcarevis_10,
## data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5706 -0.6231 0.0322 0.5838 3.1122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.345e+01 6.256e-01 133.405 < 2e-16 ***
## cur_smoke_q1 -3.524e+00 6.122e-01 -5.756 1.39e-08 ***
## bmi_obese_q1 4.922e-01 5.739e-01 0.858 0.39142
## puninsured2010 2.690e+00 8.946e-01 3.007 0.00275 **
## reimb_penroll_adj10 -3.658e-04 3.366e-05 -10.865 < 2e-16 ***
## mort_30day_hosp_z -2.290e-01 4.399e-02 -5.206 2.67e-07 ***
## primcarevis_10 -5.114e-01 7.706e-01 -0.664 0.50720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9416 on 588 degrees of freedom
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2861
## F-statistic: 40.68 on 6 and 588 DF, p-value: < 2.2e-16
```

## Figure 2.3 – Health MR Model 2

```
## Multiple Regression Model 2
hlth2 = lm(le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 +
    reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
summary(hlth2)
```

```
##
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + bmi_obese_q1 + puninsured2010 +
## reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6091 -0.6354 0.0285 0.5806 3.1081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.309e+01 2.997e-01 277.249 < 2e-16 ***
## cur_smoke_q1 -3.568e+00 6.083e-01 -5.865 7.51e-09 ***
## bmi_obese_q1 4.259e-01 5.649e-01 0.754 0.45116
## puninsured2010 2.727e+00 8.924e-01 3.056 0.00235 **
## reimb_penroll_adj10 -3.682e-04 3.345e-05 -11.009 < 2e-16 ***
## mort_30day_hosp_z -2.303e-01 4.393e-02 -5.242 2.21e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9411 on 589 degrees of freedom
## Multiple R-squared: 0.2928, Adjusted R-squared: 0.2868
## F-statistic: 48.77 on 5 and 589 DF, p-value: < 2.2e-16
```

## Figure 2.4 – Health MR Model 3

```
## Multiple Regression Model 3
hlthmr3 = lm(le_q1_both ~ cur_smoke_q1 + puninsured2010 +
    reimb_penroll_adj10 + mort_30day_hosp_z, data = final)
summary(hlthmr3)
```

```
##
## Call:
## lm(formula = le_q1_both ~ cur_smoke_q1 + puninsured2010 + reimb_penroll_adj10 +
## mort_30day_hosp_z, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7311 -0.6328 0.0257 0.6009 3.0756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.316e+01 2.867e-01 290.072 < 2e-16 ***
## cur_smoke_q1 -3.560e+00 6.080e-01 -5.856 7.89e-09 ***
## puninsured2010 2.659e+00 8.876e-01 2.996 0.00285 **
## reimb_penroll_adj10 -3.607e-04 3.192e-05 -11.302 < 2e-16 ***
## mort_30day_hosp_z -2.236e-01 4.300e-02 -5.200 2.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9408 on 590 degrees of freedom
## Multiple R-squared: 0.2921, Adjusted R-squared: 0.2873
## F-statistic: 60.86 on 4 and 590 DF, p-value: < 2.2e-16
```

## Figure 2.5

## Figure 2.6

## Figure 3.1

```
corwork = cor(final[,c(7, 35, 36, 37, 38, 41, 42, 43, 44, 45, 46)])
corwork
```

```
## le_q1_both unemp_rate cs_elf_ind_man hhinc00
## le_q1_both 1.00000000 0.02683460 -0.13457732 0.15749949
## unemp_rate 0.02683460 1.00000000 0.09981988 -0.40651943
## cs_elf_ind_man -0.13457732 0.09981988 1.00000000 -0.06640949
## hhinc00 0.15749949 -0.40651943 -0.06640949 1.00000000
## cs00_seg_inc -0.10975250 -0.19492736 -0.26342686 0.53429839
## gini99 -0.09618248 0.32751141 -0.15013276 -0.24749427
## poor_share -0.06992382 0.53503545 -0.12516182 -0.74643654
## inc_share_1perc 0.12002690 -0.06919677 -0.09526483 0.37326964
## lf_d_2000_1980 0.25758086 -0.03901457 -0.22666968 0.30024574
## cs_labforce 0.10753993 -0.55911576 0.04707427 0.56591420
## frac_traveltime_lt15 0.10198220 -0.10299322 -0.13503744 -0.33073218
## cs00_seg_inc gini99 poor_share inc_share_1perc
## le_q1_both -0.1097525 -0.09618248 -0.06992382 0.12002690
## unemp_rate -0.1949274 0.32751141 0.53503545 -0.06919677
## cs_elf_ind_man -0.2634269 -0.15013276 -0.12516182 -0.09526483
## hhinc00 0.5342984 -0.24749427 -0.74643654 0.37326964
## cs00_seg_inc 1.0000000 0.24796714 -0.18248018 0.44173253
## gini99 0.2479671 1.00000000 0.66645370 0.22414544
## poor_share -0.1824802 0.66645370 1.00000000 -0.04099467
## inc_share_1perc 0.4417325 0.22414544 -0.04099467 1.00000000
## lf_d_2000_1980 0.2973579 0.13360412 -0.15260207 0.31959125
## cs_labforce 0.3684665 -0.52046954 -0.73138557 0.09401360
## frac_traveltime_lt15 -0.5548548 -0.46882961 -0.01549800 -0.40880016
## lf_d_2000_1980 cs_labforce frac_traveltime_lt15
## le_q1_both 0.25758086 0.10753993 0.10198220
## unemp_rate -0.03901457 -0.55911576 -0.10299322
## cs_elf_ind_man -0.22666968 0.04707427 -0.13503744
## hhinc00 0.30024574 0.56591420 -0.33073218
## cs00_seg_inc 0.29735795 0.36846653 -0.55485478
## gini99 0.13360412 -0.52046954 -0.46882961
## poor_share -0.15260207 -0.73138557 -0.01549800
## inc_share_1perc 0.31959125 0.09401360 -0.40880016
## lf_d_2000_1980 1.00000000 0.17409963 -0.33991876
## cs_labforce 0.17409963 1.00000000 0.06786871
## frac_traveltime_lt15 -0.33991876 0.06786871 1.00000000
```

## Figure 3.2

```
modelCheck = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man + hhinc00 +
cs00_seg_inc + gini99 + poor_share + inc_share_1perc + lf_d_2000_1980 +
cs_labforce + frac_traveltime_lt15, data = final)
vif(modelCheck)
vifVal = vif(modelCheck)
View(vifVal)
summary(vifVal)
```

## Figure 3.3

```
model1 = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man +
cs00_seg_inc + gini99 + inc_share_1perc + lf_d_2000_1980 +
cs_labforce + frac_traveltime_lt15, data = final)
summary(model1)
```

```
##
## Call:
## lm(formula = le_q1_both ~ unemp_rate + cs_elf_ind_man + cs00_seg_inc +
## gini99 + inc_share_1perc + lf_d_2000_1980 + cs_labforce +
## frac_traveltime_lt15, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4725 -0.6288 -0.0083 0.6467 2.7766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.41354 1.01783 74.093 < 2e-16 ***
## unemp_rate 11.10321 3.31893 3.345 0.000874 ***
## cs_elf_ind_man -2.71086 0.68090 -3.981 7.72e-05 ***
## cs00_seg_inc -13.15103 2.26098 -5.817 9.89e-09 ***
## gini99 0.03264 1.10699 0.029 0.976485
## inc_share_1perc 5.16104 1.21022 4.265 2.33e-05 ***
## lf_d_2000_1980 0.93740 0.17942 5.224 2.43e-07 ***
## cs_labforce 5.39895 1.23034 4.388 1.36e-05 ***
## frac_traveltime_lt15 0.22662 0.60351 0.376 0.707421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 586 degrees of freedom
## Multiple R-squared: 0.1841, Adjusted R-squared: 0.173
## F-statistic: 16.53 on 8 and 586 DF, p-value: < 2.2e-16
```

## Figure 3.4

```
model2 = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man +
cs00_seg_inc + inc_share_1perc + lf_d_2000_1980 +
cs_labforce + frac_traveltime_lt15, data = final)
summary(model2)
```

```
##
## Call:
## lm(formula = le_q1_both ~ unemp_rate + cs_elf_ind_man + cs00_seg_inc +
## inc_share_1perc + lf_d_2000_1980 + cs_labforce + frac_traveltime_lt15,
## data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4753 -0.6282 -0.0071 0.6460 2.7762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.4357 0.6856 110.031 < 2e-16 ***
## unemp_rate 11.1072 3.3133 3.352 0.000853 ***
## cs_elf_ind_man -2.7124 0.6784 -3.998 7.19e-05 ***
## cs00_seg_inc -13.1339 2.1831 -6.016 3.15e-09 ***
## inc_share_1perc 5.1625 1.2081 4.273 2.25e-05 ***
## lf_d_2000_1980 0.9377 0.1790 5.239 2.26e-07 ***
## cs_labforce 5.3803 1.0535 5.107 4.43e-07 ***
## frac_traveltime_lt15 0.2227 0.5885 0.378 0.705212
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.013 on 587 degrees of freedom
## Multiple R-squared: 0.1841, Adjusted R-squared: 0.1744
## F-statistic: 18.93 on 7 and 587 DF, p-value: < 2.2e-16
```

## Figure 3.5

```
model3 = lm(le_q1_both ~ unemp_rate + cs_elf_ind_man +
cs00_seg_inc + inc_share_1perc + lf_d_2000_1980 +
cs_labforce, data = final)
summary(model3)
```

```
##
## Call:
## lm(formula = le_q1_both ~ unemp_rate + cs_elf_ind_man + cs00_seg_inc +
## inc_share_1perc + lf_d_2000_1980 + cs_labforce, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4615 -0.6400 -0.0004 0.6411 2.7640
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.4800 0.6750 111.820 < 2e-16 ***
## unemp_rate 11.1249 3.3105 3.360 0.000829 ***
## cs_elf_ind_man -2.8415 0.5859 -4.850 1.59e-06 ***
## cs00_seg_inc -13.6698 1.6605 -8.232 1.19e-15 ***
## inc_share_1perc 5.1033 1.1971 4.263 2.35e-05 ***
## lf_d_2000_1980 0.9138 0.1674 5.460 7.04e-08 ***
## cs_labforce 5.5504 0.9521 5.830 9.17e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.012 on 588 degrees of freedom
## Multiple R-squared: 0.1839, Adjusted R-squared: 0.1756
## F-statistic: 22.09 on 6 and 588 DF, p-value: < 2.2e-16
```

## Figure 3.6

`plot(model3)`

RPubs – Life Expectancy Multiple Regression Project, written by **Simran Cheema**
