What is the deadliest heart disease?
The term “heart disease” refers to several types of heart conditions, including blood vessel disease, heart rhythm problems, congenital heart defects, heart valve disease, disease of the heart muscle, and heart infection. The most common type of heart disease in the United States is coronary artery disease, a blood vessel disease that affects blood flow to the heart. As science continues to evolve, so does health care, and heart disease has been no exception to these advancements.
Over the last 50 years, the occurrence of heart disease has been declining overall. Even with this decline, it remains the leading cause of death for men, women, and people of most racial and ethnic groups, accounting for about 1 in every 4 deaths, or roughly 655,000 people each year in the United States. Seeing such a drastic share of American deaths led our group to take an interest in learning more about heart disease and its most important predictors through our dataset.
The first question Group 9 posed was “Can we make a model to predict the condition of heart disease in a patient based on the data set?” Most of our original questions revolved around the variables and how they were correlated with the condition. We chose this question because finding the model with the best predictions for heart disease would show us exactly which variables influence the disease the most.
Moreover, our group became interested in creating this model because identifying the best predictive variables streamlines the prediction process: as uninformative variables are removed, the results become more accurate with fewer variables required. The owner of this data might find the model useful when collecting new data on the occurrence of heart disease, since they could compile data only on the variables we find to influence the condition prediction the most. The model would also allow people to improve, where they can, certain variables within themselves to reduce the chance of developing heart disease.
The second question Group 9 posed was “Is it possible to accurately predict the maximum heart rate based on the existing predictors in the data set?” Our fascination with finding the best model to predict the condition led us to pose another question about model prediction. In our exploratory data analysis, we had many questions revolving around the maximum heart rate variable.
One of our original questions found that age seemed to be negatively correlated with maximum heart rate. This led Group 9 to wonder which variables, if any, would make it possible to predict a patient's maximum heart rate. If there is a way to accurately predict maximum heart rate, it will be beneficial in creating more ways to predict each variable. If there isn't, then this shows that not every variable can be predicted accurately enough to be useful.
By answering these questions, we may come closer to understanding how heart disease is caused and how we can use data to predict other variables. These questions allow us to dive deeper into the data and discuss our findings at a greater level. Heart disease is unfortunately not going away anytime soon, so these predictive models will be extremely useful and informative to those studying the condition. Streamlining and consolidating the variables leads to greater analysis within the remaining predictors, saving time and lives for years to come.
The data that our group decided to use was found on Kaggle. It was initially collected for the UCI Machine Learning Repository and originally included 76 variables; the version we used includes only 14 of them. The creators of this source are Andras Janosi, M.D., from the Hungarian Institute of Cardiology in Budapest; William Steinbrunn, M.D., from University Hospital in Zurich, Switzerland; Matthias Pfisterer, M.D., from University Hospital in Basel, Switzerland; and Robert Detrano, M.D., Ph.D., from the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation. The data was collected by these four individuals at their separate locations. The dataset has 297 observations, and because of the nature of our questions, specifically the first one, where we want to find the best model to predict whether a patient has heart disease, we used all of the variables in our investigation.
The following describes each variable and what it means in the real world. ‘age’ is straightforward: it is how many years the patient has been alive; in our dataset the ages range from 29 to 77.
‘sex’ is a categorical variable in the initial data, with 1 being male and 0 being female. ‘cp’ is a categorical variable for the type of chest pain an individual has, with 4 types: 0 means typical angina, 1 means atypical angina, 2 means non-anginal pain, and 3 means asymptomatic. ‘trestbps’ is the resting blood pressure in mm Hg when the patient was initially admitted to the hospital. ‘chol’ is the serum cholesterol in mg/dl. ‘fbs’ is a binary variable that is 1 if the individual's fasting blood sugar was greater than 120 mg/dl and 0 if it was less. ‘restecg’ is a categorical variable describing the resting electrocardiographic results, where 0 is a normal result, 1 means there is an abnormality in the ST-T wave, and 2 means there is probable or definite left ventricular hypertrophy. ‘thalach’ is the maximum heart rate achieved by the patient. ‘exang’ is a binary variable where 1 means the patient has exercise-induced angina and 0 means they do not. ‘oldpeak’ is the measure of ST depression induced by exercise relative to rest. ‘slope’ is a categorical variable for the slope of the peak exercise ST segment, where 0 means an upward slope, 1 means flat, and 2 means a downward slope. ‘ca’ is the number, from 0 to 3, of major vessels colored by fluoroscopy. ‘thal’ is a categorical variable where 0 means that blood flow is normal, 1 means that a fixed defect in the blood flow was found, and 2 means that a reversible defect was found. And lastly, ‘condition’ is a binary variable where 1 indicates that the patient has heart disease and 0 means that they do not.
This table has been adjusted so that the numbers used for categorical variables in the initial data set are replaced by what those numbers actually represent. For instance, ‘male’ and ‘female’ are now the values of the sex variable instead of 1 and 0, respectively. The table gives a brief glimpse of the data, showing only the first 6 patients from the data set, which allows us to get an idea of what some of the patients are like.
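The recoding described above can be sketched in a few lines. This is an illustrative Python/pandas analogue (the report's analysis appears to be in R); the column names follow the dataset's variables, but the two rows of data here are made up for demonstration.

```python
import pandas as pd

# Toy frame with the dataset's integer codes (rows invented for illustration).
codes = pd.DataFrame({
    "sex": [1, 0],
    "cp": [0, 2],
    "condition": [1, 0],
})

# Map each code to the label it represents, per the variable descriptions.
labels = {
    "sex": {1: "male", 0: "female"},
    "cp": {0: "typical angina", 1: "atypical angina",
           2: "non-anginal pain", 3: "asymptomatic"},
    "condition": {1: "heart disease", 0: "no heart disease"},
}

readable = codes.replace(labels)
print(readable.loc[0, "sex"])  # male
```

A nested dict passed to `replace()` applies each mapping only to its own column, so numeric codes that coincide across variables (e.g. 0 and 1) are decoded independently.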
This figure illustrates the difference in maximum heart rate between male and female patients and between patients with and without heart disease. Patients with heart disease had, on average, a lower maximum heart rate than patients without heart disease, regardless of gender. This figure shows that there could be a relationship between maximum heart rate and the presence of heart disease.
To answer the first question about finding the best model to predict the condition of heart disease, we first randomly split the data, with about 80% going into the training set (241 observations) and about 20% into the testing set (56 observations), using a seed of 320. Since our response variable is a binary categorical variable, we build logistic regression models on the training set and evaluate them on the testing set. We consider the remaining 13 variables to be potential predictors.
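The split step can be sketched as follows. This is a hedged Python analogue using scikit-learn (the report splits in R with seed 320; random seeds are not portable across languages, so the exact 241/56 partition will differ). The feature matrix here is synthetic stand-in data with the dataset's shape of 297 rows and 13 predictors.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 297-row dataset: 13 predictors plus a
# binary 'condition' response (values invented for illustration).
rng = np.random.default_rng(320)
X = rng.normal(size=(297, 13))
y = rng.integers(0, 2, size=297)

# ~80% train / ~20% test, with a fixed seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=320)

print(len(X_train), len(X_test))
```

Fixing the seed matters because every later comparison (accuracy, sensitivity, specificity) is computed on this particular held-out set.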
Furthermore, to select the best model from these predictors, we use the bestglm() function in the bestglm package, which selects the top models with the lowest Bayesian Information Criterion (BIC) scores. The BIC provides an estimate of model performance from a Bayesian statistics viewpoint. It is important to realize that the BIC criterion selects a range of reasonable models, which should be viewed as guidelines rather than absolute rules. Therefore, we fit the training models selected by bestglm() on the testing set and compare the accuracy, sensitivity, specificity, false positive, and false negative rates of the predictions to select the best model.
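The quantity bestglm() ranks by is BIC = k·ln(n) − 2·ln(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood. A minimal Python sketch of that computation for one candidate logistic model, on synthetic data (not the report's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary-response data standing in for one candidate model's
# training set (the report fits on 241 real observations).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Large C approximates an unpenalized (maximum-likelihood) logistic fit.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
p = np.clip(model.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)

# Log-likelihood of the fitted model, then BIC = k*ln(n) - 2*ln(L).
log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
k = X.shape[1] + 1                 # slope coefficients plus intercept
bic = k * np.log(n) - 2 * log_lik
print(round(bic, 2))
```

The k·ln(n) term is why BIC favors smaller models: each extra predictor must improve the likelihood enough to pay a penalty that grows with the sample size.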
After applying bestglm() to all 13 potential predictors, we obtain the 5 models with the lowest BIC scores, as shown in the table. A mark indicates that the model includes that predictor; the predictors not shown in the table are not included by any of the 5 models. We can see that all 5 models include sex, chest pain, slope, major vessels, and thal as predictors.
Now, with the models selected, we calculate the accuracy, sensitivity, specificity, false positive, and false negative rates based on the predictions obtained by fitting the training models on the testing set; the comparison table is shown below. We can see that Model2 has the highest accuracy, sensitivity, and specificity, along with the lowest false positive and false negative rates. Its accuracy of about 0.839 is relatively high and indicates good prediction.
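All five comparison metrics come from the same confusion-matrix counts. A small self-contained sketch (the label vectors here are invented; in the report they would be each model's test-set predictions versus the true conditions):

```python
import numpy as np

# Invented actual vs. predicted condition labels (1 = heart disease).
actual    = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
predicted = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])

# Confusion-matrix counts.
tp = np.sum((predicted == 1) & (actual == 1))   # true positives
tn = np.sum((predicted == 0) & (actual == 0))   # true negatives
fp = np.sum((predicted == 1) & (actual == 0))   # false positives
fn = np.sum((predicted == 0) & (actual == 1))   # false negatives

accuracy    = (tp + tn) / len(actual)   # overall fraction correct
sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
fpr = fp / (fp + tn)                    # false positive rate = 1 - specificity
fnr = fn / (fn + tp)                    # false negative rate = 1 - sensitivity
```

For a screening context like heart disease, the false negative rate is arguably the costliest of the five, since it counts sick patients the model would miss.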
The following visual shows the ROC curve for Model2; it demonstrates the performance of the model under different classification thresholds (cutoffs). The area under the curve is 0.936, which is very high and indicates good performance. Based on the comparison table and the area under the ROC curve, we choose Model2 as our best model to predict the condition of heart disease. Model2 includes the predictors sex, chest pain, exercise-induced angina, slope, major vessels, and thal; its coefficient estimates are shown in the table below as well.
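The area under the ROC curve can be read as the probability that a randomly chosen disease case receives a higher predicted probability than a randomly chosen non-case. A toy sketch with invented scores (the report's Model2 reaches 0.936 on the real test set):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented test labels and predicted probabilities for illustration.
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.9, 0.6, 0.4, 0.7, 0.2])

# AUC sweeps over all classification cutoffs at once, so it compares models
# without committing to a single threshold such as 0.5.
auc = roc_auc_score(actual, scores)
print(auc)  # 1.0 - in this toy data every case outranks every non-case
```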
To answer the second question, “Is it possible to accurately predict the maximum heart rate?”, we first reused the training set from Question 1, mutating it into TrainQ2 so that the response variable is the last column of the input data frame. Since our response variable is numerical, we build linear regression models on the training set and evaluate them on the testing set. To select the best model from these predictors, we again use the bestglm() function in the bestglm package, which selects the top models with the lowest BIC scores.
After applying bestglm() to all 13 potential predictors, we obtain the 5 models with the lowest BIC scores, as shown in the table. A mark indicates that the model includes that predictor; the predictors not shown in the table are not included by any of the 5 models. We can see that all 5 models include age, resting blood pressure, serum cholesterol, exercise-induced angina, slope, and condition as predictors.
Now, with the models selected, we calculate the MSE and MAE based on the predictions obtained by fitting the training models on the testing set; the comparison table is shown below. For both MSE and MAE, a lower value indicates a better model, and with the lowest MSE and MAE, Model3 fits the data best among these models. Model3 includes the predictors age, resting blood pressure, exercise-induced angina, slope, and condition.
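The two regression metrics are simple averages over the test-set errors. A minimal sketch with invented heart-rate values (not the report's data):

```python
import numpy as np

# Invented actual vs. predicted maximum heart rates for four test patients.
actual    = np.array([150.0, 160.0, 170.0, 140.0])
predicted = np.array([155.0, 150.0, 175.0, 145.0])

errors = predicted - actual
mse = np.mean(errors ** 2)        # mean squared error: (25+100+25+25)/4 = 43.75
mae = np.mean(np.abs(errors))     # mean absolute error: (5+10+5+5)/4 = 6.25
```

Because MSE squares each error, a single badly predicted patient (the 10-beat miss above) dominates it, while MAE weights all misses linearly; comparing both, as the report does, guards against either distortion.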
The following visual shows the fitted vs. actual plot for these five models. From the previous table, we found that Model3 is slightly better than the other models; however, the fitted vs. actual plot shows no big difference between them. Moreover, the points from each model do not cluster around the red line, which indicates that none of the five models fit the data well. Since linear regression cannot predict the maximum heart rate well, we hope to use other models to predict it accurately.
We then decided to use a shrinkage method to build a better model, using different alpha values to calculate lambda and MSE values. As mentioned before, a model with a lower MSE value is better. We find that all models from the shrinkage method have smaller MSE values than those of the linear regression models. However, the lowest MSE among the shrinkage models is still nearly 400, which is quite large.
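The shrinkage step can be sketched as follows. This is a hedged Python analogue of the glmnet-style workflow the alpha/lambda terminology suggests: for each mixing value alpha (called `l1_ratio` in scikit-learn; near 0 behaves like ridge, 1 is the lasso), cross-validation chooses the penalty strength lambda, and test MSE is compared across fits. The data below are synthetic, not the heart-rate regression.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic regression data: 13 predictors, only 3 truly informative,
# standing in for the maximum-heart-rate problem.
rng = np.random.default_rng(320)
X = rng.normal(size=(240, 13))
beta = np.zeros(13)
beta[:3] = [5.0, -3.0, 2.0]
y = X @ beta + rng.normal(scale=2.0, size=240)

# For each mixing value, cross-validation picks the penalty strength
# (glmnet's lambda, exposed here as model.alpha_); compare held-out MSE.
for l1_ratio in (0.1, 0.5, 1.0):
    model = ElasticNetCV(l1_ratio=l1_ratio, cv=5).fit(X[:200], y[:200])
    mse = np.mean((model.predict(X[200:]) - y[200:]) ** 2)
    print(l1_ratio, round(model.alpha_, 4), round(mse, 2))
```

Shrinkage can reduce variance and hence MSE, as the report observed, but it cannot manufacture signal: if the predictors carry little information about the response, the penalized MSE stays high, which matches the nearly-400 floor found here.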
For our first question, we intended to find the best model to predict whether a person has heart disease or not. The best model turned out to be a logistic regression model that includes sex, chest pain, exercise-induced angina, slope, major vessels, and thal as predictors, with an accuracy rate of 83.92%. This model did not surprise us or come across as unusual, because these predictors are common indicators of heart conditions in medical observation. In fact, all of our best 5 models contain sex, chest pain, slope, major vessels, and thal as predictors.
Moreover, this result shows how important it is to monitor physical indicators, since things like chest pain and thal are observable in daily life. The fact that this result came from a statistical perspective rather than from complex medical theory, and that anyone with a similar dataset could reach similar conclusions about the importance of various physical indicators, makes the idea relatable. If we had the chance to research this topic further, we would want a new dataset containing our model's predictors as variables, so we could apply the model to new data in order to predict people's heart conditions and give them suggestions and warnings.
For our second question, we intended to find out whether it is possible to accurately predict the maximum heart rate based on the existing predictors in the data set. We first applied bestglm() to the dataset, and the result showed that the model with the predictors age, resting blood pressure, exercise-induced angina, slope, and condition was the best. However, even though this is the best linear regression model among all those bestglm() gave us, it still could not predict the maximum heart rate very well. We then tried the shrinkage method, hoping to improve our model.
We found that although all models generated by the shrinkage method have smaller MSE values than the linear regression models, the lowest MSE among the shrinkage models was still too high for the model to fit the actual data well. Therefore, we conclude that because the dataset was built around the heart condition variable, we cannot use the same set of variables to accurately predict the maximum heart rate. We wonder whether, with a bigger dataset containing more interaction terms, we could predict the maximum heart rate; this question can be investigated further in the future once we have an ideal dataset.