Final Project: Selecting Models to Predict CHD

Ziheng Pan
3 min read · May 8, 2021

I selected a dataset from a cardiovascular study of residents of the town of Framingham, Massachusetts. My primary goal was to find an optimal model for predicting whether a patient has a 10-year risk of future coronary heart disease (CHD).

The dataset has over 4,000 observations, with 15 attributes and 1 target variable (TenYearCHD). All of the features are already numeric, so I did not have to generate dummy variables.

I first performed data cleaning. Some observations contained NA entries, so I dropped them. I then normalized the ranges of the features in the dataset.

number of missing values in each feature
the normalized dataset
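
Concretely, the cleaning and normalization might look like the following sketch (the file name framingham.csv and the choice of min-max scaling are assumptions for illustration; the post above only says the ranges were normalized):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the Framingham data (file name is an assumption)
df = pd.read_csv("framingham.csv")

# Drop observations that contain any NA entries
df = df.dropna()

# Scale every feature to the [0, 1] range; min-max scaling is an assumption
features = df.drop(columns=["TenYearCHD"])
X = pd.DataFrame(MinMaxScaler().fit_transform(features), columns=features.columns)
y = df["TenYearCHD"]
```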

I split the data into training and test sets with a test size of 30% and got a baseline accuracy of 0.848. After that, a logistic regression model yielded a slightly higher accuracy of 0.859.

the confusion matrix of the logistic regression model
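
The split and the first two results can be reproduced along these lines (random_state and max_iter are assumptions for reproducibility, not values from the post):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# 70/30 train/test split; random_state is an assumption
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline: accuracy of always predicting the majority class (no CHD)
baseline = y_test.value_counts(normalize=True).max()
print(f"baseline accuracy: {baseline:.3f}")

# Logistic regression on the normalized features
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = logreg.predict(X_test)
print(f"logistic regression accuracy: {accuracy_score(y_test, pred):.3f}")
print(confusion_matrix(y_test, pred))
```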

I then built a classification tree, but it gave a lower accuracy of 0.846. I looked at this model's feature importances and displayed the top three. The leading one is age, with an importance of 0.4, which suggests the model might not fit very well, since age should not be that influential for CHD in reality.

the confusion matrix of the decision tree model
feature importances of the decision tree model
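
A minimal version of the tree and its importance ranking, continuing from the split above (the tree's hyperparameters are assumptions; the post does not list them):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Classification tree; default settings are an assumption
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print(f"decision tree accuracy: {tree.score(X_test, y_test):.3f}")

# Top three features by importance
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(3))
```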

Next, I built a generic bagging ensemble, which gave the lowest accuracy yet, 0.836. The random forest and AdaBoost models produced similar accuracy scores (0.844 vs. 0.840), still not desirable. But looking at their feature importances, age has now dropped out of the top three most influential features, which might suggest an improvement in model fit.

feature importances of the AdaBoost model
feature importances of the random forest model
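
The three ensembles might be built as follows (default hyperparameters are an assumption):

```python
import pandas as pd
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)

# Defaults are an assumption; the post does not list hyperparameters
models = {
    "bagging": BaggingClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name} accuracy: {model.score(X_test, y_test):.3f}")

# Top three features for the forest and boosted models
for name in ("random forest", "adaboost"):
    importances = pd.Series(models[name].feature_importances_, index=X.columns)
    print(name, importances.sort_values(ascending=False).head(3), sep="\n")
```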

I then tried a voting ensemble of RandomForestClassifier, DecisionTreeClassifier, a support vector machine, and logistic regression, but the accuracy score still failed to reach 0.85 (0.849).
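
A voting ensemble over those four estimators could look like this sketch (hard voting and all hyperparameters are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Hard voting over the four classifiers named above;
# the voting scheme and hyperparameters are assumptions
voting = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(random_state=42)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("svm", SVC(random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
])
voting.fit(X_train, y_train)
print(f"voting ensemble accuracy: {voting.score(X_test, y_test):.3f}")
```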

When I switched to a neural network model, the accuracy score reached its highest value (0.860), which motivated me to select this model.
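
Scikit-learn's MLPClassifier is one way to fit such a network; the hidden-layer sizes below are assumptions, since only the accuracy is reported above:

```python
from sklearn.neural_network import MLPClassifier

# Small feed-forward network; the architecture is an assumption
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=42)
mlp.fit(X_train, y_train)
print(f"neural network accuracy: {mlp.score(X_test, y_test):.3f}")
```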

I then adjusted the parameters of the neural network, setting alpha to 1e-5, which yielded a slightly lower accuracy (0.854). I realized there might be some over-fitting in the model. Therefore, I simplified it with two dropouts of 10% probability at the hidden layers, which produced a lower loss than the version without dropout. With the dropout in place, I consider the model optimal to use.
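
MLPClassifier does not support dropout, so the dropout variant suggests a framework like Keras. Below is a minimal sketch under that assumption; the layer sizes and training settings are illustrative, and only the two 10% dropouts come from the description above:

```python
from tensorflow import keras

# Two hidden layers, each followed by 10% dropout (the dropout rate is
# from the post; layer sizes and training settings are assumptions)
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train.values, y_train.values, epochs=50, batch_size=32, verbose=0)
loss, acc = model.evaluate(X_test.values, y_test.values, verbose=0)
print(f"test loss: {loss:.3f}, test accuracy: {acc:.3f}")
```

Comparing the test loss of this model against the same architecture without the Dropout layers is what supports keeping them.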
