
Can you predict the housing market?

The dataset for our research was retrieved directly from the corresponding Kaggle competition. After preprocessing and cleaning the dataset, we implemented different regression models and analyzed their performance. The models used in this project include Linear Regression, Random Forest Regression, and Gradient Boosting with GridSearchCV.

1. Introduction 

For a home buyer, what are the likely  factors that will influence their decision?  How much will the client be willing to pay?  Failure to accurately predict house prices  may lead to economic loss in real estate  investment and lower efficiency in housing sales.  

We analyzed a dataset of 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, and aimed to identify the most influential yet usually neglected factors in the house price evaluation process.

The Kaggle source contains two datasets: train.csv and test.csv. Each file has 1460 entries; train.csv has 81 columns (38 numerical and 43 categorical), while test.csv has 80 (37 numerical and 43 categorical). The dependent variable is 'SalePrice' and is present only in the training file.

One of the biggest tasks of this project is to consolidate these various features and reduce noise. This is done first during the data pre-processing phase and later revisited using the feature importance list of the Gradient Boosting regressor.

Another goal of this project is to compare different machine learning models  and their results, while finding the best one  to perform our task.  

In the end, we looked at our initial  assumption of the most important features  and those given by our models. While some  similarities do certainly exist, the differences  (given by the model) could be our hidden  gems (i.e., the neglected indicators of the  housing sale price).

2. Data Cleaning and Visualization

The dataset contains 1460 rows and 80 columns. Nineteen columns have missing values. In addition, there are 43 categorical variables that need to be transformed into dummy variables, and some variables have a low correlation with the dependent variable, which likely indicates low predictive power; these should be dropped. We will make these modifications before we move on to actual modeling and analysis.

2.1 Missing Value and Dummy Variables

Since we are short on sample size, we  want to retain as much information as we  can. Therefore, we separated the missing  values into two cases: meaningful and meaningless.  

For variables where a NaN indicates the absence of a feature (e.g., no garage), we converted the missing values to either strings (i.e., "NA") or integers (i.e., 0), depending on whether the variable is categorical or numerical.

For variables where the meaning of a NaN is not determined, we simply used the mean value for the numerical ones; for the categorical ones, since it was difficult to pick a sensible replacement and only a few rows (~10) were affected, we dropped those rows. We ended up with 1459 non-null rows.
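The two-case imputation above can be sketched with pandas. The toy frame and column names below are illustrative stand-ins, not the actual Ames data:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Ames data (values are made up)
df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd"],   # NaN means "no garage" (meaningful)
    "GarageArea": [548.0, np.nan, 460.0],          # NaN means "no garage" (meaningful)
    "LotFrontage": [65.0, np.nan, 80.0],           # NaN of unknown meaning
})

# Case 1: meaningful NaN -> "NA" for categorical, 0 for numerical
df["GarageType"] = df["GarageType"].fillna("NA")
df["GarageArea"] = df["GarageArea"].fillna(0)

# Case 2: meaningless NaN in a numerical column -> mean imputation
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())
```

For a meaningless NaN in a categorical column, the remaining option per the text is simply dropping the row with `df.dropna()`.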

Finally, to convert the categorical  variables to numerical, we created 266 dummy variables and ended with 303  columns in total. 
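The dummy-variable conversion can be done with `pd.get_dummies`; a minimal sketch on made-up columns:

```python
import pandas as pd

df = pd.DataFrame({
    "HouseStyle": ["1Story", "2Story", "1Story"],  # categorical (illustrative)
    "GrLivArea": [856, 1710, 920],                  # numerical, passes through
})

# One-hot encode every categorical column; numerical columns are untouched
df = pd.get_dummies(df)
```

Applied to all 43 categorical columns, this is the step that expands the frame to the 303 columns mentioned above.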

2.2 Examining Correlations 

Variables that have a low correlation  with the DV will bring in additional noise  during analysis and are thus undesirable.  

As shown by the correlation heat map, most of the column pairs have very low correlations.

We dropped all columns with a correlation of less than 0.05 with SalePrice (our DV); 110 variables were dropped, leaving 193 variables with relatively higher predictive power.
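The correlation filter described above might look like this in pandas (synthetic columns; the 0.05 threshold is the one from the text):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "OverallQual": rng.normal(size=n),   # strong predictor (synthetic)
    "Noise": rng.normal(size=n),         # unrelated column (synthetic)
})
df["SalePrice"] = 3 * df["OverallQual"] + rng.normal(scale=0.1, size=n)

# Keep only columns whose absolute correlation with SalePrice is >= 0.05
corr = df.corr()["SalePrice"].abs()
keep = corr[corr >= 0.05].index
df = df[keep]
```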

3. Modeling and Analysis 

For all machine learning models in our analysis, we used 70% of train.csv for training and the remaining 30% to test and refine our models. The models include Linear Regression, Random Forest Regression, and Gradient Boosting, with GridSearchCV for hyperparameter tuning.
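The 70/30 split can be sketched with scikit-learn's `train_test_split` (the data here is a synthetic stand-in for the preprocessed frame):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1459, 5))   # stand-in for the 193 predictors
y = rng.normal(size=1459)        # stand-in for SalePrice

# Hold out 30% of train.csv for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```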

We used the root mean squared logarithmic error (RMSLE) and the R2 score to compare model performance. After finding the best model, we applied PCA to try to reduce the number of features.
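Both metrics are available in scikit-learn; a small sketch on made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error, r2_score

# Illustrative sale prices and predictions, not real results
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 310000.0])

# RMSLE = square root of the mean squared log error
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)
```

Because RMSLE works on log-scaled values, it penalizes relative (percentage) errors rather than absolute dollar errors, which suits a target as skewed as sale price.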

We found that the Linear Regression model had an R2 score of 0.41, while Random Forest and Gradient Boosting both had higher R2 scores: 0.85 and 0.89, respectively.

For both the Random Forest Regressor  and the Gradient Boosting Regressor, we  implemented GridSearchCV to find the  optimal set of parameters for the models,  which were able to improve our results  slightly. Finally, we also tried PCA and  successfully reduced the number of features.  

3.1 Linear Regression 

First, we used the Linear Regression  model to fit the data and tried to directly use all the numerical and categorical variables to  explain the house sales price.  

The linear regression model returned an  R2 of 0.41, which is not ideal. We quickly  moved on to the random forest models. 
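A minimal sketch of fitting scikit-learn's `LinearRegression` and reading off its R2 (synthetic data, not the Ames set):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Synthetic near-linear target so the fit is well-behaved
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)   # .score() returns the R2 for regressors
```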

3.2 Random Forest Regression 

Random Forest is an ensemble method in which each tree is created using only a subset of the predictors (i.e., attributes). This reduces overfitting (since each tree sees only part of the data) and thus improves the accuracy of the entire model.

For Random Forest, the R2 obtained is 0.85 (average of 50 runs), which is much better than that of the LR model. The root mean squared logarithmic error is 0.15.
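A sketch of a Random Forest fit and evaluation in scikit-learn, on synthetic data (the hyperparameters shown are defaults, not the tuned values from our runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
# Synthetic nonlinear target that trees can capture
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
r2 = rf.score(X_te, y_te)   # out-of-sample R2
```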

3.2.1 GridSearchCV 

Using GridSearchCV, we tried different  values for certain parameters of the Random  Forest Regressor, including the number of  estimators, maximum depth of tree,  minimum sample size required to split at an  internal node, and the minimum size of a  leaf node. The optimal set of parameters  successfully improved our R2 to 0.87. 
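A sketch of the grid search over those four parameters (the grid values below are illustrative, not the ones we actually searched):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)

# The four tuned parameters named in the text (candidate values are examples)
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}
# Default scoring for a regressor is the R2 score
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
best_rf = search.best_estimator_
```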

3.3 Gradient Boosting Regression

Gradient boosting is another popular ensemble method. The main difference between Gradient Boosting and Random Forest is that Gradient Boosting builds each new tree to correct the errors of the older ones, while Random Forest simply builds each tree independently.

With careful tuning of parameters,  Gradient Boosting should end up with  higher performance (e.g., accuracy in our  case) than Random Forest. 

The R2 of the Gradient Boosting Regressor is 0.88, which is indeed higher than that of the Random Forest. The root mean squared logarithmic error is 0.13, which is also better.
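A minimal Gradient Boosting sketch in scikit-learn, again on synthetic data (hyperparameters illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
# Each of the 200 trees is fit to the residual errors of the ensemble so far
gb = GradientBoostingRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
r2 = gb.score(X_te, y_te)
```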

After using GridSearchCV in a similar  fashion, we were able to improve our R2  score to 0.89. 

3.4 Feature Importance and PCA

The feature importance of the Random Forest model is shown in Appendix 1. We can see that overall quality, ground living area, and second-floor area are among the most important features in predicting prices. The feature importance of the Gradient Boosting Regressor, on the other hand, is shown in Appendix 2. Here, overall quality still plays the most important role in estimating house prices, which makes intuitive sense: an overall quality score, presumably assigned by housing agencies, is likely to be a good indicator of the house's sale price.

Finally, comparing the 10 most important features obtained from the  Gradient Boosting model with our initial  assumptions (See Appendix 3), we see that  half of them matched. The other half could  potentially be the neglected predictors that  are valuable (e.g., number of fireplaces, lot  size, style of dwelling, etc.) 

Given the large number of attributes, we  wondered if we could reduce noise by using  techniques such as PCA to reduce the  number of features. While we were able to  reduce the number of features, we discovered that the accuracy of the model  was compromised.

This is likely due to the  high predictive power of the leading  attributes (e.g., overall quality). We ended  up not using PCA to preserve the performance of our existing models. 
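For reference, a PCA sketch in scikit-learn that keeps enough components to explain 95% of the variance (the threshold is illustrative; per the above, we did not keep PCA in our final pipeline):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))   # stand-in for the predictor matrix

# A float n_components asks PCA to keep the smallest number of
# components whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
```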

3.5 Comparison Between the Models

The advantage of the linear regression  model is the high interpretability, while the  disadvantage is the low accuracy. 

Random Forest and Gradient Boosting  both have higher accuracy and experience  no multicollinearity issue. The only  downside is that both are relatively hard to  interpret, and that the Gradient Boosting  algorithm runs much slower. 

(See Appendix 4) 

4. Kaggle Submission Result 

The Kaggle Competition (URL in  Appendix 6) uses Root Mean Squared  Logarithmic Error as the scoring criterion. 

Our final submission scores 0.13751 and ranks 1920th out of 5000+ submissions (Appendix 5). For comparison, the Sample Benchmark Submission scores 0.40613, ranking around 4500th.

A score of 0.11 would place roughly  around the top 100, which shows that our  prediction error is just slightly larger than  that of the top models.


Appendix

1. Feature Importance of Random Forest Regressor

2. Feature Importance of Gradient Boosting Regressor

3. Comparison of Feature Importance between Initial Assumption and Model Results

4. Model Performances
