What is the median house price in Boston suburbs?

Boston Suburbs Home Values Statistics : Net Worth Of Homebuyers

This report studies the Boston housing dataset collected by the U.S. Census Service and investigates the correlation between the variable “MEDV” and the other attributes in the dataset. To approach this task, the research questions are subdivided into several aspects to give a comprehensive picture. The report starts with an initial data analysis (IDA) and model selection, followed by assumption checking and a discussion of the results. Using multiple regression, we generated a model of “MEDV” that achieved an adjusted R-squared of 0.78 and a root-mean-square error (RMSE) of 0.19.


Background. In the first phase, an initial exploration of the attribute “MEDV” and its relationships with the other variables gives us an overview of the dataset. After several modeling experiments, stepwise regression produced a simple and easily interpretable “MEDV” model with fewer insignificant variables. The model selection step gave us insight into the intricate correlations between “MEDV” and the other variables, allowing the model to be improved to a statistically significant extent.

Meanwhile, we have developed an interest in studying the key assumptions underlying our multiple regression. The data were censored by the U.S. Census Service, restricting the range of prices to 0-50 (in $1000s). As a result, the true value of any house worth more than 50k is unknown, and studies have shown that there were houses valued at more than 50k (3). Hence our study of the “MEDV” variable is restricted by the ceiling of 50.

Moreover, some of the house values have been incorrectly recorded, as evidenced by the study of Otis W. Gilley (3), which also limits the accuracy of our “MEDV” model.


Assumptions. Assumption checking was carried out before and after variable selection to support the validity of the model. The residuals are symmetrically distributed above and below zero, so the linearity assumption is reasonable. The houses in Boston are taken to be unrelated to one another, which satisfies the independence assumption. There are some outliers in the residuals vs fitted plot, so we rely on the central limit theorem. The residuals do not appear to fan out or change their variability over the range of the fitted values, so the homoscedasticity assumption is met. The residuals QQ plot shows that most points lie close to the straight line; together with the central limit theorem, this means the normality assumption is satisfied for the model.

The assumptions of linearity, independence, homoscedasticity and normality have all been carefully examined to ensure our model is valid and reliable.
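The visual checks described above can be complemented by a few numbers. The following is a minimal sketch, not the report's own procedure: the function name `residual_checks` and the toy fitted values and residuals are assumptions made for illustration. It reports the mean residual, the share of positive residuals (symmetry around zero), and the ratio of residual spread between the upper and lower halves of the fitted values (a rough indication of fanning out).

```python
import statistics

def residual_checks(fitted, residuals):
    """Crude numeric companions to the residuals vs fitted plot."""
    pairs = sorted(zip(fitted, residuals))      # order residuals by fitted value
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]          # residuals at low fitted values
    high = [r for _, r in pairs[half:]]         # residuals at high fitted values
    return {
        "mean_residual": statistics.fmean(residuals),
        "share_positive": sum(r > 0 for r in residuals) / len(residuals),
        "spread_ratio": statistics.pstdev(high) / statistics.pstdev(low),
    }

# Toy values for illustration only (not output from the report's model).
fitted = [2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5]
residuals = [0.1, -0.1, 0.2, -0.2, -0.1, 0.1, -0.2, 0.2]
checks = residual_checks(fitted, residuals)
print(checks)
```

A mean residual near zero, a positive share near one half and a spread ratio near one are consistent with the linearity and homoscedasticity assumptions, but they do not replace the plots.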

The simplified model provides enough information to predict how a proportional change in each of the other variables is reflected in our target variable “MEDV” while the remaining variables are held constant. This answers our main research question of how the different variables interact with “MEDV.” In the concluding step, we evaluate several performance indicators to analyse the model’s efficiency.

Data Set. The Boston housing data is of natural origin: this publicly available dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. There are 14 columns and 506 rows in this data set, and all of the attributes are numerical. Each variable describes a Boston suburb or town. The first analysis of the dataset was made by David Harrison Jr. and Daniel L. Rubinfeld in “Hedonic housing prices and the demand for clean air” (2).

Our target variable “MEDV” describes the median value of owner-occupied homes in $1000s. A detailed data dictionary describing all the variables is attached at the end of this report.

During the study of the dataset, we noted that the prices of homes are capped at 50, which was caused by censoring of the data by the U.S. Census Service.
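One way to see how much of the data is affected by the cap is to count the values that sit exactly at the ceiling. The sketch below is only illustrative: the list `medv` holds made-up toy values rather than rows from the real dataset, and the helper name `censoring_summary` is an assumption.

```python
# Toy MEDV values in $1000s (illustrative only, not taken from the real dataset).
medv = [24.0, 21.6, 34.7, 50.0, 33.4, 50.0, 28.7, 22.9, 50.0, 16.5]

CAP = 50.0  # ceiling described in the text

def censoring_summary(values, cap=CAP):
    """Return (count at the cap, share at the cap, largest value)."""
    at_cap = sum(1 for v in values if v >= cap)
    return at_cap, at_cap / len(values), max(values)

n_capped, share_capped, largest = censoring_summary(medv)
print(f"{n_capped} of {len(medv)} values sit at the cap ({share_capped:.0%}); max = {largest}")
```

Values at the cap are lower bounds rather than exact prices, so a regression fitted to them will tend to understate the upper end of the market.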

Fig. 1. Residuals plots for the final regression model 


Predicting the median value of a home is complicated because it is related to many factors. Our final model tells us that the relationship between the nitric oxides concentration (NOX) and the median value of a home is very close.

Model Selection. A log transformation was applied in an attempt to bring the MEDV variable into a linear relationship with the independent variables.

First, we fit the full model. Although its R-squared value shows that 74% of the variation in MEDV is explained by this regression model, some variables have high p-values (see pic 1) and so are not significant, so we need to drop them and find a better model. We then run backward and forward searches using AIC to obtain two regression models, compare them, and choose the one with the smaller AIC value as the more appropriate one. The models found by the backward and forward searches are quite similar, as shown in pic 2. They have the same R-squared (0.7406) and RMSE (4.736) values.
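The AIC comparison can be illustrated with a small calculation. This is a sketch under stated assumptions: it uses the least-squares form of AIC, n·ln(RSS/n) + 2k, which is defined only up to an additive constant, and the residual sums of squares and coefficient counts below are hypothetical, not values from the report.

```python
import math

def aic_gaussian(rss, n, k):
    """AIC of a least-squares fit, up to an additive constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

n_obs = 506  # number of rows in the dataset

# Hypothetical candidates: (label, residual sum of squares, number of coefficients).
candidates = [
    ("full model", 11000.0, 14),
    ("reduced model", 11010.0, 12),
]

scores = {label: aic_gaussian(rss, n_obs, k) for label, rss, k in candidates}
best = min(scores, key=scores.get)  # smaller AIC is preferred
print(scores, best)
```

In this toy case the reduced model fits very slightly worse but uses two fewer coefficients, so the penalty term makes its AIC smaller.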

However, we found that the residuals vs fitted plot for this model does not look good (pic 4). The blue line is quite curved, so we change the y variable to log(y) and get a new model (pic 5). This model has a higher R-squared value, which means more of the variation is explained by it, as well as a lower RMSE value, and all x variables are significant. Thus, we take this model as the final model.
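The effect of replacing y with log(y) can be sketched on synthetic data. The example below is an illustration, not the report's model: it uses a single made-up predictor with an exactly exponential relationship, and the helper `fit_line` is a hypothetical name. It fits a straight line to log(y) and then back-transforms the predictions with `exp`, since a model fitted on the log scale predicts log(y), not y.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

# Synthetic data with a curved (exponential) relationship, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [40.0 * math.exp(-0.3 * xi) for xi in x]

a, b = fit_line(x, [math.log(yi) for yi in y])   # fit on the log scale
pred = [math.exp(a + b * xi) for xi in x]         # back-transform to the original scale
print(round(a, 4), round(b, 4))
```

Errors computed on log(y) are in log units, so they need to be back-transformed before being compared with errors from a model fitted to y directly.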

Performance. We compared our final model with the full model, in which all variables in the data set are used as predictors. Furthermore, out-of-sample performance was tested using the “caret” package with 10-fold cross-validation.
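The idea behind 10-fold cross-validation is to split the rows into ten parts and hold each part out once for testing. The sketch below shows only the splitting step in plain Python; it is not how the caret package implements it, and the function name `k_fold_indices` and the seed are assumptions.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and split them into k nearly equal test folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

N_ROWS = 506  # rows in the dataset
folds = k_fold_indices(N_ROWS, k=10)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [j for j in range(N_ROWS) if j not in held_out]
    # A model would be fitted on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```

Each row is used for testing exactly once, and the reported out-of-sample metric is typically the average over the ten folds.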

Our final model formula consists of 11 variables. We measure performance with the root-mean-square error (RMSE), R-squared and mean absolute error (MAE). Compared with the full model, which has an RMSE of 4.78, the RMSE of the final model (0.19) is much smaller, which indicates a smaller prediction error for the final model. Note, however, that the final model's errors are measured on the log(MEDV) scale while the full model's are in $1000s, so part of this difference reflects the change of scale. In addition, the final model has a larger R-squared value (0.78); that is, it explains 78% of the variation in the dependent variable. Moreover, the MAE of the final model is 0.14, which is also much smaller than the MAE of the full model (3.37). Hence, we conclude that our final model is a good fit for predicting the variable “MEDV”.
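The performance indicators named above have simple definitions, sketched here with toy numbers. The actual and predicted values are made up for illustration and the function names are assumptions; the formulas themselves are the standard ones.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variation in `actual` explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """R-squared penalised for the number of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Toy values for illustration only.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 37.0]
r2 = r_squared(actual, predicted)
print(rmse(actual, predicted), mae(actual, predicted), r2, adjusted_r_squared(r2, 4, 1))
```

RMSE and MAE carry the units of the response, so they are only comparable between models whose responses are on the same scale; R-squared is unitless.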


Limitations. Apart from the limitations of the data source discussed in the data description, the theoretical limitation we encountered arises from the principal drawbacks of stepwise multiple regression. Studies (1) have indicated that bias may exist in parameter estimation, and that inconsistencies among model selection algorithms can also be problematic. Relying on a single best model also carries risk.

Conclusion. Based on the assessment of various performance indicators, including R-squared and the root-mean-square error of the “MEDV” model, our report concludes that the model fits the data well. However, the limitations outlined in this report should also be considered to ensure the integrity and accuracy of the model.

Indeed, multiple regression analysis is powerful in modelling certain variables in the Boston housing dataset. By applying a backward/forward search, we reduced the number of variables to the subset that gives the best-performing model. However, other prediction methods may produce a better model than multiple regression. It is through consistent experimentation with various algorithms, such as k-nearest neighbours and polynomial regression, that we can discover the most appropriate model for the variable and deepen our understanding through collaborative work.


GitHub repository 

(1) Mark J. Whittingham, Philip A. Stephens, Richard B. Bradbury & Robert P. Freckleton, Why do we still use stepwise modelling in ecology and behaviour? Available at: https://eprints.ncl.ac.uk/file_store/production/56364/AE9762E1-23E2-4C9D-9518-5BCDAD47FAAB.pdf

(2) David Harrison & Daniel Rubinfeld, Hedonic housing prices and the demand for clean air (1978), Journal of Environmental Economics and Management

(3) O. W. Gilley, On the Harrison and Rubinfeld Data (1996), Journal of Environmental Economics and Management
