Predicting House Prices Using an Advanced Regression Model
The American Dream is a phrase describing the pursuit of opportunities equally available to all Americans, allowing them to reach for their highest aspirations and goals. According to Rank (2014), the core components of the American Dream are, first, the freedom to pursue one's interests and passions in life; second, economic security and well-being; and lastly, hope and optimism with respect to seeing progress in one's life. These three beliefs constitute a common social goal that motivates Americans to uphold one another and to lead other societies by example. Even in an age of sharply polarized politics, this idealism seems to be the one resonating motivator that binds Americans together.
However, many critics argue that the contemporary interpretation of the phrase 'American Dream' has become more a reminder of a dream of individual success and wealth. According to Diamond (2018), the phrase meant the opposite a century ago of what it does now, and has been repurposed by each generation. The original dream was not about the prospect of personal accomplishment or affluence, but about the hope of equality, justice, and democracy for the nation. It was not until the Cold War that the phrase became an argument for a consumer-capitalist version of democracy, and it has not been seriously contested since the 1950s. That reading has consequently become the widely accepted understanding of the phrase for this generation.
Nevertheless, the ideology goes far beyond any one reading of the phrase. The primary focus has always been the abundance of equal opportunities available to Americans to achieve their individual dreams and raise their quality of living. These opportunities include access to affordable healthcare, higher education, higher-paying quality jobs, and more. Though the definition of quality of living is highly subjective, these opportunities are undoubtedly among the factors that play a major role in healthy living, financial security, and job satisfaction. Ultimately, the heart of the American Dream has always been the desire to own a home, both as a reward for having kept dreaming the dream and as an inspiration to the next generation of American dreamers.
Sources:
Rank, Mark. “What Is the American Dream?” Oxford University Press, 1 July 2014, blog.oup.com/2014/07/questioning-american-dream/.
Diamond, Anna. “The Original Meanings of the ‘American Dream’ and ‘America First’ Were Starkly Different From How We Use Them Today.” Smithsonian Magazine, Oct. 2018, smithsonianmag.com/history/behold-america-american-dream-slogan-book-sarah-churchwell-180970311/.

Home Buying Checklist (Source: Zillow Research)
Homeownership became an essential product of the American Dream, now synonymous with economic security and middle-class status. Buying a home may be the largest financial investment many people ever make, so potential buyers are advised to be financially ready before they purchase. Becoming a homeowner is not only a big financial commitment, but also a pledge to maintain the property so as to preserve both the home's financial value and the family's socioeconomic status. For many families, it is also a long-awaited moment, and they naturally feel compelled to search for the perfect property within their budget, one that can encompass many aspects of their lives. It is only natural, then, that a home-buying process carrying this much social weight, on top of the family's own needs and wants, feels overwhelming.
Fortunately, one way to cope with such an enormous life challenge is to follow a home-buying checklist. A typical checklist, much like the one depicted above, is a handy tool that breaks the home-buying process down step by step. It lays out a systematic way to examine every feature of a house in detail and judge its attractiveness to the buyer, which makes the process feel manageable and predictable. In this study, we will put ourselves in the shoes of a potential home buyer in a similar situation: we will examine the features of various properties in Ames, Iowa, and attempt to predict their sale prices using a statistical learning model.
Feature Engineering Flow Chart
Approach
We begin by dissecting the dataset into two major parts: categorical features, which capture the quality and type variables, and numeric features, which convey the quantity and area variables. We can then split the numeric variables further through exploratory data analysis: by checking each variable for normality, we can divide the numeric variables into two smaller groups separated by their level of skewness. Finally, we will transform the distributions of our numeric variables to be approximately symmetrical to minimize the variance in our predictions, as explained in detail below.
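A minimal sketch of this first split, assuming the Kaggle "House Prices: Advanced Regression Techniques" CSV layout; the file path and the 0.5 skewness cutoff are illustrative assumptions, not values taken from the notebook:

```python
import pandas as pd
from scipy.stats import skew

# Load the training data (hypothetical path) and separate the target.
train = pd.read_csv("data/train.csv")
target = train.pop("SalePrice")

# Categorical features are the object-typed columns; the rest are numeric.
categorical_cols = train.select_dtypes(include="object").columns
numeric_cols = train.select_dtypes(exclude="object").columns

# Split the numeric features by skewness: heavily skewed columns will get
# a power transform later, roughly symmetric ones a standard scaling.
skewness = train[numeric_cols].apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness.abs() > 0.5].index
symmetric_cols = skewness[skewness.abs() <= 0.5].index
```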
Assumption
We typically don't describe our dream house by the height of the basement ceiling or the proximity to an east-west railroad. But the dataset we are going to explore proves that price negotiations are influenced by much more than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, we are challenged to predict the final price of each home. These variables focus primarily on the quality and quantity of the property's physical attributes. They answer exactly the kinds of questions a typical home buyer would ask about a potential property, such as the year it was built, the square feet of living space, and the number of bedrooms. In this study, we will not include any information beyond the features described above to fit our prediction model; information about other neighborhoods is likewise absent from the dataset. Our focus is to predict the sale price of each property based solely on its physical attributes.
Goal
The goal of this study is to predict the sale prices of properties in Ames, Iowa by training a statistical model on various types of information about the physical attributes of each property.
Gradient Boosting Regression Model Diagram
Walkthrough
∘ Define Data Path
∘ Import Libraries
∘ Define Exploratory Data Analysis Functions
∘ Define ML Model Functions
∘ Define Plotting Functions
∘ Perform Data Acquisition and Preparation
1. Define DataFrames, Features and Target
Acquire Train Dataset and Transform DataFrame
∘ Retrieve Original Sample Train DataFrame
∘ Display Categorical Features in Train DataFrame
∘ Display Numerical Features in Train DataFrame
Acquire Test Dataset and Transform DataFrame
∘ Retrieve Original Sample Test DataFrame
∘ Display Categorical Features in Test DataFrame
∘ Display Numerical Features in Test DataFrame
∘ Handling Missing Values
∘ Display Missing Values DataFrame
Plotting Top 20 Columns with the Most Missing Values
∘ Display Missing Values Handling Function
Our feature engineering begins with checking the data for missing values and inconsistent information. For example, a significant number of values are missing for the quality of the swimming pool. To determine a pool's quality, we first have to make sure the property actually includes a pool. By checking the field containing the square footage of the pool, we can extract that key detail: if the pool area is missing, or its value is zero, we can safely assume the property has no pool. On the other hand, if the property does include a swimming pool, with an area greater than zero, we will fill the missing pool quality as below average, so that we neither exaggerate nor underrate the attribute when giving the benefit of the doubt. By cross-checking different pieces of information about the same physical attribute of a property, we can manage any inconsistency in our data. With this rationale, we will tackle the rest of the features with missing values.
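As a minimal sketch of this repair, reusing the `train` DataFrame from the earlier snippet ('PoolQC' and 'PoolArea' are the actual Ames column names; the fill values are one reasonable choice, not the only one):

```python
# Houses whose recorded pool area is missing or zero have no pool.
no_pool = train["PoolArea"].fillna(0) == 0

# No pool: mark the missing quality explicitly as "None".
train.loc[no_pool, "PoolQC"] = train.loc[no_pool, "PoolQC"].fillna("None")

# A pool exists but its quality is missing: grade it below average
# ("Fa" = Fair) rather than guess a flattering value.
train.loc[~no_pool, "PoolQC"] = train.loc[~no_pool, "PoolQC"].fillna("Fa")
```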
∘ Exploring Numeric Features
Plotting Distribution of Numeric Features
Plotting Linear Regression of Numeric Features vs. Target
∘ Exploring Categorical Features
Calculating Value Counts per Column
Plotting Distribution of Categorical Features vs. Target
∘ Graphing Correlation Between Numeric Features and Target
Correlation Between Numeric Features and Target
∘ Correlation Between Numeric Features
The evidence that a simple linear relationship between our numeric features and the target is not a viable analysis becomes more apparent when we note how the variables correlate with the target and with each other. We can observe a strong positive correlation between OverallQual and the target values, and a similar relationship between GrLivArea and sale price. On the other hand, we can observe a right-skewed distribution of FireplaceQu against the target values, and likewise for HouseType against sale price. Overall, aside from these standouts, the correlations between either type of feature and the target values generally indicate weak linear relationships.
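A minimal sketch of this correlation check, reusing `train`, `target`, and `numeric_cols` from the earlier snippets:

```python
# Correlation of each numeric feature with the sale price, strongest first.
corr_with_target = train[numeric_cols].corrwith(target).sort_values(ascending=False)
print(corr_with_target.head(10))  # OverallQual and GrLivArea rank near the top

# Pairwise correlations among the numeric features themselves, useful for
# spotting collinear pairs such as GarageCars and GarageArea.
feature_corr = train[numeric_cols].corr()
```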
∘ Transform DataFrame to Fit Our Model
As explained above, it is important to work with normalized distributions to minimize the variance in our predictions. Thus, we will apply a number of transformations according to variable type, similar to those depicted above. We will encode our categorical variables as a series of numbers so the computer can interpret them more easily. Next, we will apply the Yeo-Johnson power transformation to our more skewed numeric variables, and a standard scaler to our approximately symmetric numeric variables. Lastly, we will apply a log transformation to our target variable. The right half of the image above depicts the transformed, normalized distributions of our numeric values, which we will use to fit our prediction model.
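A minimal sketch of these transformations with scikit-learn; the ordinal encoding of the categoricals is my assumption (one-hot encoding would work as well), and the column lists come from the earlier skewness split:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, PowerTransformer, StandardScaler

preprocessor = ColumnTransformer([
    # Categorical features become integer codes the model can consume.
    ("categorical", OrdinalEncoder(handle_unknown="use_encoded_value",
                                   unknown_value=-1), list(categorical_cols)),
    # Heavily skewed numerics get the Yeo-Johnson power transform.
    ("skewed", PowerTransformer(method="yeo-johnson"), list(skewed_cols)),
    # Roughly symmetric numerics only need centering and scaling.
    ("symmetric", StandardScaler(), list(symmetric_cols)),
])

X = preprocessor.fit_transform(train)
y = np.log1p(target)  # log-transform the sale price to reduce its skew
```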
2. Perform Manual Parameter Tuning for Better Model Fit
∘ Define Preprocessing Transformers and Pipeline
∘ Define Each Model's Parameters
The model we are fitting is the Gradient Boosting Regressor from Scikit-Learn. The versatility of a tree ensemble algorithm is very impressive, and it handles bias and variance very well. However, we have to be careful not to overfit the training data, which happens quite often. So the proposed solution is to sweep a very broad range of values for each parameter to see its effect on predictive accuracy, and determine which combinations of parameters lead to an underfitting or overfitting model. Later, we can narrow our combinations down to a select few depending on the performance of the initial fits. Finally, we will use this smaller set of parameters in a more aggressive search to find the best estimator.
∘ Plotting Fit Performance Using RMSE Score and Percent Difference
∘ Determine Fit Performance with RMSE Score DataFrame For Better Fit
The parameters we are going to iterate over are as follows: the loss function, the subsample ratio, the number of estimators, the feature-selection strategy, the maximum tree depth, the minimum number of samples per split, the minimum number of samples per leaf, and the learning rate. In this initial tuning, we compare the Root Mean Square Error, which is the standard deviation of the prediction errors, between the training and the validation set. The combinations with the smallest percentage difference between the two scores will be used to fit the prediction model with the best estimator.
Here are the ranges of parameters in the "goldilocks zone" of the bias-variance plot for this study (see the sketch after this list):
- Loss Function: Huber, Quantile
- Subsample: 0.3 - 0.7
- Number of Estimators: 1000
- Max Features: log2
- Maximum Depth: 2 - 11
- Minimum Samples Split: 2 - 11
- Minimum Samples Leaf: 2 - 11
- Learning Rate: 0.001 - 0.1
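A minimal sketch of this sweep expressed as a parameter grid; the grid mirrors the ranges above, thinned to a few representative values so the search stays tractable:

```python
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    "loss": ["huber", "quantile"],
    "subsample": [0.3, 0.5, 0.7],
    "n_estimators": [1000],
    "max_features": ["log2"],
    "max_depth": [2, 5, 8, 11],
    "min_samples_split": [2, 5, 11],
    "min_samples_leaf": [2, 5, 11],
    "learning_rate": [0.001, 0.01, 0.1],
}

base_model = GradientBoostingRegressor(random_state=42)
```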
3. Improve Model Further with GridSearch and Cross-Validation Methods
∘ Define Parameters for Best Estimator
∘ Perform GridSearch Method, Model Fitting with Best Estimator, and Cross-Validate Fitting
Display GridSearch Results, Feature Importance, and Predictions
∘ Determine Fit Performance with GridSearch Best Estimator and RMSE Score DataFrame
∘ Determine Feature Importance with Feature Importance DataFrame
∘ Make Predictions on Test Set DataFrame
Before fitting the model, we now have a narrowed-down set of parameter combinations from our initial tuning. We will tune our prediction model with a more aggressive search to find the best parameters. We will split our training data with the shuffle-split method both inside and outside of the grid search for cross-validation. Next, we will fit the best estimator of our gradient boosting regressor model. Lastly, we score the fit and the predicted values with the Root Mean Squared Error.
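A minimal sketch of this step, assuming `X`, `y`, `base_model`, and a narrowed `param_grid` carried over from the earlier snippets:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split

# Shuffle-split once outside the grid search to hold out a validation set...
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42)

# ...and shuffle-split again inside it for cross-validation.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
search = GridSearchCV(base_model, param_grid, cv=cv,
                      scoring="neg_root_mean_squared_error", n_jobs=-1)
search.fit(X_train, y_train)

# Score the best estimator on the held-out set with RMSE (log-price scale).
best_model = search.best_estimator_
val_rmse = np.sqrt(np.mean((best_model.predict(X_val) - y_val) ** 2))
print(f"Validation RMSE: {val_rmse:.4f}")
```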
∘ Plotting Gradient Boosting Regressor Model Fitted with Best Estimator On A Tree Diagram
Gradient Boosting Regressor Model Diagram
∘ Plotting Tree Diagram of Gradient Boosting Regression Model
∘ Plotting Feature Importance Bar Graph
Feature Importance Bar Charts
The feature importance percentage tells us how much influence each feature has on our model's predictions for the test data. The feature with the highest importance is 'GrLivArea_log', followed by 'GarageCars_square' and 'GrLivArea'. However, the splitting nodes near the top of the tree hierarchy are 'GrLivArea', followed by 'GarageCars_log' and 'YearRemodAdd_log'. The relative position of a splitting node does not necessarily mean the feature is more important to the predictive model as a whole; it only means the feature is a more prominent splitter individually. The feature importance metric reflects a feature's overall contribution across many decisions, defined by the summation of its Gini importance, whereas position in the tree hierarchy reflects the strength of a feature for a single split, defined by its individual Gini importance. So even though 'GrLivArea_log' has the highest feature importance, i.e., the largest summed Gini importance across the entire regressor model, 'GrLivArea' has the highest importance for a single instance, i.e., the largest individual Gini importance.
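As a minimal sketch, reading the importances back out of the fitted model (feature names come from the fitted `preprocessor` defined earlier; the engineered columns such as 'GrLivArea_log' would appear here only if the corresponding features were added during preprocessing):

```python
import pandas as pd

# Impurity-based importances, summed over every split each feature makes.
importances = pd.Series(
    best_model.feature_importances_,
    index=preprocessor.get_feature_names_out(),
).sort_values(ascending=False)

print(importances.head(10))  # the top contributors to the model's predictions
```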