# Curiousily

## Predicting House Prices

R, Regression, Random Forest


So you have a house for sale, or you are buying one? What is a fair price for it? Can we predict it accurately?

Let’s use the “House Sales in King County” data available at Kaggle to answer that question. Each row of the dataset contains information about a home sold between May 2014 and May 2015 along with the price in US dollars. Some of the other features include:

• bedrooms - number of bedrooms
• bathrooms - number of bathrooms
• floors - number of floors
• yr_built - year built
• zipcode
• long - longitude
• lat - latitude
• condition - building condition (ordered categorical variable in the range 1 - 5)
• grade - construction quality of improvements (ordered categorical variable in the range 1 - 13)

Even if you are not interested in house prices, you can still learn something about regression, classification trees, and extreme gradient boosting.

# Fire up R and load some libraries

    library(ggplot2)
    library(reshape2)
    library(plyr)
    library(dplyr)
    library(rpart)
    library(rpart.plot)
    library(caret)
    library(doMC)
    library(scales)
    library(GGally)

Load our utility functions, make the results reproducible, and instruct R to use all our CPU cores (my PC has 8 cores; you might want to change that value for yours).

    source("utils.R")

    set.seed(42)
    theme_set(theme_minimal())
    registerDoMC(cores = 8)
    options(warn=-1)

# Load and preprocess the dataset

    df <- read.csv("data/kc_house_data.csv", stringsAsFactors = FALSE)
    print(paste("rows:", nrow(df), "cols:", ncol(df)))

    [1] "rows: 21613 cols: 21"

Remove id and date columns and instruct R to interpret condition, view, grade and waterfront as factors.

    df <- df[-c(1, 2)]
    df$condition <- as.factor(df$condition)
    df$view <- as.factor(df$view)
    df$grade <- as.factor(df$grade)
    df$waterfront <- as.factor(df$waterfront)

# Exploration

Do we have missing data?

    ggplot_missing(df)


It looks like nothing is missing. Great!
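`ggplot_missing` comes from the `utils.R` helper file, which is not shown here. If you don't have it at hand, a plain count of `NA` values per column answers the same question in text form. The sketch below uses a tiny made-up data frame (`toy`, an assumption for illustration) as a stand-in for `df`:

```r
# Toy stand-in for df; the values are made up for illustration only
toy <- data.frame(
  price    = c(221900, 538000, NA, 604000),
  bedrooms = c(3, 3, 2, NA),
  floors   = c(1, 2, 1, 1)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# TRUE only when no value is missing anywhere in the data frame
no_missing <- !anyNA(toy)
print(no_missing)
```

Replacing `toy` with `df` runs the same check on the real dataset.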

## Maps

The following awesome maps were created by Thierry Ellena. Let’s have a look at them:

Let’s look at the distribution of house condition, grade and price:

    p1 <- qplot(condition, data=df, geom = "bar",
        main="Number of houses by condition")

    p2 <- qplot(grade, data=df, geom = "bar",
        main="Number of houses by grade")

    p3 <- ggplot(df, aes(price)) + geom_density() +
        scale_y_continuous(labels = comma) +
        scale_x_continuous(labels = comma, limits = c(0, 2e+06)) +
        xlab("price") +
        ggtitle("Price distribution")

    multiplot(p1, p2, p3)


And a look at price (log10) vs other features:

    ggplot(df, aes(x=log10(price), y=sqft_living)) +
        geom_smooth() +
        scale_y_continuous(labels = comma) +
        scale_x_continuous(labels = comma) +
        ylab("sqft of living area") +
        geom_point(shape=1, alpha=1/10) +
        ggtitle("Price (log10) vs sqft of living area")


    ggplot(df, aes(x=grade, y=log10(price))) +
        geom_boxplot() +
        scale_y_continuous(labels = comma) +
        coord_flip() +
        geom_point(shape=1, alpha=1/10) +
        ggtitle("Price (log10) vs grade")


    ggplot(df, aes(x=condition, y=log10(price))) +
        geom_boxplot() +
        scale_y_continuous(labels = comma) +
        coord_flip() +
        geom_point(shape=1, alpha=1/10) +
        ggtitle("Price (log10) vs condition")


    ggplot(df, aes(x=as.factor(floors), y=log10(price))) +
        geom_boxplot() +
        scale_y_continuous(labels = comma) +
        xlab("floors") +
        coord_flip() +
        geom_point(shape=1, alpha=1/10) +
        ggtitle("Price (log10) vs number of floors")


How do the different features correlate?

    ggcorr(df, hjust = 0.8, layout.exp = 1) +
        ggtitle("Correlation between house features")

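A correlation plot gives the overall picture. To rank features by how they move with price, one option is to compute the correlations directly with base R's `cor()`. This is a minimal sketch on made-up data; the `toy` data frame and its values are assumptions, not the King County data:

```r
# Made-up numeric features standing in for df
toy <- data.frame(
  price       = c(200000, 350000, 500000, 650000, 800000),
  sqft_living = c(900, 1500, 2100, 2600, 3400),
  yr_built    = c(1990, 1955, 2005, 1978, 1969)
)

# Keep numeric columns only; factor columns are left out of the correlation
numeric_cols <- toy[sapply(toy, is.numeric)]

# Correlation of every numeric feature with price, highest first
price_cor <- cor(numeric_cols)[, "price"]
price_cor <- sort(price_cor[names(price_cor) != "price"], decreasing = TRUE)
print(round(price_cor, 2))
```

Since `condition`, `view`, `grade` and `waterfront` were converted to factors earlier, the `is.numeric` filter would skip them on the real `df` as well.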

# Splitting the data

We will split the data using the caret package. 90% will be used for training and 10% for testing.
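The splitting code itself is not shown here, so the following is only a sketch of how such a 90/10 split could look with caret's `createDataPartition`. The `toy` data frame is made up, and the names `train_idx`, `train` and `test` are assumptions chosen for illustration:

```r
library(caret)

set.seed(42)

# Made-up stand-in for df: 200 houses with a price and one feature
toy <- data.frame(
  price       = runif(200, min = 100000, max = 1000000),
  sqft_living = runif(200, min = 500, max = 5000)
)

# Row indices for the training set: about 90% of the rows,
# sampled within price quantile groups so both sets cover the price range
train_idx <- createDataPartition(toy$price, p = 0.9, list = FALSE)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

print(c(train = nrow(train), test = nrow(test)))
```

Because the split is made on the outcome (`price`), the training and test sets end up with similar price distributions, which matters when the test set is as small as 10% of the data.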


# Compare distributions of predictions

Let’s see how the distributions of the predictions from the two tree-based models compare to the actual prices:

    res <- data.frame(price=c(tree_predicted, xgb_predicted, test_labels),
                      type=c(replicate(length(tree_predicted), "tree"),
                             replicate(length(xgb_predicted), "xgb"),
                             replicate(length(test_labels), "actual")
                            ))

    ggplot(res, aes(x=price, colour=type)) +
        scale_x_continuous(labels = comma, limits = c(0,2e+06)) +
        scale_y_continuous(labels = comma) +
        geom_density()


Again, we can confirm that the distribution of the predictions from the boosted trees model follows the actual prices much more closely.

# How well did we do, really?

Let’s randomly choose 10 rows and look at the difference between predicted and actual price:

    # Draw 10 test rows at random and predict their prices
    test_sample <- sample_n(test, 10, replace=FALSE)
    test_predictions <- predict(xgb_fit, test_sample, type = "raw")
    actual_prices <- round(test_sample$price, 0)
    predicted_prices <- round(test_predictions, 0)
    data.frame(actual=actual_prices,
        predicted=predicted_prices,
        difference=actual_prices-predicted_prices)
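| actual | predicted | difference |
|---:|---:|---:|
| 680000 | 566726 | 113274 |
| 1400000 | 1502961 | -102961 |
| 400000 | 465854 | -65854 |
| 468000 | 382870 | 85130 |
| 220000 | 208510 | 11490 |
| 525000 | 553434 | -28434 |
| 404000 | 559599 | -155599 |
| 327000 | 316226 | 10774 |
| 475000 | 460288 | 14712 |
| 443000 | 431310 | 11690 |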
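To put a single number on these ten rows, we can summarise the differences with a few common error metrics. The sketch below copies the values from the table above; the variable names are chosen for illustration:

```r
# The ten actual and predicted prices from the table above
actual <- c(680000, 1400000, 400000, 468000, 220000,
            525000, 404000, 327000, 475000, 443000)
predicted <- c(566726, 1502961, 465854, 382870, 208510,
               553434, 559599, 316226, 460288, 431310)

errors <- actual - predicted

mae  <- mean(abs(errors))            # mean absolute error, in dollars
rmse <- sqrt(mean(errors^2))         # root mean squared error, in dollars
mape <- mean(abs(errors) / actual)   # mean absolute percentage error, as a fraction

print(c(MAE = mae, RMSE = rmse, MAPE = mape))
```

On these ten rows the mean absolute error comes out at roughly $60,000 (59,991.8). Ten rows is a very small sample, so treat the numbers as an illustration rather than as an evaluation of the model.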

Is this good? Well, personally I expected more. However, there are certainly more things to try if you are up to it. One interesting question that arises after receiving a prediction is: how sure is the model that the price is what it tells us it is? But that is a topic for another post.



© 2020 Curiousily by Venelin Valkov