— Data Imputation, R — 4 min read
Let’s begin with you trying to watch this video:
This video has 300k+ views and around 1.6k positive reactions. It might be worth watching, even though it is in a language that will sound cryptic to most of you (know Bulgarian? That’s what it feels like to have a serious superpower!). But you have a problem, right? You don’t know what the video is all about!
Well, you are missing some data. In your case, subtitles might be a good start. However, knowing the language and experiencing the emotions the actor is trying to convey would be much better! So, it is not enough to have some data; it is best to have the real thing.
Let’s try to translate the problem into R
Let’s say we are trying to find the mean weight of all women in New York City. We weigh 1,000 women and write down their weights. The first 5 are as follows:
```r
weights_sample <- c(55.4, 48.5, 58.5, 63.4, 67.4)
```
We can easily get the mean of that sample:
```r
mean(weights_sample)
```
```
[1] 58.64
```
But let’s say that the last value was missing and we are trying to get the mean of the sample:
```r
weights_missing <- c(55.4, 48.5, 58.5, 63.4, NA)
mean(weights_missing)
```
```
[1] NA
```
Huh? The result is undefined. What did you expect? Can we hack our way around this? Of course, let’s be sneaky and just ignore the missing value.
```r
mean(weights_missing, na.rm = TRUE)
```
```
[1] 56.45
```
Yep, we’ve done it! Good job! And this is the end of the post. Or is it?
Just imagine this title in your favorite newspaper tomorrow morning:
Scientists found that the average weight of a woman in New York City is 56.45 kilograms.
Is this correct? Can we just ignore the values that are missing? And the answer is - it depends.
So, what is the best way to handle missing data? As Orchard and Woodbury (1972, p. 697) remarked:
Obviously, the best way to treat missing data is not to have them.
Of course, they knew that this was an impossible ideal to achieve in practice.
Allison (2002, p. 1) observed that:
Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data.
So, how should we deal with this?
Before answering that question, we have to familiarize ourselves with some terminology.
Rubin (1976) defined three categories of missing data problems. He stated that every data point has some probability of being missing. The process that governs these probabilities is called the missing data mechanism or response mechanism. A model for this process is called a missing data model or response model.
When every data point has the same probability of being missing, the data are said to be missing completely at random (MCAR). For example, let’s say we go to every woman in NYC, flip a coin, and weigh her only if the coin shows heads. Each data point will then have the same probability of being missing, namely the probability of the coin landing heads (0.5).
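The coin-flip mechanism is easy to simulate. A minimal sketch, assuming a small hypothetical sample of weights (the vector and the 0.5 probability are illustrative, not real data):

```r
set.seed(42)

# A full (hypothetical) sample of weights in kilograms
weights_full <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MCAR: every value has the same probability of being missing
# (the coin flip), independent of the data itself
coin_heads   <- rbinom(length(weights_full), size = 1, prob = 0.5) == 1
weights_mcar <- ifelse(coin_heads, weights_full, NA)
weights_mcar
```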
If the probability of being missing is the same only within groups defined by the observed data, then the data are missing at random (MAR). Let’s take our previous example and modify it a bit: we weigh a woman only if she is between 30 and 40 years of age. Within that age group, every weight has the same probability of being observed; outside it, the weights are always missing. Since age is observed, the mechanism is MAR. MAR is a generalization of MCAR and is thus much more useful in practice.
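The age-restricted version can be sketched like this (the ages are assumed for illustration; the key point is that missingness depends only on the *observed* age, not on the weight itself):

```r
# Observed ages (assumed for illustration) alongside the weights
ages        <- c(25, 34, 52, 38, 61)
weights_all <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MAR: whether a weight is missing depends only on the observed age -
# women outside the 30-40 group are never weighed
in_group    <- ages >= 30 & ages <= 40
weights_mar <- ifelse(in_group, weights_all, NA)
weights_mar  # 48.5 and 63.4 observed, the rest missing
```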
You might already know what this category represents. Missing not at random (MNAR, also written "not missing at random") means that the probability of being missing depends on information we did not observe, possibly on the missing values themselves. For example, the heavier ladies from the example above might not be so eager to weigh themselves, and we might have no way of knowing that.
MNAR is really hard to handle in practice. Strategies for this case include finding the cause of the missing data, or formulating hypotheses (what-if analyses) and checking how sensitive the results are to them.
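The "heavier ladies skip the scale" scenario can be simulated too. A sketch, assuming (purely for illustration) that the probability of skipping rises with the unobserved weight itself:

```r
set.seed(42)

weights_all <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MNAR: the probability of being missing depends on the unobserved
# value itself - heavier women are more likely to skip the scale
p_missing    <- plogis(weights_all - 60)  # heavier => more likely missing
skip         <- rbinom(length(weights_all), size = 1, prob = p_missing) == 1
weights_mnar <- ifelse(skip, NA, weights_all)
weights_mnar
```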
Listwise deletion
Just delete the rows that have missing data. It is a really convenient way to handle the issue. Under MCAR this might be an acceptable thing to do. However, Schafer and Graham (2002) demonstrated, using an elegant simulation, the bias of listwise deletion under MAR and MNAR.
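In code, listwise deletion is a one-liner. A base-R sketch (the heights vector is assumed for illustration; `tidyr::drop_na()` does the same for tibbles):

```r
df <- data.frame(
  heights = c(171, 165, NA, 173, 172),
  weights = c(55.4, 48.5, 58.5, 63.4, NA)
)

# Listwise (complete-case) deletion: drop every row containing an NA
df_complete <- na.omit(df)
nrow(df_complete)  # 3 of the 5 rows survive
```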
Pairwise deletion
This method tries to be a bit smarter than listwise deletion. Pairwise deletion works by calculating the means and (co)variances on all observed data. Given two variables X and Y, the mean of X is based on the observed X values, and the same goes for Y. To calculate the correlation and covariance, pairwise deletion takes all rows where both X and Y are present.
```r
library(tidyverse)

heights_missing <- c(171, 165, NA, 173, 172)
df <- tibble(heights = heights_missing, weights = weights_missing)
```
```r
lapply(df, mean, na.rm = TRUE)
```
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cov(df, use = "pair")
```
|         |  heights |  weights |
|---------|----------|----------|
| heights | 12.91667 | 29.43333 |
| weights | 29.43333 | 38.93667 |
This method is an easy way to compute the mean, correlations, and covariances under MCAR. However, the estimates can be biased when the data is not MCAR.
The big selling point of pairwise deletion is that it tries to use all available data.
Mean imputation
This one is really simple: replace every missing value by the mean of the observed values. Let’s try it:
```r
library(mice)

imputed <- mice(df, method = "mean", m = 1, maxit = 1)
```
```
 iter imp variable
  1   1  heights  weights
```
```r
completed <- complete(imputed, 1)
completed
```
| heights | weights |
|---------|---------|
|  171.00 |   55.40 |
|  165.00 |   48.50 |
|  170.25 |   58.50 |
|  173.00 |   63.40 |
|  172.00 |   56.45 |
Let’s have another look at the initial data and the correlations:
```r
df
```
| heights | weights |
|---------|---------|
|     171 |    55.4 |
|     165 |    48.5 |
|      NA |    58.5 |
|     173 |    63.4 |
|     172 |      NA |
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cor(completed, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.8927452 |
| weights | 0.8927452 | 1.0000000 |
The correlations are different. That’s no good. Granted, this is a bit of a contrived example, but you can see how this can happen in the real world.
Mean imputation offers a simple and fast fix for missing data. However, it will bias any estimate other than the mean when data are not MCAR.
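One way to see the bias directly: mean imputation preserves the mean but shrinks the variance, because every imputed value sits exactly at the center of the distribution. A quick check on the weights vector from above:

```r
x <- c(55.4, 48.5, 58.5, 63.4, NA)

# Replace the missing value by the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x_imputed)       # 56.45   - the mean is untouched
var(x, na.rm = TRUE)  # 38.93667 - variance of the observed values
var(x_imputed)        # 29.2025  - understated after imputation
```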
Stochastic regression imputation
This method builds a regression model and uses it to fill in the missing values. Additionally, it adds noise to the predictions, so the imputed values do not artificially inflate the correlation between the variables. Let’s have a look:
```r
imputed <- mice(df, method = "norm.nob", m = 1, maxit = 1, seed = 42)
```
```
 iter imp variable
  1   1  heights  weights
```
The argument `method = "norm.nob"` tells mice to use the stochastic regression method.
```r
completed <- complete(imputed, 1)
completed
```
|  heights |  weights |
|----------|----------|
| 171.0000 | 55.40000 |
| 165.0000 | 48.50000 |
| 169.3624 | 58.50000 |
| 173.0000 | 63.40000 |
| 172.0000 | 61.86775 |
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cor(completed, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9240839 |
| weights | 0.9240839 | 1.0000000 |
Now, this is better. Stochastic regression imputation is a nice step in the right direction, but it does not solve all our problems. However, the idea of using the values of other variables is fundamental to more advanced techniques.
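Under the hood the idea is simple: regress the incomplete variable on the others, predict the missing entries, and add noise drawn from the residual distribution. A hand-rolled sketch for a single variable (not mice’s exact implementation):

```r
set.seed(42)
df <- data.frame(
  heights = c(171, 165, NA, 173, 172),
  weights = c(55.4, 48.5, 58.5, 63.4, NA)
)

# Fit on the complete rows (lm drops rows containing NA by default)
fit  <- lm(weights ~ heights, data = df)
miss <- is.na(df$weights) & !is.na(df$heights)

# Prediction plus random noise scaled by the residual standard error -
# the noise is what makes the imputation "stochastic"
df$weights[miss] <- predict(fit, newdata = df[miss, ]) +
  rnorm(sum(miss), sd = summary(fit)$sigma)
df
```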
A couple of big questions remain open. How can we tell whether an imputation is good? Can we improve on these simple methods? Let’s try to answer those in the next post.