— Data Imputation, R — 4 min read
Let’s begin with you trying to watch this video:
This video has 300k+ views and around 1.6k positive reactions. It might be worth watching, even though it is in a language that will sound cryptic to most of you (know Bulgarian? That’s what it feels like to have a serious superpower!). But you have a problem, right? You don’t know what the video is all about!
Well, you are missing some data. In your case, subtitles might be a good start. However, knowing the language and experiencing the emotions the actor is trying to convey would be much better! So, it is not enough to have some data; it is best to have the real thing.
Let’s try to translate the problem into R
Let’s say we are trying to find the mean weight of all women in New York City. We weigh 1,000 women and write down their weights. The first 5 are as follows:
```r
weights_sample <- c(55.4, 48.5, 58.5, 63.4, 67.4)
```
We can easily get the mean of that sample:
```r
mean(weights_sample)
```
```
[1] 58.64
```
But let’s say that the last value was missing and we are trying to get the mean of the sample:
```r
weights_missing <- c(55.4, 48.5, 58.5, 63.4, NA)
mean(weights_missing)
```
```
[1] NA
```
Huh? The result is undefined. What did you expect? Can we hack our way around this? Of course, let’s be sneaky and just ignore the missing value.
```r
mean(weights_missing, na.rm = TRUE)
```
```
[1] 56.45
```
Yep, we’ve done it! Good job! And this is the end of the post. Or is it?
Just imagine this title in your favorite newspaper tomorrow morning:
Scientists found that the average weight of a woman in New York City is 56.45 kilograms.
Is this correct? Can we just ignore the values that are missing? And the answer is - it depends.
So, what is the best way to handle missing data? As Orchard and Woodbury (1972, p. 697) remarked:
Obviously, the best way to treat missing data is not to have them.
Of course, they knew that this was an impossible ideal to achieve in practice.
Allison (2002, p. 1) observed that:
Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data.
So, how should we deal with this?
Before answering that question, we have to familiarize ourselves with some terminology.
Rubin (1976) defined three categories of missing data problems. He stated that every data point has some probability of being missing. The process that governs these probabilities is called the missing data mechanism or response mechanism. A model for this process is called a missing data model or response model.
When every data point has the same probability of being missing, the data are said to be missing completely at random (MCAR). For example, let’s say we go to every woman in NYC, flip a coin, and weigh her only if the coin shows heads. Each data point will then have the same probability of being missing, namely the probability of the coin landing heads (0.5).
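The coin-flip mechanism is easy to simulate. A minimal sketch, assuming a small hypothetical sample of weights (the vector and the 0.5 probability are illustrative, not real data):

```r
set.seed(42)

# A full (hypothetical) sample of weights in kilograms
weights_full <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MCAR: every value has the same probability of being missing
# (the coin flip), independent of the data itself
coin_heads   <- rbinom(length(weights_full), size = 1, prob = 0.5) == 1
weights_mcar <- ifelse(coin_heads, weights_full, NA)
weights_mcar
```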
If the probability of being missing is the same only within groups defined by the observed data, then the data are missing at random (MAR). Let’s take our previous example and modify it a bit: we weigh a woman only if she is between 30 and 40 years of age. Within that age group, every weight has the same probability of being observed; outside it, the weights are always missing. Since age is observed, the mechanism is MAR. MAR is a generalization of MCAR and is thus much more useful in practice.
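The age-restricted version can be sketched like this (the ages are assumed for illustration; the key point is that missingness depends only on the *observed* age, not on the weight itself):

```r
# Observed ages (assumed for illustration) alongside the weights
ages        <- c(25, 34, 52, 38, 61)
weights_all <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MAR: whether a weight is missing depends only on the observed age -
# women outside the 30-40 group are never weighed
in_group    <- ages >= 30 & ages <= 40
weights_mar <- ifelse(in_group, weights_all, NA)
weights_mar  # 48.5 and 63.4 observed, the rest missing
```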
You might already know what this category represents. Missing not at random (MNAR, also written "not missing at random") means that the probability of being missing depends on information we did not observe, possibly on the missing values themselves. For example, the heavier ladies from the example above might not be so eager to weigh themselves, and we might have no way of knowing that.
MNAR is really hard to handle in practice. Strategies for this case include finding the cause of the missing data, or formulating hypotheses (what-if analyses) and checking how sensitive the results are to them.
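The "heavier ladies skip the scale" scenario can be simulated too. A sketch, assuming (purely for illustration) that the probability of skipping rises with the unobserved weight itself:

```r
set.seed(42)

weights_all <- c(55.4, 48.5, 58.5, 63.4, 67.4)

# MNAR: the probability of being missing depends on the unobserved
# value itself - heavier women are more likely to skip the scale
p_missing    <- plogis(weights_all - 60)  # heavier => more likely missing
skip         <- rbinom(length(weights_all), size = 1, prob = p_missing) == 1
weights_mnar <- ifelse(skip, NA, weights_all)
weights_mnar
```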
Listwise deletion
Just delete the rows that have missing data. It is a really convenient way to handle the issue. Under MCAR this might be an acceptable thing to do. However, Schafer and Graham (2002) demonstrated, using an elegant simulation, the bias of listwise deletion under MAR and MNAR.
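In code, listwise deletion is a one-liner. A base-R sketch (the heights vector is assumed for illustration; `tidyr::drop_na()` does the same for tibbles):

```r
df <- data.frame(
  heights = c(171, 165, NA, 173, 172),
  weights = c(55.4, 48.5, 58.5, 63.4, NA)
)

# Listwise (complete-case) deletion: drop every row containing an NA
df_complete <- na.omit(df)
nrow(df_complete)  # 3 of the 5 rows survive
```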
Pairwise deletion
This method tries to be a bit smarter than listwise deletion. Pairwise deletion works by calculating the means and (co)variances on all observed data. Given two variables X and Y, the mean of X is based on the observed X values, and the same goes for Y. To calculate the correlation and covariance, pairwise deletion takes all rows where both X and Y are present.
```r
library(tidyverse)

heights_missing <- c(171, 165, NA, 173, 172)
df <- tibble(heights = heights_missing, weights = weights_missing)
```
```r
lapply(df, mean, na.rm = TRUE)
```
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cov(df, use = "pair")
```
|         |  heights |  weights |
|---------|----------|----------|
| heights | 12.91667 | 29.43333 |
| weights | 29.43333 | 38.93667 |
This method is an easy way to compute the mean, correlations, and covariances under MCAR. However, the estimates can be biased when the data is not MCAR.
The big selling point of pairwise deletion is that it tries to use all available data.
Mean imputation
This one is really simple: replace every missing value by the mean of the observed values. Let’s try it:
```r
library(mice)

imputed <- mice(df, method = "mean", m = 1, maxit = 1)
```
```
 iter imp variable
  1   1  heights  weights
```
```r
completed <- complete(imputed, 1)
completed
```
| heights | weights |
|---------|---------|
|  171.00 |   55.40 |
|  165.00 |   48.50 |
|  170.25 |   58.50 |
|  173.00 |   63.40 |
|  172.00 |   56.45 |
Let’s have another look at the initial data and the correlations:
```r
df
```
| heights | weights |
|---------|---------|
|     171 |    55.4 |
|     165 |    48.5 |
|      NA |    58.5 |
|     173 |    63.4 |
|     172 |      NA |
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cor(completed, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.8927452 |
| weights | 0.8927452 | 1.0000000 |
The correlations are different. That’s no good. Granted, this is a bit of a contrived example, but you can see how this can happen in the real world.
Mean imputation offers a simple and fast fix for missing data. However, it will bias any estimate other than the mean when data are not MCAR.
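One way to see the bias directly: mean imputation preserves the mean but shrinks the variance, because every imputed value sits exactly at the center of the distribution. A quick check on the weights vector from above:

```r
x <- c(55.4, 48.5, 58.5, 63.4, NA)

# Replace the missing value by the mean of the observed values
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

mean(x_imputed)       # 56.45   - the mean is untouched
var(x, na.rm = TRUE)  # 38.93667 - variance of the observed values
var(x_imputed)        # 29.2025  - understated after imputation
```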
Stochastic regression imputation
This method builds a regression model and uses it to fill in the missing values. Additionally, it adds noise to the predictions, so the imputed values do not artificially inflate the correlation between the variables. Let’s have a look:
```r
imputed <- mice(df, method = "norm.nob", m = 1, maxit = 1, seed = 42)
```
```
 iter imp variable
  1   1  heights  weights
```
The argument `method = "norm.nob"` tells mice to use the stochastic regression method.
```r
completed <- complete(imputed, 1)
completed
```
|  heights |  weights |
|----------|----------|
| 171.0000 | 55.40000 |
| 165.0000 | 48.50000 |
| 169.3624 | 58.50000 |
| 173.0000 | 63.40000 |
| 172.0000 | 61.86775 |
```r
cor(df, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9480866 |
| weights | 0.9480866 | 1.0000000 |
```r
cor(completed, use = "pair")
```
|         |   heights |   weights |
|---------|-----------|-----------|
| heights | 1.0000000 | 0.9240839 |
| weights | 0.9240839 | 1.0000000 |
Now, this is better. Stochastic regression imputation is a nice step in the right direction, but it does not solve all our problems. However, the idea of using the values of other variables is fundamental to more advanced techniques.
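Under the hood the idea is simple: regress the incomplete variable on the others, predict the missing entries, and add noise drawn from the residual distribution. A hand-rolled sketch for a single variable (not mice’s exact implementation):

```r
set.seed(42)
df <- data.frame(
  heights = c(171, 165, NA, 173, 172),
  weights = c(55.4, 48.5, 58.5, 63.4, NA)
)

# Fit on the complete rows (lm drops rows containing NA by default)
fit  <- lm(weights ~ heights, data = df)
miss <- is.na(df$weights) & !is.na(df$heights)

# Prediction plus random noise scaled by the residual standard error -
# the noise is what makes the imputation "stochastic"
df$weights[miss] <- predict(fit, newdata = df[miss, ]) +
  rnorm(sum(miss), sd = summary(fit)$sigma)
df
```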
A couple of big questions remain open. How can we tell whether an imputation is good? Can we improve on these simple methods? Let’s try to answer those in the next post.