What to do when data is missing? - Part I

Let’s begin by you trying to watch this video:

This video has 300k+ views with around 1.6k positive reactions. It might be worth watching, considering it is in some cryptic foreign language (Know Bulgarian? That’s how it feels to have some serious superpowers!). But you have a problem, right? You don’t know what the video is all about!

Well, you are missing some data. In your case, subtitles might be a good start. However, knowing the language, experiencing the emotions that the actor is trying to convey might be much better! So, it is not enough to have some data, it is best to have the real thing.

1. Problem definition

Let’s try to translate the problem into R

Let’s say we are trying to find the mean weight of all women in New York City. We weight 1000 women and write down their weights. The first 5 begin as follows:

weights <- c(55.4, 48.5, 58.5, 63.4, 67.4)

We can easily get the mean of that sample:

mean(weights_sample)

58.64

But let’s say that the last value was missing and we are trying to get the mean of the sample:

weights_missing <- c(55.4, 48.5, 58.5, 63.4, NA)
mean(weights_missing)
[1] NA

Huh? The result is undefined. What did you expect? Can we hack our way around this? Of course, let’s be sneaky and just ignore the missing value.

mean(weights_missing, na.rm = TRUE)

56.45

Yep, we’ve done it! Good job! And this is the end of the post. Or is it?

Just imagine this title in your favorite newspaper tomorrow morning:

Scientists found that the average weight of a woman in New York City is 56.45 kilograms.

Is this correct? Can we just ignore the values that are missing? And the answer is - it depends.

2. Come on, this can’t be that hard!

So, what is the best way to handle missing data? As Orchard and Woodbury (1972, p. 697) remarked:

Obviously, the best way to treat missing data is not to have them.

Of course, they knew that this was impossible ideal to achieve in practice.

Allison (2002, p. 1) observed that:

Sooner or later (usually sooner), anyone who does statistical analysis runs into problems with missing data.

So, how should we deal with this?

2.1 MCAR, MAR and MNAR

Before answering the question above we have to familiarize ourselves with the strange terms described above.

Rubin (1976) defined three categories of missing data problems. He stated that every data point has some probability of being missing. What defines those probabilities is a process called the missing data mechanism or response mechanism. A model for this process is called missing data model or response model.

2.1.1 MCAR

When the probability of a data point being missing is the same, the data are said to be missing completely at random (MCAR). For example, let’s say we go to every woman in NYC, flip a coin and weight her only if the coin show heads. Each data point in this will have the same probability of being missing, e.g. the probability of the coin landing heads.

2.1.2 MAR

If the probability of being missing is the same only within the groups defined by the observed data, then the data are missing at random (MAR). Let’s take our previous example and modify it a bit. Let’s say we weigh only women that are between 30 and 40 years of age.

MAR is a generalization of MCAR. Thus, much more useful in practice.

2.1.3 MNAR (or NMAR)

You might already know what this category represents. Missing not at random (or Not missing at random) means that the probability of being missing is unknown. For example, the heavier ladies from the example above might not be so eager to weigh themselves, but this might not be known by us.

MNAR is really hard to handle in practice. Strategies to handle this case include finding the cause of missing data or provide hypothesis (what-if analysis) and see how sensitive the results are.

2.2 Simple fixes

2.2.1 Listwise deletion

Just delete the rows that have missing data. It is a really convenient way to handle the issue. Under MCAR this might be an ok thing to do. However, Schafer and Graham (2002) demonstrated using an elegant simulation the bias of listwise deletion under MAR and MNAR.

2.2.2 Pairwise deletion (Available-case analysis)

This method tries to be a bit smarter than listwise deletion. Pairwise deletion works by calculating the means and (co)variances on all observed data. Given two variables and , the mean of is based on the observed values, the same goes for . To calculate correlation and covariance, pairwise deletion takes all data rows where and are present.

library(tidyverse)

heights_missing <- c(171, 165, NA, 173, 172)
df <- tibble(heights = heights_missing, weights = weights_missing)
lapply(df, mean, na.rm = T)
$heights
170.25
$weights
56.45
cor(df, use = "pair")
heightsweights
heights1.00000000.9480866
weights0.94808661.0000000
cov(df, use = "pair")
heightsweights
heights12.9166729.43333
weights29.4333338.93667

This method is an easy way to compute the mean, correlations, and covariances under MCAR. However, the estimates can be biased when the data is not MCAR.

The big selling point of pairwise deletion is that it tries to use all available data.

2.2.3 Mean imputation

This one is real simple. Replace every missing value by the mean. Let’s try it:

library(mice)

imputed <- mice(df, method = "mean", m = 1, maxit = 1)
 iter imp variable
  1   1  heights  weights
completed <- complete(imputed, 1)
completed
heightsweights
171.0055.40
165.0048.50
170.2558.50
173.0063.40
172.0056.45

Let’s have another look at the initial data and the correlations:

df
heightsweights
171 55.4
165 48.5
NA 58.5
173 63.4
172 NA
cor(df, use = "pair")
heightsweights
heights1.00000000.9480866
weights0.94808661.0000000
cor(completed, use = "pair")
heightsweights
heights1.00000000.8927452
weights0.89274521.0000000

The correlations are different. That’s no good. Granted, this is a bit contrived example, but you can see how this can happen in the real world.

Mean imputation offers a simple and fast fix for missing data. However, it will bias any estimate other than the mean when data are not MCAR.

2.2.4 Stochastic regression imputation

This method creates regression model and uses it for completing missing values. Additionally, it adds noise to the predictions, thus reducing the correlation between the variables. Let’s have a look:

imputed <- mice(df, method = "norm.nob", m = 1, maxit = 1, seed = 42)
 iter imp variable
  1   1  heights  weights

method = "norm.nob" tells mice to use the stochastic regression method.

completed <- complete(imputed, 1)
completed
heightsweights
171.000055.40000
165.000048.50000
169.362458.50000
173.000063.40000
172.000061.86775
cor(df, use = "pair")
heightsweights
heights1.00000000.9480866
weights0.94808661.0000000
cor(completed, use = "pair")
heightsweights
heights1.00000000.9240839
weights0.92408391.0000000

Now, this is better. Stochastic regression imputation is a nice step in the right direction, but it does not solve all our problems. However, the idea of using other variable values is fundamental for more advanced techniques.

3. Where to go from here?

A couple of big questions remains still open. How can we tell if the imputation is good? Can we improve on those simple methods? Let’s try to answer those in the next post.

References

  • Orchard, Terence and Woodbury, Max A and others. 1972. A missing information principle: theory and applications.

  • Allison, Paul D. 2001. Missing data.

  • Rubin, Donald B. 1976. Inference and missing data.

  • Schafer, Joseph L and Graham, John W. 2002. Missing data: our view of the state of the art..