Missing Values in a Live Prediction Model (Take 2)

[ prediction  machine-learning  deployment  easi  ]

Ok, so it’s been about a week of reading, thinking, toying around… My original objective was to look into various ways of treating missing values in categorical variables, with an eye towards deploying the final predictive model. After reading over several ideas that also cover continuous variables (e.g., Perlich’s missing-indicator and clipping techniques), I’ve re-scoped a bit: missing values in general, specifically for predictive models.

I found that, if you look hard enough, people definitely have thought about missing values in prediction models in practice. Not that I assumed this wasn’t the case; it was just hard to find.

What surprised me is how much generic advice about complete case analysis (CCA), mean imputation, etc., gets handed out across StackExchange and Quora posts and various blogs. I read so many bits where the writer wouldn’t even bat an eye when recommending CCA for dealing with missing data when developing a prediction model… Like, when the fuck does the writer think the prediction model is going to be used? On a castrated, CCA’d-ass test set? Hardly! In production, the model will be fed records with missing values, and dropping them isn’t an option.

Anyway, I didn’t want to distort my first article by continually updating it… it needed to stay journaled as-is, with the update coming in this sequel. Today I specifically sought out a bunch of articles where the authors very much keep in mind that someone, somewhere will ultimately want to use the predictive model in an environment with a data-generating process similar to the one that produced the original data set – missing values and all!

For context, my original article was motivated out of curiosity: I was working with some data, and my approach was to treat missing values in a nominal categorical variable as just another level. But I questioned whether this was really a great approach… I wanted to see what others had done. I kept reading about listwise deletion and CCA, and it made my eyes cross: why are so many people recommending these things when talking about predictive models? There was a particular Quora question where the first several answers gave the standard advice, and then came Claudia Perlich’s answer: yes, she was actually answering the question that was asked!!!!!!!!
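(For the curious: “missing as another level” is about as simple as it sounds. A minimal pandas sketch, with a made-up column name and levels:)

```python
import pandas as pd

# Toy column: a nominal categorical with some missing values.
# (Column name and levels are made up for illustration.)
df = pd.DataFrame({"contract_type": ["month", None, "annual", "month", None]})

# Treat missingness itself as just another level of the category.
df["contract_type"] = df["contract_type"].fillna("MISSING").astype("category")

print(df["contract_type"].value_counts())
# MISSING and the real levels now sit side by side.
```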

Below, I just include a list of the references I’ve found, sometimes with a note or two. Over time, I hope to continue this series of articles by providing some summaries of these references.

Side Note: Multiple Imputation

As I found in previous articles, various implementations of multiple imputation seem to be considered “best in class” – but, if you read more carefully, that verdict is specifically geared toward non-predictive settings, like statistical inference, unbiased population estimates, causal modeling, etc.

It is not obvious or clear that this “best in class” title carries over to the predictive setting. For example, multiple imputation fills each missing value several times, producing multiple “completed” copies of the data set; this preserves the variance of the imputed variables (as opposed to something like median imputation, which shrinks it).

But when would we want to create multiple records in prediction? Maybe to generate multiple predictions, then aggregate them (e.g., take the mean).
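To make that concrete, here’s a minimal sketch of the “impute several times, predict on each, aggregate” idea at prediction time. I’m using scikit-learn’s (experimental) IterativeImputer with sample_posterior=True as a MICE-ish stand-in; the data, the model, and the number of draws are all made up for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy training data: two features, ~20% of entries missing at random.
X_train = rng.normal(size=(200, 2))
y_train = X_train @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=200)
X_train[rng.random(X_train.shape) < 0.2] = np.nan

# One incoming record at prediction time, with a missing value.
x_new = np.array([[np.nan, 0.5]])

# Fit the model once, on a single imputed version of the training data.
model = LinearRegression().fit(
    IterativeImputer(random_state=0).fit_transform(X_train), y_train
)

# The "multiple records" part: complete the new record several times
# (sample_posterior=True draws from the fitted imputation model rather
# than returning point estimates), predict on each completion, then
# aggregate by taking the mean.
preds = []
for seed in range(10):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imp.fit(X_train)                 # learn the imputation model from training data
    x_filled = imp.transform(x_new)  # one plausible completion of the record
    preds.append(model.predict(x_filled)[0])

print(np.mean(preds))  # aggregated prediction
```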

Seems like the R package MICE, used for predictive mean matching (PMM), gets a lot of attention and callouts (I sketch the PMM idea right after these links):

  • e.g., https://statistical-programming.com/predictive-mean-matching-imputation-method/
  • eBook: https://stefvanbuuren.name/fimd/sec-pmm.html
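For anyone who, like me, hadn’t dug into PMM before: the gist is to regress the incomplete variable on the complete ones, predict a mean for every case, and then fill each missing value with an observed value “donated” by one of the k cases whose predicted means are closest. Here’s a hand-rolled sketch of that matching step in Python (my own toy illustration, not the mice package’s actual algorithm):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pmm_impute(x, y, k=5, rng=None):
    """Fill NaNs in y via predictive mean matching against predictors x.

    For each missing y, find the k observed cases whose predicted means
    are closest to the missing case's predicted mean, then borrow one of
    their *observed* values at random. (Toy illustration only.)
    """
    rng = rng or np.random.default_rng()
    obs = ~np.isnan(y)
    model = LinearRegression().fit(x[obs], y[obs])
    yhat_obs = model.predict(x[obs])    # predicted means, observed cases
    yhat_mis = model.predict(x[~obs])   # predicted means, missing cases
    donor_vals = y[obs]
    fills = []
    for pred in yhat_mis:
        donors = np.argsort(np.abs(yhat_obs - pred))[:k]  # k nearest donors
        fills.append(donor_vals[rng.choice(donors)])      # borrow a real value
    y_filled = y.copy()
    y_filled[~obs] = fills
    return y_filled

# Toy usage: a numeric column with ~30% missing, two complete predictors.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=100)
y[rng.random(100) < 0.3] = np.nan
print(pmm_impute(x, y, k=5, rng=rng)[:10])
```

Note that every imputed value is a real observed value, which is part of PMM’s appeal: you never fill in something the variable couldn’t actually take.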

So, given that combining an imputation method with missing indicators often works best for prediction, I want to see whether “MICE + indication” is the best in that class.
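For a concrete picture of “imputation + indication”: scikit-learn’s SimpleImputer can append the indicator columns itself via add_indicator (here with plain median imputation standing in for MICE, and a made-up toy matrix):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

# Median-impute each column AND append one binary "was missing" flag
# per feature that had missing values -- the "indication" part.
imputer = SimpleImputer(strategy="median", add_indicator=True)
print(imputer.fit_transform(X))
# Columns: [feature 1 imputed, feature 2 imputed, flag 1, flag 2]
```

If I’m reading the docs right, IterativeImputer accepts the same add_indicator flag, so a MICE-ish “impute + indicate” setup is roughly one keyword away.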


Reading List

Missing Value Misc

Differences between the explanatory and predictive modeling settings

Some Blogs & Stuff

Written on March 29, 2019