The best way to handle missing data

Which imputation technique gives you the best results when handling missing data of multiple types?

Devansh
8 min read · Apr 13, 2023

Missing data is a very common issue when working with machine learning data in the real world. Sensors break. Invalid data gets recorded. Survey responses are left incomplete. A lot can go wrong. So what do we do? We can drop the incomplete samples. But what if we end up with a tiny dataset? What if we drop very important samples? When I work with data, I almost never drop data points. At worst, keeping them adds some noise to your learning, which is probably better in the long run for building generalizable models.

The datasets used. Note the use of all 3 attribute types.

Instead, I prefer imputing the missing data. This just means filling in the missing values using some rules. Your specific imputation policy is determined by a lot of factors. The authors of the paper “A computational study on imputation methods for missing environmental data” compare three different data imputation policies to find the best one. In this article, I will talk about interesting findings from the paper. I will also share the strengths of the experimental setup that you should carry into your own machine learning projects. Let me know which of the points was most interesting to you in the comments below (or through DMs). I'd love to learn what stuck out to you.

A computational study was performed to compare the method missForest (MF) with two other imputation methods, namely Multivariate Imputation by Chained Equations (MICE) and K-Nearest Neighbors (KNN). Tests were made on 10 pretreated datasets of various types. Results revealed that MF generally outperformed MICE and KNN in terms of imputation errors, with a more pronounced performance gap for mixed typed databases where MF reduced the imputation error up to 150%, when compared to the other methods.

-A summary of the paper and its results


The Positives

Following are some of the things the team did that you should do in your own projects and whitepapers.

Defining the Problem + Constraints Clearly

One of the best things you can do for your machine learning projects is to sketch out every challenging aspect. Mention what the challenge is, why it's problematic, and what you would consider an acceptable solution. This gives your project a lot of clarity. For example, the paper explains the challenges of working with environmental data very well. In the words of the authors:

“Organizing environmental data in well-structured databases is a challenging task (Blair et al., 2019). On the one hand, the natural environment is impacted by human activities, and this calls for interdisciplinary research and analysis. On the other hand, natural phenomena cover different time and spatial scales and are generally interconnected, which makes data integration difficult. This typically results in heterogeneous data sources and generally gives rise to databases of a mixed nature, with both qualitative and quantitative entries.”

Identifying the stumbling blocks can help in designing the solutions. Alternatively, you can make some simplifying assumptions, and just make a note of the complexities (we did this a lot for my work with modeling global supply chains and using the models to predict supplier risk based on past behavior and financial/economic signals). Whichever route you take, having clearly defined challenges helps you create the solution.

Clearly defining constraints/challenges also helps other people understand your thought process when working in teams. This makes collaboration more effective, which is why these definitions belong in your asynchronous communication tools, such as your documentation. If you'd like to create better documentation, this guide will help you.

Accounting for Variance

Datasets can vary a lot, both in the percentage of data that is missing and in the nature/distribution of the features tracked. The authors of this paper acknowledged this and accounted for both. When describing phase 1 of the paper, they had this to say about the experimental setup: “we selected 10 datasets from various sources in the literature and artificially obtained various degrees of missing data by randomly removing some of the entries. The set of selected databases was chosen to be representative of the typical characteristics found when analyzing environmental data, such as varying dimensions, as well as heterogeneous data types and structural features.”

Note that they account for variance in both the degree of missingness (dropping differing amounts of data) and the nature of the data (using different databases). This is extremely good practice for your own projects. Pay special attention to their practice of dropping data from complete datasets: because the true values are known, the imputed values can be compared against them directly. A rough sketch of that setup is shown below.
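To make the idea concrete, here is a minimal sketch (my own code, not the paper's) of artificially removing entries from a complete numeric dataset, so the imputation error can later be measured against the known true values:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mask_entries(X, missing_frac):
    """Randomly set a fraction of the entries of a complete array to NaN.

    Mirrors the paper's setup of artificially removing entries from
    complete datasets so imputed values can be checked against the
    originals. `missing_frac` is the target proportion of missing cells.
    """
    X_missing = X.astype(float)          # astype returns a copy
    mask = rng.random(X.shape) < missing_frac
    X_missing[mask] = np.nan
    return X_missing, mask

# Vary the degree of missingness, as the authors do.
X = rng.normal(size=(100, 5))            # stand-in for a complete dataset
for frac in (0.1, 0.2, 0.3):
    X_miss, mask = mask_entries(X, frac)
    # ...impute X_miss, then compare the imputed cells against X[mask]
```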

A good representation

You might think this is trivial. But you'd be shocked at how random ML evaluations can be, even at the highest levels. Take a look at the excellent paper “Accounting for Variance in Machine Learning Benchmarks”. It tells us that many reported ‘model improvements’ end up being random chance, not true improvement. This leads to problems when people try to integrate a model/technique from publications into their own projects and see no results.

Looking at Performance

Now to answer the question you clicked on this article for: what should you do? Overall, the paper showed missForest to be the best data imputation policy (in terms of error). As mentioned, the other methods were Multivariate Imputation by Chained Equations (MICE) (Buuren and Oudshoorn, 1999) and K-Nearest Neighbors (KNN) (Troyanskaya et al., 2001). A quick Python sketch of the three approaches follows below. The rest of this section goes into the results of the different experiments, and how these methods held up against each other in a Mortal Kombat-style death battle.

Straight outta the abstract
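If you want to try these methods yourself in Python, here is a minimal sketch. It assumes scikit-learn stand-ins rather than the R packages the paper uses: IterativeImputer with its default regressor behaves like a chained-equations (MICE-style) imputer, KNNImputer covers KNN, and IterativeImputer wrapped around a random forest approximates missForest. Note that these handle numeric features only; the paper's setup also covers categorical data.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer
from sklearn.ensemble import RandomForestRegressor

# MICE-style: chained equations with the default (Bayesian ridge) regressor.
mice_like = IterativeImputer(max_iter=10, random_state=0)

# KNN: fill each missing cell from the k most similar rows.
knn = KNNImputer(n_neighbors=5)

# missForest-style: chained equations with a random forest as the regressor.
missforest_like = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10,
    random_state=0,
)

# X_miss is a numeric array with np.nan marking the missing entries.
# X_imputed = missforest_like.fit_transform(X_miss)
```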

Qualitative Datasets

For qualitative datasets, increasing missingness increases the error, measured as PFC (proportion of falsely classified entries; see the sketch after the quote below). This is not shocking. Tic-Tac-Toe is an exception to this and should be studied because of its interesting behavior. If any of you have insight into this dataset, I would love to hear it.

The authors had this to say:

“Even if KNN is systematically the least performing IM, neither MICE nor MF stands out from the other IMs. On average over the 1000 simulations, MF is the most performing IM on “Lanza” whereas MICE outperforms MF on “Hayes” and “Tic-Tac-Toe”. However, because of the significant rise in MICE errors on the “Tic-Tac-Toe” case, it loses its advantage as the missing data percentage increases.”
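For reference, PFC is simple to compute once you know which cells were artificially removed. A minimal sketch (my own code, following the standard definition used in the missForest literature):

```python
import numpy as np

def pfc(X_true, X_imputed, mask):
    """Proportion of falsely classified entries for categorical data.

    `mask` marks the cells that were artificially removed. PFC is the
    share of those cells where the imputed category differs from the
    true one (0 = perfect, 1 = everything wrong).
    """
    return float(np.mean(X_true[mask] != X_imputed[mask]))
```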

Quantitative Datasets

The paper reports error calculations using NRMSE (normalized root mean square error) as the metric for quantitative data. MF, in general, outperforms the other policies in almost every case. The authors have interesting comments about collinearity and the trend. I would suggest reading that section to get them. I won't repeat them here to keep the article concise.
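As with PFC, here is a minimal sketch of NRMSE (my own code, following the definition the missForest authors use: RMSE over the removed cells, normalized by the variance of the true values so scores are comparable across variables with different scales):

```python
import numpy as np

def nrmse(X_true, X_imputed, mask):
    """Normalized root mean square error over the artificially removed cells."""
    err = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(err ** 2) / np.var(X_true[mask])))
```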

Mixed Data

For mixed data, a combination of PFC and NRMSE is used at varying percentages of missingness (PFC for the qualitative columns, NRMSE for the quantitative ones). We see MF standing out as a clear winner here. To quote the paper, “A comparison between the respective performances of the three IMs on the graphs of Figure 4 show that MF outperforms MICE and KNN in every case.”

Simply put, you will almost never go wrong with using missForest to impute your missing environmental data.

A note on Processing Times

The team also looked into processing times for their code. While this is generally not a concern (imputation only needs to be done once), it's still an important aspect. If you are extremely cost-constrained, this is what they discovered:

TL;DR- MICE is slow.
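If you want a rough sense of the cost on your own data, a simple wall-clock check with the scikit-learn stand-ins from the earlier sketch looks like this (again, my own assumption, not the paper's benchmark code):

```python
import time
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
X[rng.random(X.shape) < 0.2] = np.nan   # roughly 20% missing

for name, imputer in [("MICE-like", IterativeImputer(random_state=0)),
                      ("KNN", KNNImputer(n_neighbors=5))]:
    start = time.perf_counter()
    imputer.fit_transform(X)
    print(f"{name}: {time.perf_counter() - start:.2f} s")
```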

Closing

As a Forest Supremacist, I am very pleased with the results. On a more serious note, this paper has a lot to teach. I struggled with what to write because I could have done 3 different articles here. In the end, this particular topic seemed the most valuable. However, make sure you read the paper (especially the case study). The authors have done very cool work. If you want a follow-up to this, let me know.

An interesting extension to the paper could have been to evaluate the complexity of the imputation policies themselves. The Bayesian Information Criterion could have been a useful basis here as an alternative to raw runtime.
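For reference (this is the standard definition, not something from the paper), BIC penalizes model complexity through the number of estimated parameters k and the sample size n, where L-hat is the maximized likelihood; lower is better:

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```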

If you like what you read, I am now on the job market. My resume can be found over here. A quick summary of my skill set-

  • Machine Learning Engineer- I have worked on various tasks such as generative AI + text processing, modeling global supply chains, evaluating government policy (impacting over 200 Million people), and even developing an algorithm to beat Apple on Parkinson’s Disease detection.
  • AI Writer- 30K+ email subscribers, 2M+ impressions on LinkedIn, 600K+ blog post readers over 2022.

If you would like to speak more, you can reach me through my LinkedIn here.

That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
