What if I told you that a lot of ML teams are getting their model evaluations wrong?
Many teams pour most of their budget into model training, leaving little for evaluation. This leads to inaccurate evaluations and flimsy benchmarks. In the worst case, it leads to your team picking the wrong model for the task.
The publication “Accounting for Variance in Machine Learning Benchmarks” provides great recommendations for handling these issues. Here are some ways you can improve your ML model evaluations and strengthen your benchmarks.
1. Randomize as many sources of variation as possible
Good model comparisons randomize a lot of choices. Think back to the arbitrary decisions we make during the machine learning process: the random seed for initialization, the data order, how we initialize the learners, and so on. Randomizing these doesn’t make any single model better; it gives you a less noisy estimate of the pipeline’s expected performance, rather than the performance of one specific fit. To quote the paper-
“a benchmark that varies these arbitrary choices will not only evaluate the associated variance (section 2), but also reduce the error on the expected performance as they enable measures of performance on the test set that are less correlated (3). This counter-intuitive phenomenon is related to the variance reduction of bagging (Breiman, 1996a; Buhlmann et al., 2002), and helps characterizing better the expected behavior of a machine-learning pipeline, as opposed to a specific fit”
I found the comparison to bagging particularly interesting. This is why I recommend taking some time to revisit core ML concepts: it helps you spot connections like this one and understand new ideas more quickly.
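To make this concrete, here is a minimal sketch of what randomizing these arbitrary choices can look like in practice. This is not the paper’s exact protocol: it simply re-runs one pipeline while varying the seed that controls weight initialization and the data order, then averages the test scores to estimate expected performance instead of reporting a single fit. The dataset and model are stand-ins.

```python
# Sketch: randomize arbitrary choices (init seed, data order) and average
# the resulting test scores to estimate the pipeline's expected performance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = []
for seed in range(10):                      # 10 arbitrary-choice settings
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X_train))   # randomize the data order
    model = MLPClassifier(
        hidden_layer_sizes=(32,),
        random_state=seed,                  # randomize weight initialization
        max_iter=300,
    )
    model.fit(X_train[order], y_train[order])
    scores.append(model.score(X_test, y_test))

print(f"expected accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```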
2. Use Multiple Data Splits
Most people use a single train-validation-test split. They split their data once and are done with it. More industrious people might also run some cross-validation. I would recommend also playing around with the ratios used to build the sets. In the words of the team, “For pipeline comparisons with more statistical power, it is useful to draw multiple tests, for instance generating random splits with a out-of-bootstrap scheme (detailed in appendix B).”
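As a rough illustration of drawing multiple tests, here is one way an out-of-bootstrap style evaluation can be set up: sample the training indices with replacement and score on the points that were left out, repeating the process to get a distribution of test scores instead of a single number. The exact scheme in the paper’s appendix B may differ, and the dataset and model below are placeholders.

```python
# Sketch: repeated out-of-bootstrap splits -> a distribution of test scores.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.RandomState(0)
scores = []
for _ in range(30):                                  # 30 random splits
    boot = rng.randint(0, len(X), size=len(X))       # bootstrap sample (train)
    oob = np.setdiff1d(np.arange(len(X)), boot)      # left-out points (test)
    model = LogisticRegression(max_iter=1000).fit(X[boot], y[boot])
    scores.append(model.score(X[oob], y[oob]))

print(f"accuracy over splits: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```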
3. Account for variance to detect meaningful improvements
Always remember that there is a degree of randomness in your results. Running multiple tests is one way to reduce it, but it will never go away completely unless you exhaust every possible permutation (which is often impossible, and definitely needlessly expensive). Minor improvements might just be the result of random chance, so when comparing models, keep a few close-performing ones on hand rather than crowning a winner on a tiny margin.
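One simple way to act on this, sketched below, is to evaluate both candidate pipelines on the same set of random splits and only treat the gap as meaningful if it is large relative to the split-to-split variance. The paired t-test here is just one convenient check, not necessarily the procedure the paper recommends, and the models and data are stand-ins.

```python
# Sketch: compare two pipelines over the same random splits and check whether
# the mean difference stands out against the run-to-run variance.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

scores_a, scores_b = [], []
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
for train_idx, test_idx in splitter.split(X):
    a = RandomForestClassifier(n_estimators=100, random_state=0)
    b = GradientBoostingClassifier(random_state=0)
    scores_a.append(a.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))
    scores_b.append(b.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))

diff = np.array(scores_a) - np.array(scores_b)
t, p = stats.ttest_rel(scores_a, scores_b)
print(f"mean difference: {diff.mean():.3f} (std {diff.std():.3f}), paired t-test p={p:.3f}")
```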
To learn more about this publication and How to Properly Compare Machine Learning Models-