Why you should analyze the distribution of your Data

One of the most underrated steps when building intelligent systems

Mar 21, 2023


Recently, something very interesting happened to me:

LinkedIn invited me to contribute to their advice pieces geared at helping developers and other tech people get better.

This is obviously a huge W for me, and I’m excited to see the opportunities it presents. As I looked through some of the articles they had asked me to contribute to, one stood out: “How do you apply data transformations to improve model performance in EDA?” (go engage with the article to support me). However, the character limit for contributions was too small, so I thought I’d write a more fleshed-out piece here.

In this email/post, we will discuss why analyzing the distribution of your data is important, the common techniques to do so, and some common mistakes people make. This post will be longer than usual since we will be highlighting many important ideas. I will elaborate upon the points mentioned here in further articles.

Why is analyzing the distribution of your data important?

Analyzing the distribution of your data is important because it can have a significant impact on the results of your analysis. Understanding the distribution of your data can help you choose the right statistical test, identify outliers, check for normality, and visualize the data. By understanding the distribution of your data, you can ensure that your results are accurate, reliable, and valid.

It can also serve another function, one that is often overlooked by data scientists: confirming that your data collection systems work. If you have ground truth about what your distribution is supposed to look like, checking the distribution of your collected data can help you identify both data drift and problems in your collection systems. Both of these problems can slip under the radar and quietly corrupt everything downstream. Comparing the distribution of the data you collected against the ground truth is a great way to confirm that your systems are working as they should.
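To make the comparison against ground truth concrete, here is a minimal sketch that measures the largest gap between the empirical CDFs of a reference sample and a newly collected sample (the two-sample Kolmogorov-Smirnov statistic). The data, the `ks_statistic` helper, and the `ALERT_THRESHOLD` cutoff are all made up for illustration; in practice you would more likely reach for a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value.

```python
import bisect
import random


def ks_statistic(reference, collected):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    ref = sorted(reference)
    col = sorted(collected)
    gap = 0.0
    for value in ref + col:
        # Fraction of each sample that is <= value
        ref_cdf = bisect.bisect_right(ref, value) / len(ref)
        col_cdf = bisect.bisect_right(col, value) / len(col)
        gap = max(gap, abs(ref_cdf - col_cdf))
    return gap


rng = random.Random(0)
# Hypothetical data: a trusted reference sample and two new batches
reference = [rng.gauss(50, 5) for _ in range(2000)]
healthy = [rng.gauss(50, 5) for _ in range(2000)]   # same process
drifted = [rng.gauss(53, 5) for _ in range(2000)]   # mean has shifted

healthy_gap = ks_statistic(reference, healthy)
drifted_gap = ks_statistic(reference, drifted)

ALERT_THRESHOLD = 0.1  # hypothetical cutoff, tune it for your own data
for name, gap in [("healthy", healthy_gap), ("drifted", drifted_gap)]:
    status = "investigate" if gap > ALERT_THRESHOLD else "ok"
    print(f"{name}: KS gap = {gap:.3f} -> {status}")
```

A check like this will not tell you why the distribution moved, only that it did. That is usually enough to prompt a closer look at the collection pipeline.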

Common techniques to analyze the distribution of your data

Histograms: Histograms are graphical representations of the distribution of numerical data. They display the frequency or proportion of data points in different ranges. Histograms can give you a quick visual idea of the shape of your distribution.
Boxplots: Boxplots are graphical representations of the distribution of numerical data through their quartiles. They display the median, the upper and lower quartiles, and the minimum and maximum values (many plotting libraries cap the whiskers at 1.5 times the interquartile range and draw anything beyond them as individual points). Boxplots can show you how the data is distributed, whether it is skewed, and whether there are any outliers.
Quantile-Quantile (Q-Q) Plots: The Q-Q plot is a graphical technique for determining whether two data sets come from populations with a common distribution. It is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points should form a roughly straight line. This is very useful when you want to test whether your data follows a particular theoretical distribution, such as the normal distribution.
Descriptive statistics: Basic statistics, such as mean, median, mode, skewness, and kurtosis, can provide a summary of the distribution of your data. However, there is a problem with relying exclusively on these. Do you remember what it was? We discussed it in an earlier piece. The answer comes later in this email/post, in the mistakes section.
Statistical tests: Statistical tests, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test, can formally test the normality of your data or compare it to a known distribution.
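Here is a rough sketch of what several of the techniques above boil down to numerically, using only Python's standard library. The sample is synthetic (a right-skewed log-normal draw standing in for real data), and the bin count and variable names are arbitrary choices for illustration. It does not cover the formal tests; for those, `scipy.stats.shapiro` and `scipy.stats.kstest` are the usual choices.

```python
import random
import statistics
from collections import Counter

rng = random.Random(42)
# Hypothetical right-skewed sample standing in for real data
data = [rng.lognormvariate(0, 0.75) for _ in range(1000)]

# Descriptive statistics
mean = statistics.fmean(data)
median = statistics.median(data)
stdev = statistics.pstdev(data)
skewness = sum(((x - mean) / stdev) ** 3 for x in data) / len(data)

# Histogram: count the points that fall in each equal-width bin
n_bins = 10
low, high = min(data), max(data)
width = (high - low) / n_bins
counts = Counter(min(int((x - low) / width), n_bins - 1) for x in data)
histogram = [counts.get(i, 0) for i in range(n_bins)]

# Boxplot ingredients: quartiles plus the common 1.5 * IQR fences
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = [x for x in data if x < lower_fence or x > upper_fence]

# Q-Q points: quantiles of a fitted normal vs. the sorted observations
normal = statistics.NormalDist(mean, stdev)
ordered = sorted(data)
qq_points = [
    (normal.inv_cdf((i + 0.5) / len(ordered)), observed)
    for i, observed in enumerate(ordered)
]

print(f"mean={mean:.2f} median={median:.2f} skewness={skewness:.2f}")
print("histogram counts:", histogram)
print(f"{len(flagged)} points fall outside the boxplot fences")
print("lowest Q-Q point (theoretical, observed):", qq_points[0])
```

For a sample like this one, the mean sits above the median, the histogram piles up in the lowest bins, and the Q-Q points bend away from a straight line at the ends. All three point to the same conclusion: the data is not normally distributed. Note that the points outside the fences are only flagged here, not removed.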

Common mistakes people make when analyzing the distribution of their data

Ignoring outliers: Outliers can distort the analysis and affect the overall results. It’s important to identify and handle outliers appropriately. There are times when you might judge that certain outliers are not worth modeling (the ROI is not there). In that case, you can ignore them. But this should be a deliberate choice.
Dropping outliers: This is a cardinal sin (one of the few things I generally tell people to never do). Yes, this can improve your model performance. But what good is that performance if you can’t model real-life data? Generally, keeping outliers will help you improve generalization and robustness, which is worth the cost of worse raw performance. Once again, this is different if you have decided that the ROI of working with them is not worth it.
Using the wrong statistical test: Different statistical tests have different assumptions about the distribution of the data. If you don’t understand the distribution of your data, you may end up using the wrong statistical test, which can lead to incorrect conclusions.
Not visualizing the data: Visualizing the distribution of the data can help you identify patterns and trends that might not be apparent from just looking at the raw data. This can help you generate hypotheses, test assumptions, and identify potential problems with the data. This is the mistake I mentioned earlier, the one related to relying only on descriptive statistics. Sometimes, very different distributions can have nearly identical descriptive stats. We discussed this phenomenon in the post “When Statistics Lie. Anscombe’s Quartet [Math Mondays]”.
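To see that last point in numbers, here is a small sketch using the y-values of Anscombe's Quartet as they are commonly published. The variable names are my own. The mean and variance agree almost exactly across the four sets, while the median and the largest value already hint at how different the shapes are.

```python
import statistics

# y-values of Anscombe's Quartet, as commonly published
anscombe_y = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

summary = {}
for name, y in anscombe_y.items():
    summary[name] = {
        "mean": round(statistics.fmean(y), 2),
        "variance": round(statistics.variance(y), 1),  # sample variance
        "median": statistics.median(y),
        "max": max(y),
    }

for name, stats in summary.items():
    print(name, stats)
```

If you only looked at the mean and variance columns, you would conclude these four data sets are interchangeable. A histogram or boxplot of each one says otherwise within seconds.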

In conclusion, analyzing the distribution of your data is crucial before conducting any data analysis. It can help ensure that your methods are appropriate, your results are accurate, and your conclusions are valid. By using common techniques and avoiding common mistakes, you can improve the quality of your data analysis and make more informed decisions.

That is it for this piece. I appreciate your time. As always, if you’re interested in reaching out to me or checking out my other work, links will be at the end of this email/post. If you like my writing, I would really appreciate an anonymous testimonial. You can drop it here. And if you found value in this write-up, I would appreciate you sharing it with more people.
