A simple introduction to MIT’s Lottery Ticket Hypothesis
The Lottery Ticket Hypothesis (LTH) is the ultimate representation of the 80–20 principle in Deep Learning. It posits that within randomly initialized, dense neural networks lie subnetworks capable of achieving the same performance as the full network after training, but with significantly fewer parameters. If true, it promises smaller, faster, and more efficient models, potentially revolutionizing neural network architectures. If we could identify these winning subnetworks early (or immediately), we could train just the subnetwork directly, significantly boosting ROI.
Given the recent swing back towards “Small Language Models” (who saw that coming), I figured this would be a good time to touch on this idea in more detail.
We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. … The winning tickets we find have won the initialization lottery: their connections have initial weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of experiments that support the lottery ticket hypothesis and the importance of these fortuitous initializations. We consistently find winning tickets that are less than 10–20% of the size of several fully-connected and convolutional feed-forward architectures for MNIST and CIFAR10. Above this size, the winning tickets that we find learn faster than the original network and reach higher test accuracy.
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (Frankle & Carbin, 2019)
Unearthing the Winning Tickets:
The LTH hinges on the concept of iterative magnitude pruning: train the network, prune the lowest-magnitude weights, and repeat. The paper tests two strategies for what to do with the surviving weights before each retraining round-
“The difference between these two strategies is that, after each round of pruning, Strategy 2 retrains using the already-trained weights, whereas Strategy 1 resets the network weights back to their initial values before retraining... In all cases, Strategy 1 maintains higher validation accuracy and faster early-stopping times to smaller network sizes.”
This and other experiments from the paper support the main hypothesis in LTH- “the original initialization withstands and benefits from pruning, while the random reinitialization’s performance immediately suffers and diminishes steadily.”
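To make the procedure concrete, here is a minimal sketch of iterative magnitude pruning with weight rewinding (Strategy 1), assuming PyTorch. The training callback `train_one_round`, the number of rounds, and the per-round pruning fraction are hypothetical placeholders for illustration, not the paper's exact experimental setup.

```python
# Minimal sketch: iterative magnitude pruning + rewinding to the original init.
# `train_one_round(model, masks)` is a hypothetical callback that trains the
# model for some epochs while keeping pruned weights at zero.
import copy
import torch
import torch.nn as nn

def find_winning_ticket(model: nn.Module, train_one_round, rounds=5, prune_frac=0.2):
    init_state = copy.deepcopy(model.state_dict())  # save theta_0, the original init
    # one binary mask per weight matrix (skip biases / 1-D params)
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters() if p.dim() > 1}

    for _ in range(rounds):
        train_one_round(model, masks)  # train the (masked) network to convergence
        for name, param in model.named_parameters():
            if name not in masks:
                continue
            # find the magnitude threshold among the surviving weights of this layer
            alive = param[masks[name].bool()].abs()
            k = int(prune_frac * alive.numel())
            if k > 0:
                threshold = alive.kthvalue(k).values
                # zero out the mask for the smallest-magnitude surviving weights
                masks[name] = torch.where(param.abs() <= threshold,
                                          torch.zeros_like(masks[name]), masks[name])
        # Strategy 1: rewind the surviving weights back to their ORIGINAL values
        model.load_state_dict(init_state)
        with torch.no_grad():
            for name, param in model.named_parameters():
                if name in masks:
                    param.mul_(masks[name])
    return model, masks
```

The key line is the rewind at the end of each round: instead of keeping the trained weights (Strategy 2) or drawing a fresh random initialization, the surviving connections go back to the exact values they had at initialization, which is what makes the resulting subnetwork a "winning ticket."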
Out of scope for this paper, but if you want to benefit from random initializations to explore the search space better, your best bet is to keep things simple and use an ensemble (see the sketch below). Even an ensemble of relatively simple models can work wonders without raising your costs too much.
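As a rough illustration of that "keep it simple" approach, here is a hedged sketch of a soft-voting ensemble of cheap models. scikit-learn is used purely as an example library, and the model choices and hyperparameters are arbitrary placeholders.

```python
# Sketch: a cheap ensemble of simple models combined via soft voting.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("forest", RandomForestClassifier(n_estimators=50)),
    ],
    voting="soft",  # average each model's predicted probabilities
)
# ensemble.fit(X_train, y_train); ensemble.predict(X_test)  # hypothetical data
```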
Why Does LTH Work?
There are several possible reasons that LTH might help us find winners-
- Favorable Gradients: Certain subnetworks might possess inherent properties that lead to more favorable gradients during training, allowing them to learn efficiently despite pruning.
- Implicit Regularization: Pruning might act as an implicit regularization technique, preventing overfitting and promoting generalization.
- Data Alignment: The winning tickets could align better with the inherent structure of the data, allowing them to capture the essential information with fewer connections. This might seem strange, but remember that researchers have shown that one of the reasons trees outperform NNs on tabular data is that NNs are biased towards overly smooth solutions, and that regularization can be key in helping them learn more jagged patterns.
Further research is crucial to solidify our understanding and develop methods for reliably identifying and harnessing winning tickets. If any of you are looking for a problem to solve, take a crack at this.
Challenges and Opportunities:
Despite its potential, the LTH faces several challenges on the path to practical application. Imo, these are the most pressing:
- Scalability: Finding winning tickets can be computationally expensive, especially for large networks. Developing efficient search algorithms is crucial for broader applicability.
- Interpretability: Understanding why certain subnetworks emerge as winners remains a challenge. Making progress here will also improve our understanding of Deep Networks significantly, which will be a massive W.
Along with these, the usual suspects like adversarial robustness and stability against corrupted data apply here, just as they do to most ML challenges.
Addressing these challenges is essential for unlocking the full potential of the LTH. Recent advancements, such as using knowledge distillation to transfer knowledge from the full network to the winning ticket, offer promising avenues for progress.
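For reference, knowledge distillation is commonly implemented with Hinton-style soft targets. The sketch below assumes PyTorch and shows the generic distillation loss one might use to transfer knowledge from the full network (teacher) to the winning ticket (student); the temperature and mixing weight are illustrative defaults, not values from any specific paper.

```python
# Sketch of a standard distillation loss: soft targets from the teacher
# (the full network) plus hard labels for the student (the winning ticket).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```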
To those of you looking for deeper immersion into AI, my sister publication, AI Made Simple, would be a great tool for you.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. If you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.
Save the time, energy, and money you would burn by going through all those videos, courses, products, and ‘coaches’ and easily find all your needs met in one place at ‘Tech Made Simple’! Stay ahead of the curve in AI, software engineering, and the tech industry with expert insights, tips, and resources. 20% off for new subscribers by clicking this link. Subscribe now and simplify your tech journey!
Using this discount will drop the prices-
800 INR (10 USD) → 640 INR (8 USD) per Month
8000 INR (100 USD) → 6400 INR (80 USD) per year (533 INR/month)
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819