5 Techniques you can use to scale up large models in your systems without breaking the bank

How to train Large Models like GPT 4o efficiently [Summary]

Devansh
6 min read · May 15, 2024

Below is a summary of the important ideas from this article. Read the full piece for more details and the original source links.

Lots of people want to build big AI models like GPT-4. Let’s talk about some techniques that will let you scale up your models, improving the expressivity of your system, without driving up costs significantly.

[Image courtesy of the Pathways system, which I covered previously.]

1) Batch Size:

Increasing batch size can reduce training time and cost, but may impact generalization. It has been well noted by AI researchers that increasing batch size can hurt accuracy and generalization. There is even a well-known term for the lower generalization of large-batch training: the generalization gap. The idea that this gap is unavoidable, though, is a myth. It only shows up if you increase the batch size and do nothing else.

If you just increase batch sizes without changing anything else, your model tends to converge to sharper minima. That is the reason behind the generalization gap, as shown in the paper “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”.

This trade-off can be mitigated with techniques like “Ghost Batch Normalization”, as suggested in the paper “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”.
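To make the idea concrete, here is a minimal PyTorch-style sketch of Ghost Batch Normalization (my own illustrative version, not the paper’s code): a large batch is split into small “ghost” batches, and each ghost batch is normalized with its own statistics.

```python
import torch
import torch.nn as nn

class GhostBatchNorm1d(nn.Module):
    """Illustrative Ghost Batch Norm: normalize small chunks ("ghost batches")
    of a large batch independently instead of using full-batch statistics."""

    def __init__(self, num_features, ghost_batch_size=32):
        super().__init__()
        self.ghost_batch_size = ghost_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x):
        if self.training:
            # Split the large batch into ghost batches and normalize each one
            # with its own mean/variance (assumes each chunk has >1 example);
            # running statistics still accumulate in self.bn.
            chunks = x.split(self.ghost_batch_size, dim=0)
            return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)
        # At eval time, fall back to the accumulated running statistics.
        return self.bn(x)
```

Dropping something like this in place of a standard BatchNorm layer lets you keep the throughput benefits of a large batch while the normalization statistics behave as if you were training with small batches.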

There are also other techniques to overcome this limitation. All of these will let you maximize the savings of larger batches without giving up performance.

2) Active Learning:

Here’s a pretty simple idea: if you have a pretrained model, some data points are easier for it to model and others are harder. The harder data points carry more potential information for your model, so it makes sense to focus training on them and ignore the data points your model already finds easy. If Erling Haaland wants to graduate from being a “League 2” player, he is better off training against difficult opposition instead of me.

One great implementation of this is Meta’s “Beyond neural scaling laws: beating power law scaling via data pruning”.

Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet. Given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
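As a rough illustration of the underlying idea (a simple loss-based selection heuristic, not the paper’s self-supervised pruning metric), here is a sketch that keeps only the examples a pretrained model finds hardest:

```python
import torch
from torch.utils.data import Subset

def select_hard_examples(model, loss_fn, dataset, keep_fraction=0.5, device="cpu"):
    """Rank examples by the pretrained model's loss and keep only the hardest
    fraction for further training (a crude stand-in for a data-pruning metric)."""
    model.eval()
    losses = []
    with torch.no_grad():
        for idx in range(len(dataset)):
            x, y = dataset[idx]  # assumes (input_tensor, integer_label) items
            pred = model(x.unsqueeze(0).to(device))
            target = torch.tensor([y], device=device)
            losses.append(loss_fn(pred, target).item())
    # Sort indices from highest loss (hardest) to lowest and keep the top slice.
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    keep = order[: max(1, int(keep_fraction * len(order)))]
    return Subset(dataset, keep)
```

In practice you would recompute this ranking periodically as the model improves, since what counts as “hard” shifts during training.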

3) Increasing the Number of Tokens:

Research from DeepMind’s paper “Training Compute-Optimal Large Language Models” emphasizes the importance of balancing the number of parameters with the number of training tokens in language models to achieve better performance at a lower cost. If you’re into LLMs, I would highly recommend reading this paper because it’s generational.
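As a back-of-the-envelope illustration (a rough rule of thumb often quoted from the Chinchilla results, not the paper’s exact fitted scaling law), the compute-optimal recipe works out to roughly 20 training tokens per parameter:

```python
def compute_optimal_tokens(num_params, tokens_per_param=20):
    """Rough Chinchilla-style rule of thumb: ~20 training tokens per parameter.
    This is an approximation, not the paper's fitted scaling law."""
    return num_params * tokens_per_param

# Example: a 70B-parameter model would want on the order of 1.4 trillion tokens,
# which matches the scale Chinchilla itself was trained at.
print(f"{compute_optimal_tokens(70e9):.1e} tokens")
```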

4) Sparse Activation:

Algorithms like Sparse Weight Activation Training (SWAT) can significantly reduce computational overhead during training and inference by activating only a portion of the neural network. A 5/7, must-know idea. Let’s talk about it.

Think back to how neural networks work. When we train them, input flows through all the neurons, in both the forward and backward passes. This is why adding more parameters to a neural network drives the cost up so steeply.

Adding more neurons to our network allows our model to learn from more complex data (like data from multiple tasks and multiple modalities). However, this adds a lot of computational overhead.

For ResNet-50 on ImageNet SWAT reduces total floating-point operations (FLOPS) during training by 80% resulting in a 3.3× training speedup when run on a simulated sparse learning accelerator representative of emerging platforms while incurring only 1.63% reduction in validation accuracy. Moreover, SWAT reduces memory footprint during the backward pass by 23% to 50% for activations and 50% to 90% for weights.

Sparse activation allows for a best-of-both-worlds scenario. Adding a lot of parameters allows our model to learn more tasks effectively (and make deeper connections), while sparse activation lets you use only a portion of your network at a time, cutting down your inference costs. The network can learn and get good at multiple tasks without being too costly.
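For intuition, here is a minimal sketch of the sparse-activation idea (a simple top-k activation mask, not the actual SWAT algorithm): only the strongest activations in a layer stay “on”, so downstream computation can skip the rest.

```python
import torch
import torch.nn as nn

class TopKSparseLinear(nn.Module):
    """Illustrative sparse activation: keep only the k largest activations per
    example and zero the rest, so later layers only see a slice of the network."""

    def __init__(self, in_features, out_features, k):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.k = k

    def forward(self, x):
        h = torch.relu(self.linear(x))
        # Build a mask that keeps only the top-k activations in each row.
        _, topk_idx = h.topk(self.k, dim=-1)
        mask = torch.zeros_like(h).scatter_(-1, topk_idx, 1.0)
        return h * mask
```

A real sparse-training accelerator would skip the zeroed units entirely rather than multiplying by a mask, which is where the FLOP savings quoted above come from.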

5) Filters and Simpler Models:

Instead of relying solely on large models, it is often more efficient to use simpler models or filters to handle the majority of tasks, reserving the large model for complex edge cases. You’d be shocked how much you can accomplish with RegEx, rules, and some math.
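A minimal sketch of this cascade pattern (the regex rules and the `large_model` callable here are made-up placeholders): cheap rules answer the common cases, and only the leftovers hit the expensive model.

```python
import re

def cheap_classifier(text):
    """Hypothetical rule-based filter: handle common, easy cases with regex
    and return None when unsure."""
    if re.search(r"\b(refund|money back)\b", text, re.IGNORECASE):
        return "billing"
    if re.search(r"\b(password|login|2fa)\b", text, re.IGNORECASE):
        return "account"
    return None  # not confident -> escalate to the large model

def route(text, large_model):
    """Send only the hard, unmatched cases to the expensive large model."""
    label = cheap_classifier(text)
    return label if label is not None else large_model(text)
```

If the filter catches even half your traffic, you have halved your large-model inference bill before touching the model itself.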

By combining these strategies, we can unlock the potential of large AI models while minimizing their environmental impact and computational costs. As Amazon Web Services notes, “In deep learning applications, inference accounts for up to 90% of total operational costs”, making these optimizations crucial for widespread adoption.

Once again, to learn more about these techniques, read the full article.

If you liked this article and wish to share it, please refer to the following guidelines.

That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow.

I put a lot of effort into creating work that is informative, useful, and independent from undue influence. If you’d like to support my writing, please consider becoming a paid subscriber to this newsletter. Doing so helps me put more effort into writing/research, reach more people, and supports my crippling chocolate milk addiction. Help me democratize the most important ideas in AI Research and Engineering to over 100K readers weekly.

Help me buy chocolate milk

PS- We follow a “pay what you can” model, which allows you to support within your means. Check out this post for more details and to find a plan that works for you.

I regularly share mini-updates on what I read on the Microblogging sites X(https://twitter.com/Machine01776819), Threads(https://www.threads.net/@iseethings404), and TikTok(https://www.tiktok.com/@devansh_ai_made_simple)- so follow me there if you’re interested in keeping up with my learnings.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium. : https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
