Using Surrogate Gradients and STE in Machine Learning

Solving one of Neural Network’s Biggest Challenges

Devansh
3 min read · Aug 17, 2024

Neural Networks are very powerful, but they are held back by one huge weakness- their reliance on gradients. When building solutions for real-life scenarios, you won't always have a differentiable search space to work with, which makes gradient computation difficult or outright impossible. Let's talk about a way to tackle this-

Straight Through Estimators (STEs)

STEs address this by allowing backpropagation through functions that are not inherently differentiable. Imagine a step function, which is essential in many scenarios but whose gradient is zero almost everywhere. STEs bypass this by using the hard function on the forward pass and an approximate gradient (often just the identity) on the backward pass. It's like replacing a rigid wall with a slightly permeable membrane, allowing information to flow even where it shouldn't, mathematically speaking.
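To make this concrete, here's a minimal sketch in plain Python (no framework; all function and variable names are illustrative, not from any library). The forward pass uses a hard sign function, and the backward pass pretends it was the identity- the classic STE trick:

```python
# Minimal Straight-Through Estimator sketch (illustrative, pure Python).

def sign(x):
    """Forward pass: a hard sign/step function. Its true gradient is 0
    almost everywhere, so naive backprop would learn nothing."""
    return 1.0 if x >= 0.0 else -1.0

def ste_backward(upstream_grad):
    """Backward pass: the STE pretends sign() was the identity,
    passing the upstream gradient straight through unchanged."""
    return upstream_grad

# Toy training step: push a real-valued weight w so that sign(w) * x
# moves toward a target, even though sign() blocks true gradients.
w, x, target, lr = -0.3, 2.0, 2.0, 0.1
for _ in range(10):
    y = sign(w) * x                      # forward uses the hard function
    grad_y = 2.0 * (y - target)          # d/dy of squared error
    grad_w = ste_backward(grad_y) * x    # STE: treat sign() as identity
    w -= lr * grad_w                     # update the real-valued weight
```

Note that the weight stays real-valued during training; only the forward pass sees the binarized version. That separation is exactly what makes STE training work in practice.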

Surrogate Gradients

Similar to STEs, surrogate gradients offer a way to train neural networks with non-differentiable components. They replace the true gradient of a function with an approximation that is differentiable. This allows backpropagation to proceed through layers that would otherwise block the flow of gradient information.
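As a sketch of the idea (plain Python, names illustrative): the forward pass below keeps a hard step function, while the backward pass substitutes the derivative of a steep sigmoid- a common surrogate choice, e.g. in spiking neural network training:

```python
# Surrogate gradient sketch (illustrative, pure Python).
import math

def step(x):
    """Forward: hard threshold, true gradient is 0 almost everywhere."""
    return 1.0 if x >= 0.0 else 0.0

def surrogate_grad(x, k=5.0):
    """Backward: derivative of a steep sigmoid sigma(k*x), used as a smooth
    stand-in for the step function's true (useless) gradient."""
    s = 1.0 / (1.0 + math.exp(-k * x))
    return k * s * (1.0 - s)

# Near the threshold the surrogate gradient is large; far away it fades.
print(surrogate_grad(0.0))   # peak of the surrogate: k/4 = 1.25
print(surrogate_grad(3.0))   # near zero: input far from the threshold
```

Unlike the pure pass-through STE above, the surrogate here is input-dependent: it concentrates learning signal near the threshold, where a small change in input would actually flip the output.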

Why They Matter

These techniques are invaluable for:

  • Binarized Neural Networks: where weights and activations are constrained to be either -1 or 1, greatly improving efficiency on resource-limited devices
  • Quantized Neural Networks: where weights and activations are represented with lower precision, reducing memory footprint and computational cost
  • Reinforcement Learning: where actions might be discrete or environments might have non-differentiable dynamics
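The quantized-network case can be sketched in a few lines (plain Python, illustrative names): weights are kept in full precision, quantized on the forward pass, and the non-differentiable rounding step is treated as the identity on the backward pass:

```python
# Quantized training with an STE (illustrative, pure Python).

def quantize(x, step=0.25):
    """Forward: round to the nearest multiple of `step` (non-differentiable)."""
    return round(x / step) * step

# One gradient step on w for the loss (quantize(w) - target)**2.
w, target, lr = 0.9, 0.3, 0.5
q = quantize(w)                # forward uses the quantized weight
grad_q = 2.0 * (q - target)    # gradient w.r.t. the quantized value
grad_w = grad_q                # STE: d quantize / dw is approximated as 1
w -= lr * grad_w               # update the full-precision weight
```

After the update, quantizing w lands on a grid point closer to the target- the full-precision weight absorbs gradient information that the quantized copy alone could never carry.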

Fundamentally, straight-through estimators (STEs) and surrogate gradients serve as powerful tools that bridge the gap between the abstract world of gradients and the practical constraints of real-world problems. They unleash the full potential of neural networks in scenarios where traditional backpropagation falls short, allowing for the creation of more efficient and flexible solutions.

One powerful use-case we've recently seen has been the implementation of Matrix-Multiplication-Free LLMs, which use the straight-through estimator to handle their ternary weights and quantization. By doing so, they are able to drop their memory requirements by 61% with unoptimized kernels and by 10x in optimized settings.

Read more about MatMul Free LLMs and how they use STE over here-

If you like this article, please consider becoming a premium subscriber to AI Made Simple so I can spend more time researching and sharing information on truly important topics. We have a pay-what-you-can model, which lets you support my efforts to bring high-quality AI Education to everyone for less than the price of a cup of coffee (click here to learn more).

I provide various consulting and advisory services. If you'd like to explore how we can work together, reach out to me through any of my socials over here or reply to this email.

I regularly share mini-updates on what I read on the Microblogging sites X(https://twitter.com/Machine01776819), Threads(https://www.threads.net/@iseethings404), and TikTok(https://www.tiktok.com/@devansh_ai_made_simple)- so follow me there if you’re interested in keeping up with my learnings.

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819


