How to remove Matrix Multiplications from LLMs
MatMul-free LLMs were one of my favorite inventions of last year. They achieved roughly 10x the efficiency, very good performance, and very encouraging scaling.
Let’s learn how they did it.
Self-attention, the standard mechanism for capturing sequential dependencies in LLMs, relies on expensive matrix multiplications and pairwise token comparisons. This leads to complexity that is quadratic in the sequence length (O(n²)).
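To see where the quadratic cost comes from, here is a minimal single-head attention computation in PyTorch (no masking or batching, and the sizes are arbitrary). The score matrix alone has n × n entries:

```python
import torch

n, d = 1024, 64                       # sequence length, head dimension
Q = torch.randn(n, d)                 # queries
K = torch.randn(n, d)                 # keys
V = torch.randn(n, d)                 # values

# Every token is compared with every other token, so the score matrix
# alone has n * n entries: compute and memory grow quadratically with n.
scores = Q @ K.T / d ** 0.5           # shape: (n, n)
attn = torch.softmax(scores, dim=-1)  # still (n, n)
out = attn @ V                        # back to (n, d)

print(scores.shape)                   # torch.Size([1024, 1024])
```

Doubling the sequence length quadruples both the compute and the memory needed for that score matrix.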
The paper adapts the GRU (Gated Recurrent Unit) architecture to eliminate MatMul operations. This modified version, called the MLGRU (MatMul-free Linear GRU), updates the hidden state with element-wise operations (additions and element-wise multiplications) instead of MatMuls. A minimal sketch of the recurrence follows the key ingredients below.
Key ingredients-
Ternary weights: All the weight matrices in the MLGRU are ternary (restricted to {-1, 0, +1}), further reducing computational cost.
Simplified GRU: The MLGRU drops some of the interactions between the hidden state and the input vectors (the gates depend only on the current input), which makes the recurrence cheaper and easier to compute in parallel.
Data-dependent output gate: The MLGRU incorporates a data-dependent output gate, similar to the LSTM's output gate, to control the flow of information from the hidden state to the output.
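Here is a rough sketch of what an MLGRU-style recurrence can look like. The cell name, gate names, and activation choices are my own simplifications rather than the paper's exact equations, and plain nn.Linear stands in for the ternary BitLinear projections used in the real model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLGRUCell(nn.Module):
    """Sketch of an MLGRU-style recurrence: gates depend only on the
    current input, and the hidden state is updated with element-wise
    operations (no hidden-to-hidden matrix multiplication)."""

    def __init__(self, dim):
        super().__init__()
        # In the actual model these projections are ternary BitLinear
        # layers (see the BitLinear sketch later in this post); plain
        # nn.Linear keeps this example runnable.
        self.w_f = nn.Linear(dim, dim)   # forget gate
        self.w_c = nn.Linear(dim, dim)   # candidate state
        self.w_g = nn.Linear(dim, dim)   # data-dependent output gate
        self.w_o = nn.Linear(dim, dim)   # output projection

    def forward(self, x, h):
        # x: (batch, dim) current token, h: (batch, dim) previous hidden state
        f = torch.sigmoid(self.w_f(x))   # forget gate from input only
        c = F.silu(self.w_c(x))          # candidate from input only
        h = f * h + (1 - f) * c          # element-wise state update
        g = torch.sigmoid(self.w_g(x))   # data-dependent output gate
        return self.w_o(g * h), h        # gated output, new hidden state

x = torch.randn(2, 512)
h = torch.zeros(2, 512)
cell = MLGRUCell(512)
out, h = cell(x, h)
print(out.shape, h.shape)   # torch.Size([2, 512]) torch.Size([2, 512])
```

Note that the hidden-state update itself uses only element-wise operations, and in the actual model the four projections above would be ternary, leaving no full-precision MatMuls in the token mixer.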
The MatMul-free Channel Mixer is worth exploring further. It has-
Channel mixing: This part mixes information across the embedding dimensions. The paper replaces dense layers and their MatMuls with BitLinear layers. Because BitLinear layers use ternary weights, their "multiplications" reduce to additions, subtractions, and skipped zeros, which is much cheaper.
Gated Linear Unit (GLU): The GLU controls the flow of information through the channel mixer. It works by element-wise multiplying a gating signal with a transformed version of the input, letting the model emphasize specific parts of the input.
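As a reference point, here is a small sketch of a GLU-style channel mixer. In the actual MatMul-free model the three projections would be ternary BitLinear layers; nn.Linear, the SiLU gate activation, and the layer sizes here are placeholders to keep the example self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLUChannelMixer(nn.Module):
    """Sketch of a GLU-style channel mixer: gate * up, then project down."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        # In the MatMul-free model these would be ternary BitLinear layers.
        self.gate_proj = nn.Linear(dim, hidden_dim)  # produces the gating signal
        self.up_proj = nn.Linear(dim, hidden_dim)    # transforms the input
        self.down_proj = nn.Linear(hidden_dim, dim)  # mixes back to model dim

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))                # gating signal
        return self.down_proj(gate * self.up_proj(x))   # element-wise gating

mixer = GLUChannelMixer(dim=512, hidden_dim=1365)
y = mixer(torch.randn(2, 16, 512))
print(y.shape)   # torch.Size([2, 16, 512])
```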
Quantization: The model also quantizes activations (the outputs of each layer) to 8-bit precision, which significantly reduces memory requirements.
RMSNorm: To maintain numerical stability during training and under quantization, the model normalizes activations with RMSNorm (Root Mean Square Normalization) before quantizing them.
Surrogate gradients: Since ternary weights and quantization introduce non-differentiable operations, the model uses a surrogate gradient method (straight-through estimator) to enable backpropagation.
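Putting the last three points together, here is a rough, unfused sketch of a BitLinear-style layer: RMSNorm, then 8-bit activation quantization, then a ternary-weight projection, with the straight-through estimator letting gradients pass through the rounding steps. The class name, scaling details, and the omission of learned norm parameters are my simplifications, not the paper's exact recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ste(x_quant, x):
    # Straight-through estimator: the forward pass uses the quantized value,
    # the backward pass treats quantization as the identity function.
    return x + (x_quant - x).detach()

def rms_norm(x, eps=1e-5):
    # Root Mean Square normalization (learned scale omitted for brevity).
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class BitLinearSketch(nn.Module):
    """Unfused sketch of a BitLinear-style layer:
    RMSNorm -> 8-bit activation quantization -> ternary-weight projection."""

    def __init__(self, in_features, out_features, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.eps = eps

    def quantize_weights(self):
        # Absmean-style ternary quantization: scale, round, clip to {-1, 0, +1}.
        scale = self.weight.abs().mean().clamp(min=self.eps)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        return ste(w_q, self.weight)

    def quantize_activations(self, x):
        # Per-token absmax quantization to 8 bits (integer grid [-128, 127]).
        scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=self.eps)
        x_q = (x * scale).round().clamp(-128, 127) / scale
        return ste(x_q, x)

    def forward(self, x):
        x = self.quantize_activations(rms_norm(x, self.eps))
        # With ternary weights this projection is, in principle, only additions
        # and subtractions; it still runs as a float matmul in this sketch.
        return F.linear(x, self.quantize_weights())

layer = BitLinearSketch(512, 512)
y = layer(torch.randn(2, 16, 512))
y.sum().backward()   # gradients reach layer.weight thanks to the STE
print(y.shape)       # torch.Size([2, 16, 512])
```

In the actual model, the normalization and quantization steps are merged into a single operation, which is the "Fused BitLinear layer" point below.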
Larger learning rates: Ternary weights produce smaller gradient updates than full-precision weights, which can slow convergence or prevent it entirely. To counteract this, the paper recommends larger learning rates than those typically used for full-precision models, which speeds up updates and helps the model escape poor local minima.
LR Scheduler- “We begin by maintaining the cosine learning rate scheduler and then reduce the learning rate by half midway through the training process.”
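One way to read that schedule: keep a standard cosine decay, but multiply the learning rate by 0.5 once training passes the midpoint. A minimal sketch with PyTorch's LambdaLR, where the model, step count, and peak learning rate are placeholder values rather than the paper's settings:

```python
import math
import torch

total_steps = 10_000
model = torch.nn.Linear(512, 512)                       # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=1.5e-3)  # illustrative peak LR

def schedule(step):
    # Standard cosine decay from 1.0 towards 0.0 ...
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    # ... with the learning rate additionally halved after the midpoint.
    return cosine * (0.5 if step > total_steps // 2 else 1.0)

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=schedule)

for step in range(total_steps):
    # ... forward pass, loss.backward(), opt.step() would go here ...
    opt.step()
    sched.step()
```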
Fused BitLinear layer: This optimization combines RMSNorm and quantization into a single operation, reducing the number of memory accesses and speeding up training.
This research is very interesting, and I hope to see more like it. Drop your favorite pieces of LLM research below.
Learn more about MatMul-free LLMs here-
I put a lot of work into writing this newsletter. To do so, I rely on you for support. If a few more people choose to become paid subscribers, the Chocolate Milk Cult can continue to provide high-quality and accessible education and opportunities to anyone who needs it. If you think this mission is worth contributing to, please consider a premium subscription. You can do so for less than the cost of a Netflix Subscription (pay what you want here).
If you liked this article and wish to share it, please refer to the following guidelines.
That is it for this piece. I appreciate your time. As always, if you’re interested in working with me or checking out my other work, my links will be at the end of this email/post. And if you found value in this write-up, I would appreciate you sharing it with more people. It is word-of-mouth referrals like yours that help me grow. You can share your testimonials over here.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819