How RWKV is bringing RNNs back in NLP

With the additional computational cost of the attention mechanism, RNNs might be the way forward

Devansh
3 min read · Mar 17, 2024

Transformers have been revolutionary for LLMs, but is it the end of the road for them?

The self-attention mechanism in Transformers allowed them to reach unprecedented scales, hitting unmatched performance in both Vision and Language, where they toppled CNNs and RNNs on many benchmarks (CNNs held up much better than RNNs). However, every pro has its con, and the attention mechanism- which enabled very deep relationships between input tokens/patches- also comes with very high computational costs. This has finally caught up to them.

Some interesting recent research has explored how we can replace Transformers with more efficient architectures, and RWKV is one of the most promising candidates. RWKV is an RNN with a special linear variant of the attention mechanism, token shifting, and channel mixing- all of which enable longer-form memory and training parallelization. This allows RWKV to match the scale of Transformers while keeping the inference efficiency of RNNs.
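To make "token shifting" a little more concrete, here is a minimal NumPy sketch of the idea: instead of feeding only the current token's embedding into the receptance/key/value projections, each channel is linearly interpolated between the current and previous token. The variable names (mu_k, W_k, etc.), shapes, and initialization below are my own illustrative assumptions, not the project's actual code.

```python
import numpy as np

def token_shift(x_t, x_prev, mu):
    """Interpolate each channel between the current and previous token.

    x_t, x_prev: (d,) embeddings of the current and previous token
    mu: (d,) learned per-channel mixing weights in [0, 1]
    """
    return mu * x_t + (1.0 - mu) * x_prev

# Illustrative projections for a time-mixing block (names/shapes are assumptions).
d = 8
rng = np.random.default_rng(0)
W_r, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
mu_r, mu_k, mu_v = (rng.uniform(size=d) for _ in range(3))

x_t, x_prev = rng.standard_normal(d), rng.standard_normal(d)
r = W_r @ token_shift(x_t, x_prev, mu_r)   # receptance
k = W_k @ token_shift(x_t, x_prev, mu_k)   # key
v = W_v @ token_shift(x_t, x_prev, mu_v)   # value
```

The payoff of this design is that each position only ever looks one step back explicitly, which keeps the computation cheap while the recurrent state carries the longer history.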

Figure 7: Cumulative time on text generation for LLMs. Unlike transformers, RWKV exhibits linear scaling.

The attention mechanism, one of the driving forces behind Transformer supremacy, causes Transformers to "suffer from memory and computational complexity that scales quadratically with sequence length." This limits their scalability. On the other hand, RNNs, which are efficient for inference, tend to suffer significant performance degradation and can't be parallelized during training. RWKV (Receptance Weighted Key Value) is a new architecture that aims to hit both the performance/scale of Transformers and the efficiency of RNNs.
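To see where the RNN-style efficiency comes from, here is a simplified sketch of the recurrent form of RWKV's WKV computation: at inference time the model keeps a small running state (a decayed weighted sum of past values plus a matching normalizer), so each new token costs a constant amount of work instead of re-attending over the whole history. This sketch omits details such as the numerical-stability rescaling and the "bonus" weight given to the current token, and all names and shapes are illustrative assumptions.

```python
import numpy as np

def wkv_step(state, k_t, v_t, w):
    """One recurrent WKV update (simplified; no stability trick or bonus term).

    state: (a, b) running weighted sum of values and its normalizer, each (d,)
    k_t, v_t: key and value for the current token, each (d,)
    w: (d,) per-channel decay rate (larger w -> faster forgetting)
    """
    a, b = state
    weight = np.exp(k_t)
    out = (a + weight * v_t) / (b + weight + 1e-9)   # output for this token
    a = np.exp(-w) * a + weight * v_t                 # decay old info, add new
    b = np.exp(-w) * b + weight
    return out, (a, b)

d = 8
rng = np.random.default_rng(0)
w = rng.uniform(0.1, 1.0, size=d)
state = (np.zeros(d), np.zeros(d))

# Each token only touches the fixed-size state: linear time, constant memory.
for _ in range(5):
    k_t, v_t = rng.standard_normal(d), rng.standard_normal(d)
    out, state = wkv_step(state, k_t, v_t, w)
```

Each step here is O(d) no matter how many tokens came before, which is the linear scaling shown in the figure above; standard attention instead revisits every previous token at every step.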

According to the ML Energy Leaderboard for open-source LLMs, RWKV is by far the most efficient per token (Llama has lower energy requirements, but its responses are also much shorter).

In many ways, RWKV embodies the open-source ethos more truly than any other LLM- it's efficient, truly multilingual, and has always relied on global grassroots community support. The team is about to drop a new model, so I figured now would be a good time to cover the project and share my analysis of what makes it tick.

Read more about RWKV and how it has the potential to shake things up in NLP here:

If you liked this article and wish to share it, please refer to the following guidelines.

If you find my writing useful and would like to support it- please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. This will cost you 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day, complete refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.

Support AI Made Simple

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

