Why Microsoft trained an LLM to forget about Harry Potter

Training language and multimodal models to forget selectively is an exciting area of AI research

Devansh
3 min read · Nov 15, 2023

Can we make LLMs forget knowledge they’ve already learned? Keep reading to find out how Microsoft researchers made Meta’s Llama model forget Harry Potter.

This might seem silly at first, but it can be extremely important. If you want your model to produce answers or generations that occur rarely in the underlying data distribution, forgetting could be the way to go, and it would be more efficient than oversampling. Additionally, when it comes to copyright, private information, biased content, false data, or even toxic and harmful material, you might want to take information away from an LLM.

But how do you accomplish this? After all, unlearning isn’t as straightforward as learning. To analogize, imagine trying to remove specific ingredients from a baked cake — it seems nearly impossible. Fine-tuning can introduce new flavors to the cake, but removing a specific ingredient? That’s a tall order.

However, that is precisely what some researchers from Microsoft did. In the publication “Who’s Harry Potter? Making LLMs forget”, the authors say, “we decided to embark on what we initially thought might be impossible: make the Llama2–7b model, trained by Meta, forget the magical realm of Harry Potter.”

How did they accomplish this? The technique leans on a combination of several ideas (a rough code sketch follows the list):

1. Identifying tokens by creating a reinforced model: We create a model whose knowledge of the content to be unlearned is reinforced by further fine-tuning on the target data (like Harry Potter) and see which tokens’ probabilities have significantly increased. These are likely content-related tokens that we want to avoid generating.

2. Expression Replacement: Unique phrases from the target data are swapped with generic ones. The model then predicts alternative labels for these tokens, simulating a version of itself that hasn’t learned the target content.

3. Fine-tuning: With these alternative labels in hand, we fine-tune the model. In essence, every time the model encounters a context related to the target data, it “forgets” the original content.
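
To make that recipe concrete, here is a minimal PyTorch sketch of how steps 1 and 3 might fit together. It assumes two HuggingFace-style causal LMs, `baseline` (the original Llama2–7b) and `reinforced` (the same model further fine-tuned on the Harry Potter text), and the paper’s rule of combining their logits as v_generic = v_baseline - alpha * relu(v_reinforced - v_baseline). The expression-replacement dictionary of step 2 is omitted for brevity, and `ALPHA` plus the function names are illustrative assumptions, not the authors’ actual code.

```python
import torch
import torch.nn.functional as F

# Assumed hyperparameter: how hard to push away from reinforced tokens.
ALPHA = 1.0

@torch.no_grad()
def generic_labels(baseline, reinforced, input_ids):
    """Step 1: compare the two models' next-token logits and derive
    'generic' alternative labels that suppress the tokens whose
    probabilities the reinforcement boosted (likely Potter-specific).
    (The paper's step 2, swapping unique expressions for generic ones
    via a dictionary, is omitted in this sketch.)"""
    v_base = baseline(input_ids).logits    # [batch, seq_len, vocab]
    v_reinf = reinforced(input_ids).logits
    # Penalize exactly the tokens the reinforced model grew more confident in.
    v_generic = v_base - ALPHA * F.relu(v_reinf - v_base)
    # What a model that never read the books might predict next.
    return v_generic.argmax(dim=-1)

def unlearning_loss(baseline, input_ids, labels):
    """Step 3: ordinary fine-tuning, but toward the generic labels
    instead of the original next tokens."""
    logits = baseline(input_ids).logits
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
```

In the paper, roughly one GPU-hour of this kind of fine-tuning was reportedly enough to scrub most Harry Potter knowledge from Llama2–7b while leaving performance on common benchmarks largely unaffected.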

To read more about this, check out their writeup here: https://www.microsoft.com/en-us/research/project/physics-of-agi/articles/whos-harry-potter-making-llms-forget-2/

For more details, sign up for my free AI newsletter, AI Made Simple: https://artificialintelligencemadesimple.substack.com/

If you liked this article and wish to share it, please refer to the following guidelines.

If you find my writing useful and would like to support it, please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. It costs 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day full-refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.

Support AI Made Simple

Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819

