Some observations on one of Machine Learning’s most controversial fields.
Reinforcement learning is one of the more contentious fields in AI.
Once hyped as the field that would lead to human experts everywhere being replaced, it's now fallen short of expectations. AI legend Yann LeCun had this to say about the impact of RL:
That being said, I believe that RL has some very interesting applications when it comes to platform testing and security. I will be doing a breakdown of how in an upcoming piece on our sister publication AI Made Simple, combining old Pokémon games, RL, and glitch discovery. Consider this a primer for that.
Understanding Reinforcement Learning
What is Reinforcement Learning- RL is one of the big 3 paradigms of ML research (alongside supervised and unsupervised learning). RL is based on teaching an agent to maximize the cumulative reward it collects while interacting with an environment.
Setting up RL Agents-
Usually I would talk about why it's useful first, but in this case, talking about how we set up RL agents will give us a strong indication of where it excels. RL problems require a few different components (there's a small code sketch tying them together after the list)-
- Agent: The agent is the entity that is learning to behave in the environment. It acts and observes the resulting rewards and penalties, tweaking its future actions based on the feedback it receives.
- Environment: The environment is the world in which the agent operates. It provides the agent (and programmers) with states and rewards.
- Policy: The policy is a function that maps states to actions. The agent uses the policy to decide which action to take in any given state.
- Reward: The reward is a signal that the environment provides to the agent after it takes an action. The reward indicates how well the agent is doing.
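To make these pieces concrete, here's a minimal toy sketch in Python: a made-up one-dimensional "corridor" environment, a tabular Q-learning agent, and an epsilon-greedy policy. Everything in it (the environment, the reward values, the hyperparameters) is illustrative rather than taken from any particular library or paper.

```python
import random

# Environment: a tiny 1-D corridor. The agent starts at position 0 and gets
# +1 for reaching the goal at the far end; every other step costs -0.01.
class Corridor:
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action: 0 = left, 1 = right
        move = 1 if action == 1 else -1
        self.state = max(0, min(self.length - 1, self.state + move))
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.01
        return self.state, reward, done

# Policy: epsilon-greedy over a Q-table mapping (state, action) -> value.
def choose_action(q, state, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice([0, 1])                      # explore
    return max([0, 1], key=lambda a: q[(state, a)])       # exploit

# Agent: acts, observes rewards, and updates its Q-table (tabular Q-learning).
env = Corridor()
q = {(s, a): 0.0 for s in range(env.length) for a in (0, 1)}
alpha, gamma = 0.5, 0.9  # learning rate, discount factor

for episode in range(200):
    state, done = env.reset(), False
    while not done:
        action = choose_action(q, state)
        next_state, reward, done = env.step(action)
        best_next = max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# After training, the greedy policy should be "always go right".
print({s: max((0, 1), key=lambda a: q[(s, a)]) for s in range(env.length)})
```

Nothing fancy is happening here: the agent tries things, the environment hands back rewards, and the policy slowly shifts toward the actions that paid off.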
When it makes sense to use Reinforcement Learning- Generally, we see Reinforcement Learning used to teach AI models to play games, develop investment strategies, and pick up other skills (generating better text for ChatGPT, making coffee, driving, etc.). Based on my analysis of RL and its uses, I have a checklist of two items that combine to give us an RL-friendly setup:
- You can’t boil the process into a data-point: RL-friendly processes are inherently hyper-relational (they rely on information from previous states) and have a continuous element to them. This makes meaningfully labelling them for supervised learning a giant pain (it is possible, though). You can label every little step/jump Mario makes and build a full state tree, or you can let an AI agent take control of that psychotic little turtle-hating mushroom addict and just let it fuck around and find out. One requires a lot of manpower. The other lets you run the code in the background while you go hard-sparring with your bros and still call it work. The choice is yours.
- You know what you want- Generally speaking, RL researchers spend most of their time tweaking the reward function to account for the many ways the AI invariably misbehaves. As long as you know what you want the AI to do (explore the map, take the coin, don't die, etc.), you can generally make the changes relatively quickly (the process is simple, but not always easy; see the sketch right after this list). In the same way that supervised learning reduced a lot of work by allowing engineers to build AI without explicitly feeding it the relationships between the targets and features, RL saves dev time by not forcing devs/researchers to handhold their agent through every step.
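To give you a feel for what "tweaking the reward function" looks like in practice, here is a hedged sketch for a platformer-style agent. All the state fields and weights are hypothetical; the point is that "explore the map, take the coin, don't die" turns into a few lines you keep adjusting until the agent stops misbehaving.

```python
def reward(prev_state, state):
    """Toy reward for a platformer-style agent. The state fields and weights
    are made up for illustration; tuning them is most of the actual work."""
    r = 0.0
    r += 0.1 * max(0, state["x"] - prev_state["x"])    # explore the map: reward forward progress
    r += 1.0 * (state["coins"] - prev_state["coins"])  # take the coin
    if state["dead"]:                                   # don't die
        r -= 10.0
    return r
```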
Why RL is useful- RL is valuable not because it behaves like a human, but precisely because it won't. It will accurately show you all the ways your system can be broken. Watch any video where someone trained an RL agent, and you will see all kinds of examples where the agents find completely unexpected loopholes. In my upcoming Pokémon article, you will see that the agent figured out it was losing a battle. Instead of continuing and taking an L, it refused to do anything; by doing nothing, it was technically never defeated. Other agents also discovered RNG exploits, completely unprompted. All of this has a lot of implications for doing security research in increasingly complex tech stacks: by behaving in ways that normal humans would not, RL can add a lot to security testing.
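To connect this to security testing: one common pattern is to reward the agent for reaching states nobody has seen before, instead of for "playing well". The sketch below is a rough, hypothetical illustration of that idea (a novelty-based reward over hashed program states); it is not the actual setup from the Pokémon piece.

```python
seen_states = set()

def novelty_reward(state_bytes):
    """Hypothetical reward for glitch/security hunting: pay the agent only for
    reaching a (hashed) program state it has never visited before."""
    key = hash(state_bytes)
    if key in seen_states:
        return 0.0
    seen_states.add(key)
    return 1.0  # new state discovered; weird corners of the system included
```

An agent chasing this kind of reward has no reason to play "normally", which is exactly why it stumbles into the loopholes a human tester would never think to try.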
There’s been greater investment into RL recently, and as LLMs hit their limits of scale and cost, we’re going to see calls to look into alternative approaches to AI. RL will definitely make a comeback, and you should know about it. To end on a fun note, here is a video where DeepMind (one of Google’s AI research groups) taught an AI to walk-
If you liked this article and wish to share it, please refer to the following guidelines.
If you find AI Made Simple useful and would like to support my writing- please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. This will cost you 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day, complete refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.
Reach out to me
Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.
Small Snippets about Tech, AI and Machine Learning over here
AI Newsletter- https://artificialintelligencemadesimple.substack.com/
My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819