Shruti Turner.

Reinforcement Learning: Meet Agent Bonnie


Shruti Turner - Own Photo

What is Reinforcement Learning?

A subset of Machine Learning that focuses on training algorithms to make sequences of decisions in an environment to achieve a specific goal.

I don't know about you, but that definition doesn't shed much light on the details of what Reinforcement Learning (RL) is. It's quite a technical, high-level definition, and a version of the common definitions you might get if you searched for Reinforcement Learning online.

The way I think about Reinforcement Learning, and how I'd define it if someone asked is...

A subset of Machine Learning where an algorithm is refined by trial and error, getting a reward for each action that it takes in a given circumstance.

Machine Learning, and therefore Reinforcement Learning in this context, is relatively new in the grand scheme of things. The underlying idea has been around for much longer, though, if we think about behavioural modification or, in other words, how we might train our pets. Let's take a deeper look...

RL Key Terms

First, I'll cover some of the terminology we need to be familiar with and put each term in context in the pup world.

Agent: The learner or decision-maker that interacts with the environment. This could be the algorithm being refined OR the dog that needs to be trained not to sit on the sofa. The dog in question is called Bonnie.

Environment: The context in which the agent operates. This can be the physical world, a simulated environment, a software application or your home, in which Bonnie needs to be trained.

State: A representation of the current situation or configuration of the environment. It's the information the agent uses to make decisions. In the dog context, that could be: where is the sofa at a given point, does it have a blanket on it, and where is the dog's human, who might not like her sitting on the sofa?

Action: One of the possible moves or decisions that the agent can choose from in each state. In the dog context, we're looking at a choice of two options: will she decide to sit on the sofa or not?

Reward: A numerical value that the environment provides to the agent as feedback after each action. The reward indicates how good or bad the action was in achieving the agent's goal. A reward is always given for an action but, sadly for Bonnie, a reward isn't always a good thing! It could be a treat for doing the desired action (not sitting on the sofa) or a scolding for doing the undesired action (sitting on the sofa). In computational RL, the reward sits on a numerical scale that indicates how good the behaviour was, rather than always being binary.

Policy: A strategy or set of rules that the agent uses to determine which action to take in each state. The policy is what the agent is trying to learn and optimise.
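If it helps to see these terms as code, here is a minimal Python sketch of Bonnie's world. Every name in it (State, ACTIONS, reward, untrained_policy) is hypothetical and made up for this illustration, and the +1.0 / -1.0 reward values are assumptions standing in for "treat" and "scolding".

```python
import random
from dataclasses import dataclass

# Hypothetical names throughout, chosen to mirror the dog example.
ACTIONS = ("stay_off_sofa", "sit_on_sofa")


@dataclass(frozen=True)
class State:
    """What the agent (Bonnie) can observe about the environment (the home)."""
    human_in_room: bool
    blanket_on_sofa: bool


def reward(action: str) -> float:
    """Feedback from the environment: a treat (+1.0) or a scolding (-1.0).

    The two values are assumed for illustration only.
    """
    return 1.0 if action == "stay_off_sofa" else -1.0


def untrained_policy(state: State, rng: random.Random) -> str:
    """A policy maps a state to an action. Before training it is just a guess."""
    return rng.choice(ACTIONS)


rng = random.Random(0)
state = State(human_in_room=True, blanket_on_sofa=False)
action = untrained_policy(state, rng)
print(state, action, reward(action))
```

The policy here ignores the state entirely, which is exactly what "no clear policy" looks like. Training is the process of replacing that guess with something better.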

How to Train Your Dragon*

*dragon..dog..what's the difference really?!

These stages are stepped through and iterated to refine the policy - this is the process of reinforcement learning.

  1. Initialisation: at the start, your agent doesn't have a clear policy. For example, Bonnie doesn't have a clear policy about sitting on the sofa or not. She might explore and occasionally sit on the sofa to see what happens.

  2. State: Bonnie has a look around to see what the situation is (refer to state above for examples).

  3. Action: Bonnie chooses an action based on the current state. Whether she decides to sit on the sofa or not, a reward is coming (she just doesn't know what it might be!).

  4. Reward: Based on the action Bonnie chose, she will either get a treat and praise or she will be scolded and made to move.

Steps 2-4 are iterated over. Each time Bonnie (in theory) learns from the rewards that she receives, starting to understand or at least recognise that "not sitting on the sofa" gets positive rewards and "sitting on the sofa" gets negative ones. Gradually she learns to make better decisions in each state.

This is the process of reinforcement learning, and happens over time.
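The loop described above can be sketched in a few lines of Python. This is one simple way to do it, not the only one: it assumes two made-up states, treat and scolding values of +1.0 and -1.0, and a basic "nudge the estimate towards the reward" update with a hypothetical learning_rate. All names are invented for illustration.

```python
import random

# Hypothetical states, actions and reward values for the sofa example.
ACTIONS = ("stay_off_sofa", "sit_on_sofa")
STATES = ("human_in_room", "human_away")


def reward(action):
    """Treat (+1.0) for staying off the sofa, scolding (-1.0) for sitting on it."""
    return 1.0 if action == "stay_off_sofa" else -1.0


def train(steps=200, learning_rate=0.1, seed=0):
    rng = random.Random(seed)
    # 1. Initialisation: no opinion yet about any action in any state.
    value = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(steps):
        state = rng.choice(STATES)    # 2. State: Bonnie looks around.
        action = rng.choice(ACTIONS)  # 3. Action: here she simply tries things at random.
        r = reward(action)            # 4. Reward: treat or scolding.
        # Learn: move the estimate for this (state, action) a little towards the reward.
        value[(state, action)] += learning_rate * (r - value[(state, action)])
    return value


def greedy_policy(value):
    """The learned policy: in each state, pick the action with the highest estimate."""
    return {s: max(ACTIONS, key=lambda a: value[(s, a)]) for s in STATES}


values = train()
policy = greedy_policy(values)
print(policy)
```

After enough iterations, the estimates for "stay_off_sofa" end up positive and those for "sit_on_sofa" end up negative, so the learned policy is to stay off the sofa in both states.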

Exploration vs Exploitation

The goal of RL is for the agent to find an optimal policy that maximises the expected cumulative reward over time. However, there is a trade-off to consider between the short term and the long term.

Exploration: where the agent tries new actions in a given state to see what the reward might be.

Exploitation: where the agent keeps doing the same action in a given state because it knows the reward is good.

With too much exploration, your agent can't really learn what to do because there isn't enough reinforcement of the same action in a given state. However, with too much exploitation you get a very narrow-minded agent who only knows what to do in a limited number of states.

There needs to be a balance between the two: the short-term gain of exploitation and the long-term gain of exploration. With that balance you will have an agent with an optimised policy or rather, I'll have a well-behaved Bonnie!
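One common way to strike this balance is a rule usually called epsilon-greedy: most of the time the agent exploits the best action it knows, and a small fraction of the time (epsilon) it explores at random. The sketch below is an illustration only. The function name, the value estimates and the choice of epsilon = 0.1 are all assumptions, not something prescribed by RL in general.

```python
import random


def choose_action(values, epsilon, rng):
    """Epsilon-greedy choice over a dict of {action: estimated value}.

    With probability epsilon, explore (pick any action at random);
    otherwise exploit (pick the action with the highest estimate).
    """
    if rng.random() < epsilon:
        return rng.choice(list(values))
    return max(values, key=values.get)


# Hypothetical estimates Bonnie might hold after some training.
values = {"stay_off_sofa": 0.8, "sit_on_sofa": -0.5}
rng = random.Random(0)

picks = [choose_action(values, 0.1, rng) for _ in range(1000)]
print(picks.count("stay_off_sofa"), "of 1000 picks were the known-good action")
```

Setting epsilon to 0 gives pure exploitation and setting it to 1 gives pure exploration, so the single number acts as the dial between the two extremes.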

In this post you've got a high-level introduction to how Reinforcement Learning works. No equations and not super technical, but hopefully enough to whet your appetite and get you stuck in further if you're interested!
