All these examples can be unified under a general formulation: performing an action in a scenario can yield a reward. A more technical term for a scenario is a state, and we call the collection of all possible states a state-space. Performing an action causes the state to change. But the question is, what series of actions yields the highest cumulative reward?
Here are some examples to open your eyes to some successful uses of RL by Google:
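To make the formulation concrete, here is a minimal Python sketch. Everything in it is a hypothetical toy invented for illustration: a four-state line world, the `step` function, and the reward values are assumptions, not something taken from a real library. It brute-forces every action sequence of a fixed length and keeps the one with the highest cumulative reward, which only works because the state-space is tiny.

```python
from itertools import product

# Hypothetical toy problem: four states laid out on a line, two actions.
STATES = [0, 1, 2, 3]
ACTIONS = ["left", "right"]

def step(state, action):
    """Perform an action in a state; return (next_state, reward)."""
    if action == "left":
        next_state = max(STATES[0], state - 1)
    else:
        next_state = min(STATES[-1], state + 1)
    # Assumed reward scheme: +10 for arriving at the last state, -1 otherwise.
    reward = 10 if (next_state == STATES[-1] and state != STATES[-1]) else -1
    return next_state, reward

def cumulative_reward(start, actions):
    """Sum the rewards collected by following a series of actions."""
    state, total = start, 0
    for action in actions:
        state, reward = step(state, action)
        total += reward
    return total

def best_sequence(start, horizon):
    """Try every action sequence of the given length and keep the best one."""
    return max(product(ACTIONS, repeat=horizon),
               key=lambda seq: cumulative_reward(start, seq))
```

Calling `best_sequence(0, 3)` searches all eight three-step sequences from state 0. Exhaustive search like this stops being practical as soon as the state-space or the horizon grows, which is why the question of choosing actions needs a better answer.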
- Game playing
- More game playing
- Robotics and control
Reinforcement learning is not supervised learning, because the training data comes from the algorithm deciding between exploration and exploitation. And it's not unsupervised, because the algorithm receives feedback from the environment. As long as you're in a situation where performing an action in a state produces a reward, you can use reinforcement learning to discover the best sequence of actions to take.
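The trade-off between exploration and exploitation can be sketched in a few lines of Python. This uses the epsilon-greedy rule, which is one common way to make that decision and is chosen here only for illustration. The function names, the `true_rewards` dictionary, and the noise model are all assumptions made up for this example.

```python
import random

def choose_action(estimates, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(estimates))        # explore: try a random action
    return max(estimates, key=estimates.get)      # exploit: best estimate so far

def run(true_rewards, steps=1000, epsilon=0.1, seed=0):
    """Learn reward estimates purely from feedback generated by acting."""
    rng = random.Random(seed)
    estimates = {action: 0.0 for action in true_rewards}
    counts = {action: 0 for action in true_rewards}
    for _ in range(steps):
        action = choose_action(estimates, epsilon, rng)
        # Assumed environment: the true reward plus Gaussian noise.
        reward = true_rewards[action] + rng.gauss(0, 1)
        counts[action] += 1
        # Incremental running mean of the rewards seen for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts
```

For example, `run({"left": 0.0, "right": 5.0})` returns the learned estimates and how often each action was tried. There is no labeled dataset here: the only training data is the stream of rewards produced by the actions the algorithm itself chose.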
You may notice that reinforcement learning lingo involves anthropomorphizing the algorithm into taking "actions" in "situations" to "receive rewards." In fact, the algorithm is often referred to as an "agent" that "acts with" the environment. It shouldn't be a surprise that much of reinforcement learning theory is applied in robotics.
A robot performs actions to change between different states. But how does it decide which action to take? The next section introduces a new concept, called the policy, to answer this question.