StarCraft II Learning Environment full overview (III)
Theory: Deep Reinforcement Learning Overview
“Prismatic core online”: Protoss Stalker quote when the unit is repeatedly selected or given an order
The Reinforcement Learning Problem
The paradigm of learning by trial and error, exclusively from rewards, is known as Reinforcement Learning. The essence of RL is learning through interaction with an environment, mimicking the human way of learning [1], and it has its roots in behaviourist psychology.
To define the Reinforcement Learning problem we need an environment in which a perception-action-learning loop [fig1] takes place. At timestep t, the agent observes a given state and, following its policy, takes an action that may have long-term consequences. The environment then moves to the next state at timestep t+1, returning an observation and a reward as feedback, and the agent updates its policy. In short, the agent receives observations/states and rewards from the environment and interacts with it through actions.
[fig1] Perception-action-learning loop
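As a rough sketch of this loop in code (the `env` and `agent` objects and their `reset`, `step`, `act` and `update` methods are hypothetical, gym-style names rather than part of pySC2):

```python
# Minimal perception-action-learning loop (illustrative sketch).
def run_episode(env, agent):
    state = env.reset()                  # initial observation from the environment
    done = False
    total_reward = 0.0
    while not done:
        action = agent.act(state)        # the agent follows its policy in the current state
        next_state, reward, done = env.step(action)            # the environment gives feedback
        agent.update(state, action, reward, next_state, done)  # learn from the transition
        state = next_state               # move on to timestep t+1
        total_reward += reward
    return total_reward
```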
In RL we treat both prediction and control problems, driven by the state-value or the action-value function. When you go down to the algorithmic level of this challenge, the differences come from how you treat the value function and how the policy strategy affects the agent.
The Reinforcement Learning problem can be described formally as a Markov Decision Process (MDP): it describes an environment for reinforcement learning, the surroundings or conditions in which the agent learns and operates. A Markov process is a sequence of states with the Markov property, which claims that the future is independent of the past given the present. This means we only need the current state to evaluate the agent.
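To make this concrete, a toy MDP can be written down as plain data; the state and action names below are invented for the example:

```python
# A toy MDP described as plain data (purely illustrative).
# transitions[state][action] -> list of (probability, next_state, reward)
gamma = 0.9  # discount factor

transitions = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "s1", 2.0)],
    },
}
# Markov property: the distribution over next_state and reward depends only
# on the current state and action, never on the history that led there.
```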
RL Agents
Reinforcement learning agents may include one or more of these components [2].
- Value function V(s). Acts as an evaluator and estimates how much reward we expect to get or, in other words, measures how well we are doing. In the RL taxonomy, this is where the distinction between value-based and policy-based agents comes from.
- Policy Π. Understood as the agent's behaviour function. The policy maps from state to action and can be deterministic or stochastic. The RL taxonomy also distinguishes on-policy algorithms (learning on the job) from off-policy algorithms (learning from another behaviour). They differ in how the value function is calculated: on-policy algorithms do not reuse stored experience, and V(s) comes from the same policy that is being followed, while in off-policy algorithms V(s) comes from another policy. The goal of RL is to find an optimal policy which achieves the maximum expected return from all states.
- Model. The agent's representation of the environment. The RL taxonomy distinguishes model-free and model-based agents: a model-based agent has knowledge about the environment, which can be partially or totally observable. (A small sketch of these three components follows this list.)
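A minimal sketch of what these three components could look like inside an agent (all names are illustrative, not taken from any library):

```python
from collections import defaultdict

# Illustrative container for the components an RL agent may include.
class AgentComponents:
    def __init__(self, actions):
        # Value function V(s): estimates how much reward we expect from state s.
        self.V = defaultdict(float)
        # Policy Π: maps a state to an action. Deterministic here; a stochastic
        # policy would map a state to a probability distribution over actions.
        self.policy = {}
        # Model: the agent's representation of the environment, e.g. estimated
        # transition probabilities. A model-free agent simply keeps none.
        self.model = None
        self.actions = actions
```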
There are two main approaches to solving RL problems: value-function methods and policy search [fig2]. There is also a hybrid of the two, explored more recently, known as the actor-critic approach.
[fig2] Graphical explanation of the policy learning approach
How policy learning works: in a state St, we compute Q(s, a) and take one of the actions that helps to achieve the goal. The agent acts on the environment by executing actions, and the environment acts on the agent by returning the observations that result from the actions the agent took.
The value function
The value function is a mathematical abstraction that acts as an evaluator of the agent. A large part of the RL taxonomy is defined by how the value function is treated. It is important to underline that there are two kinds of value function:
- State-value function V(s): measures how good it is to be in a particular state s.
- Action-value function Q(s, a): measures how good it is to take a particular action in a particular state. The Bellman optimality equation, which is fundamental for defining the RL solution approach, is based on the action-value function (written out below).
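Written out, the Bellman optimality equation for the action-value function takes its standard form:

```latex
Q^{*}(s, a) = \mathbb{E}\Big[\, r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \;\Big|\; s_t = s,\ a_t = a \Big]
```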
Policy-search methods, in contrast, do not need to maintain a value-function model but search directly for an optimal policy. In the dynamic programming context, the process of producing an agent comes in two phases: policy evaluation and policy iteration (improvement), which treat the prediction and control problems respectively [3]. In many cases, innovations come from the way we think about and treat the value function.
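A minimal sketch of the policy evaluation (prediction) step, reusing the toy `transitions` dictionary from the MDP example above (again purely illustrative):

```python
# Iterative policy evaluation: compute V(s) for a fixed policy on the toy MDP.
def evaluate_policy(transitions, policy, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in transitions}               # initialise V(s) arbitrarily
    while True:
        delta = 0.0
        for s in transitions:
            a = policy[s]                           # action the fixed policy takes in s
            new_v = sum(p * (r + gamma * V[s2])     # expected one-step return
                        for p, s2, r in transitions[s][a])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                           # stop once the values stop changing
            return V

# Example: evaluate the policy that always chooses "right".
# V = evaluate_policy(transitions, {"s0": "right", "s1": "right"})
```

Policy improvement would then act greedily with respect to this V(s), and alternating the two steps is what policy iteration does.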
What is Deep Reinforcement Learning?
Reinforcement Learning is an area of Machine Learning based on mimicking the human way of learning [1]. We will see later on that agents from pySC2 construct and learn their own knowledge directly from raw inputs, such as vision, using Deep Learning. The combination of these approaches, with the goal of achieving human-level performance across many challenging domains, is known as Deep Reinforcement Learning [3]. DRL can deal with some of the problems RL has, such as the curse of dimensionality [6]. In Deep Reinforcement Learning, neural networks act as function approximators for finding optimal action-value functions, replacing the tabular Q-table.
Note: DQN = Deep Q-Network, a deep neural network used to approximate the Q-learning action-value function.
[6] In general, DRL is based on training deep neural networks to approximate the optimal policy Π and/or the optimal value functions V(s), Q and A. Deep Q-Networks, however, are only one way to solve the deep RL problem; another state-of-the-art algorithm, A3C, introduces methods based on asynchronous RL. Video games may be an interesting challenge, but learning how to play them is not the end goal of DRL [4]: the main goal is the vision of creating systems that are capable of learning how to adapt in the real world. In this step, we dig into a DRL agent to get started with the machine learning challenge. Having said that, the environment presents itself as suitable for Reinforcement Learning in general, and not only for DRL.
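As an illustration of a neural network acting as the function approximator, here is a small sketch assuming PyTorch; layer sizes are arbitrary, and a real DQN working on raw pixels would use convolutional layers instead of this small fully connected net:

```python
import torch
import torch.nn as nn

# A small Q-network: maps a state vector to one Q-value per discrete action.
class QNetwork(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions),   # one output per action
        )

    def forward(self, state):
        return self.net(state)             # Q(s, ·) for every action at once
```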
DQN Algorithm Overview[fig3]
At a high level, the algorithm takes RGB pixels as input and outputs the values of the actions the agent might take under a given policy. It initialises a random action-value function Q(s, a) and then, in a loop over all episodes and all timesteps, follows an ε-greedy policy: with probability ε it selects a random action, and otherwise the action with the highest action-value. It executes the action and stores the transition tuple in memory.
[fig3] Deep Q-Network (DQN), adapted from Mnih et al. (2015)
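A sketch of the ε-greedy action selection described above, assuming PyTorch and a Q-network like the one sketched earlier (`q_net` is a placeholder name):

```python
import random
import torch

# ε-greedy selection: explore with probability ε, otherwise exploit the Q-network.
def select_action(q_net, state, num_actions, epsilon):
    if random.random() < epsilon:
        return random.randrange(num_actions)               # explore: random action
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax().item())               # exploit: highest Q-value
```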
DQN Algorithm Optimization: Target Network, Experience Replay and Huber Loss
Experience Replay allows training using stored memories from the agent's experience and avoids forgetting states that the neural model hasn't seen in a while. The idea behind experience replay is that at each Q-learning iteration you play one step in the game, but instead of updating the model based on that step alone, you add the information from the step you took to a memory and then train on a batch sampled from that memory [3]. The Target Network brings stability to learning [7]: every fixed number of iterations we make a copy of the Q-network and use it to compute the target instead of the current Q-network. The Huber loss comes in to stabilise learning in the short and the long term, as this function is quadratic for small values and linear for large values. For those familiar with loss functions from neural networks, this means it behaves as two different losses depending on the size of the error.
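These three tricks could be sketched as follows, again assuming PyTorch; the class and function names are illustrative:

```python
import random
from collections import deque

import torch.nn as nn

# 1) Experience replay: store transitions and sample random mini-batches later.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

# 2) Target network: a periodically refreshed copy of the online Q-network,
#    used to compute the training target so the target does not move every step.
def sync_target(online_net, target_net):
    target_net.load_state_dict(online_net.state_dict())

# 3) Huber loss: quadratic for small errors, linear for large ones.
huber_loss = nn.SmoothL1Loss()
```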
DQN Variations: Double Deep Q-Learning with Dueling Network Architecture
Double Q-learning [fig4] is a model-free, off-policy algorithm that can be used at scale to successfully reduce the overoptimism inherited from DQN, resulting in more stable and reliable learning. In Double Deep Q-learning, two sets of weights are kept: one is used to determine the greedy policy (selection) and the other to determine its value (evaluation). In practice, the online network trains continuously and, from time to time, its weights are copied over to the second network, which is then used for evaluation [10].
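A sketch of the Double DQN target under those assumptions (PyTorch tensors, plus the online and target networks from the earlier sketches):

```python
import torch

# Double DQN target: the online network *selects* the next action,
# the target network *evaluates* it, which reduces overestimation of Q-values.
def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)    # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)   # evaluation
        return rewards + gamma * next_q * (1.0 - dones)
```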
Dueling DQN helps to generalize learning, and the key insight behind the architecture is that for many states it is unnecessary to estimate the value of each action choice.
Besides, it decouples the action-value function Q(s, a) into a state-value function V(s) and an advantage function A(s, a), which leads to better policy evaluation and has shown good results [11], shining in terms of where the network focuses its attention.
The dueling network has two streams that separately estimate the scalar state-value and the advantages for each action; a special module then implements a non-trivial equation to combine them (sketched below).
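One common form of that combining module (the one used in the experiments of the original dueling paper) subtracts the mean advantage so that V(s) and A(s, a) can be recovered unambiguously; a minimal sketch, assuming PyTorch tensors:

```python
# Dueling aggregation: combine the state-value stream and the advantage stream.
# value has shape (batch, 1); advantage has shape (batch, num_actions).
def dueling_q(value, advantage):
    return value + advantage - advantage.mean(dim=1, keepdim=True)
```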
Note: the key ideas behind these variations are to reduce overoptimism and to help with generalization.
Congratulations!
Now you have gone through the agent and the main Machine Learning concepts needed.
Thanks for being around!
Let's go deeper with the overview of the learning environment!