StarCraft II Unplugged : Offline Reinforcement Learning

Feb 21, 2022 · 12 min read


StarCraft II is a real-time strategy game developed by Blizzard. It is a challenging benchmark because it has several properties that are interesting from the machine learning perspective: real-time play, partial observability, and vast action and observation spaces. Mastering the game entails strategic planning over long time horizons, real-time control at both the macro and micro level, and the ability to counter an opponent's moves as they happen.

In this article we will cover the StarCraft II Unplugged paper [1], a benchmark that evaluates algorithms and agents against the reference AlphaStar paper [2]. Its main contributions are a dataset for training, evaluation metrics, and baseline agents benchmarked against reference agents derived from AlphaStar. Broadly, the work is based on learning from a dataset of human replays, and proposes offline reinforcement learning policy evaluation methods with some online policy improvements.

From the offline RL perspective, the paper highlights properties of StarCraft that make it interesting for this challenge:

  • Data source. The dataset should not be biased by being generated by RL agents. One of the challenges of offline reinforcement learning is ensuring the diversity and richness of the gameplay strategies present in the dataset [3], which the paper refers to as coverage. The authors of [3] also describe this challenge as ensuring the presence of high-reward regions of the state space.
  • Large, hierarchical and structured action space. To play correctly, the agent must select an action, the units to apply that action to, and the place on the map where the action is performed. There are approximately 10^26 possible actions per step. From the coverage perspective, the challenge is selecting suitable sequences of hierarchical actions that ensure a diversity of game strategies.
  • Stochastic environment. In stochastic environments, the agent's next state cannot always be determined from its current state and action: taking an action does not guarantee that the agent ends up in the state it intended. Such environments may need more trajectories to obtain high state-action coverage.
  • Partial observability. Because of the fog of war, the agent does not know what the opponent is doing unless it explores the environment. The information gathered by exploring may only become useful later in the game, which implies that memory may be needed to ensure coverage.

Offline Reinforcement learning fundamentals

Reinforcement learning can be defined as a subfield of machine learning based on learning through interaction with an environment that provides a reward signal as feedback, with the general goal of learning a policy π that maximises that reward. Within this framework, methods can be classified by how they learn: in online RL, the agent interacts with the environment at each time step. In off-policy RL, the agent's experience is stored in a replay buffer D, which is then used to update the policy.

Another key concept from reinforcement learning is policy evaluation: estimating a value function that gives the expected return of a policy π from a state s or a state-action pair. The value function can then be used for policy improvement, which increases the probabilities of actions with higher values. Last but not least, repeatedly alternating policy evaluation and policy improvement is the core of many reinforcement learning algorithms and is called policy iteration.
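To make the evaluation and improvement loop concrete, here is a minimal tabular sketch. The three-state MDP, its rewards and the discount factor are invented for illustration and have nothing to do with StarCraft; the improvement step is the greedy special case of "raising the probability of higher-valued actions".

```python
# Toy deterministic MDP (hypothetical): TRANSITIONS[state][action] = (next_state, reward)
TRANSITIONS = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
    2: {"left": (2, 0.0), "right": (2, 0.0)},  # absorbing state
}
GAMMA = 0.9  # discount factor, chosen arbitrarily


def evaluate_policy(policy, sweeps=200):
    """Policy evaluation: estimate V(s) for a fixed deterministic policy."""
    values = {s: 0.0 for s in TRANSITIONS}
    for _ in range(sweeps):
        for s in TRANSITIONS:
            next_s, reward = TRANSITIONS[s][policy[s]]
            values[s] = reward + GAMMA * values[next_s]
    return values


def improve_policy(values):
    """Policy improvement: act greedily with respect to the current values."""
    policy = {}
    for s, actions in TRANSITIONS.items():
        policy[s] = max(
            actions,
            key=lambda a: actions[a][1] + GAMMA * values[actions[a][0]],
        )
    return policy


def policy_iteration():
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: "left" for s in TRANSITIONS}
    while True:
        values = evaluate_policy(policy)
        new_policy = improve_policy(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


final_policy, final_values = policy_iteration()
```

In this toy problem the loop settles on moving right from states 0 and 1, which is the only way to collect the reward.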

Fig 1: The main idea in offline RL is to use a dataset D that is not altered during training to train a policy. In the training phase, the dataset is used to train a policy without any interaction with the environment. During the deployment phase, the learned policy is tested in the environment. This cycle can be repeated to train new policies. This image is an adaptation of a figure from ref [2] to the StarCraft II environment.

The offline RL formulation borrows the data-driven principle of supervised learning: it uses a previously collected dataset that follows the MDP structure to train a policy. However, the agent no longer has the ability to interact with the environment and collect additional transitions using the learned policy. Instead, the algorithm is provided with a static dataset of transitions, which plays the role of a training dataset.

Removing the online training loop lowers the compute needed to experiment with StarCraft, making it more accessible to the research community [1]. However, one of the main challenges of offline RL is that we want the learned policy to perform better than the behaviour seen in the dataset D; in practice, it has to execute a sequence of actions that is in some way different from, and desirably better than, the behaviour pattern observed in D. In StarCraft II Unplugged, even though the algorithms do not collect more data by interacting with the environment, learned policies can be run in the environment to measure how well they perform, and this evaluation can be useful for hyperparameter tuning. A general lesson from the paper about creating successful agents for the StarCraft challenge is to first train a policy.

The offline reinforcement learning field can draw on approaches from both online and offline reinforcement learning to learn a policy. In this particular case, experiments have shown that algorithms with behaviour value estimation, heavily based on calculating the value function, ...

The Dataset

The StarCraft dataset was built by filtering the 20 million publicly available StarCraft II games. The data distribution is positively skewed, and the selection takes the following considerations into account:

  • 1.4 million human games with an MMR above 3500, played by the top 22% of players. Each two-player game yields one episode per player, giving 2.8 million episodes in the dataset and more than 30 years of game played in total.
  • The average two-player game lasts about 11 minutes.
  • Only the frames that include an action are kept, which cuts the length of an episode by a factor of 12.
  • Some algorithms in the benchmark are fine-tuned on a second, high-quality dataset of winning games only, with MMR > 6200.

Note: the dataset does not contain the full length of each replay. Trajectories are shortened to the steps in which the player took an action. From the engineering and API perspective, this means that all steps with no action (NO_OP in the pysc2 API) have been removed.
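As a rough illustration of this filtering, the sketch below drops the steps without an action from a toy episode. The dictionary layout and the `NO_OP` label are assumptions made for the example; they are not the pysc2 data structures or the paper's pipeline.

```python
NO_OP = "no_op"  # hypothetical label standing in for the API's no-op action


def drop_no_op_steps(episode):
    """Keep only the steps in which the player issued an action."""
    return [step for step in episode if step["action"] != NO_OP]


# Toy episode: one dictionary per game frame (made-up contents).
episode = [
    {"frame": 0, "action": NO_OP},
    {"frame": 1, "action": "select_unit"},
    {"frame": 2, "action": NO_OP},
    {"frame": 3, "action": NO_OP},
    {"frame": 4, "action": "move"},
    {"frame": 5, "action": NO_OP},
]

filtered = drop_no_op_steps(episode)
compression = len(episode) / len(filtered)  # how much shorter the episode became
```

Here six frames shrink to two action steps; in the dataset described above the reported reduction is about 12 times.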

The replay package released by Blizzard includes 5M replays. A replay is a recording of a game rather than a ready-made training example, so, with coverage in mind, this work filters a set of replays and transforms them into rollouts: sequences of K consecutive timesteps taken from a trajectory, assembled into a minibatch of M independent rollouts. K is therefore the rollout length, the number of consecutive timesteps in each sequence.

Fig 2. The dataset is made of state-action tuples grouped into rollouts, sequences of K consecutive timesteps, assembled into a minibatch of M independent rollouts. This follows the structure of other offline RL initiatives such as D4RL. With K = 1, each rollout is a single timestep; with a larger K, each rollout covers a longer stretch of a trajectory. The minibatch size M and the rollout length K influence the final behaviour cloning performance.
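A minimal sketch of how M independent rollouts of K consecutive timesteps could be drawn from a set of episodes. The function name, the uniform sampling scheme and the toy episodes are assumptions for illustration; the paper's data pipeline may sample differently.

```python
import random


def sample_minibatch(episodes, rollout_length, batch_size, rng):
    """Draw `batch_size` independent rollouts of `rollout_length` consecutive steps."""
    eligible = [ep for ep in episodes if len(ep) >= rollout_length]
    if not eligible:
        raise ValueError("no episode is long enough for the requested rollout length")
    batch = []
    for _ in range(batch_size):
        episode = rng.choice(eligible)  # pick an episode
        start = rng.randrange(len(episode) - rollout_length + 1)  # pick a window
        batch.append(episode[start:start + rollout_length])
    return batch


# Toy episodes: lists of (state, action) tuples with made-up labels.
episodes = [
    [(f"s{e}_{t}", f"a{e}_{t}") for t in range(length)]
    for e, length in enumerate([5, 8, 3])
]

K, M = 4, 6
minibatch = sample_minibatch(episodes, rollout_length=K, batch_size=M, rng=random.Random(0))
```

Episodes shorter than K (the third toy episode here) simply cannot contribute a rollout, which is one reason the choice of K interacts with the data.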

Observation and action Space

The complexity of StarCraft gameplay leads to a hierarchical organisation of the observation and action spaces. They include information about the world, units and scalar inputs, as well as the abilities that StarCraft II units can execute, all encoded through the pysc2 API. Below is a summary of both spaces, including a screenshot of the human game view and the agent's interpretation of those observations and actions, aiming to answer two high-level questions:

How does the agent see the world? And how does the agent execute its decisions?


The observation space can be divided into three main components, which together make up the state in the trajectories of the dataset D. These components are described below and shown in Fig 4.

World: a 128x128 tensor holding map and minimap information, including the map layout and the player's control over it. It acts as the feature planes at the input of the architecture.

Units: a list of the units observed by the agent. It contains all of the agent's units, the opponent's units when they are within the agent's vision, and the last known state of the opponent's buildings. For each unit, the observation is a vector of size 43 containing all the information that would be available in the game interface. This can be interpreted as the micro-management observation of the game.

Scalars: global inputs in the form of a 1-d vector that includes the resources (minerals and vespene gas), information about the opponent, worker information and unit costs. This can be interpreted as the macro-management observation of the game.

Fig 4. This figure shows how the agent sees or interprets StarCraft. It is an example of the observations at timestep t, based on how the API processes the game interface. The figure shows three hierarchical levels: a global world tensor of 128x128 pixels that includes minimap information, called feature planes in the input; information about units as a list of vectors of size 43 that can grow up to 512 entries; and scalar inputs, which together with unit features and unit arguments carry global information from the game interface. The feature maps have been simplified using a capture of the Simple64 map, but they give an overview of how the agent would see the Nexus and Probes on screen.
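The three observation components can be pictured as a small container with fixed shapes. This is only a sketch: the class and helper names are hypothetical, the number of feature planes and scalars are placeholders, and reading 512 as the maximum number of units is an interpretation of the figure caption.

```python
from dataclasses import dataclass
from typing import List

WORLD_SIZE = 128     # from the text: 128x128 world tensor
UNIT_FEATURES = 43   # from the text: one vector of size 43 per unit
MAX_UNITS = 512      # assumption: upper bound on the unit list


@dataclass
class Observation:
    world: List[List[List[float]]]  # feature planes, indexed [plane][y][x]
    units: List[List[float]]        # one feature vector per observed unit
    scalars: List[float]            # global 1-d vector (resources, costs, ...)

    def validate(self) -> None:
        """Raise ValueError if a component does not have the expected shape."""
        for plane in self.world:
            if len(plane) != WORLD_SIZE or any(len(row) != WORLD_SIZE for row in plane):
                raise ValueError("each world plane must be 128x128")
        if len(self.units) > MAX_UNITS:
            raise ValueError("too many units")
        if any(len(unit) != UNIT_FEATURES for unit in self.units):
            raise ValueError("each unit must have 43 features")


def empty_observation(num_planes: int, num_units: int, num_scalars: int) -> Observation:
    """Build a zero-filled observation; the three counts are placeholders."""
    world = [[[0.0] * WORLD_SIZE for _ in range(WORLD_SIZE)] for _ in range(num_planes)]
    units = [[0.0] * UNIT_FEATURES for _ in range(num_units)]
    return Observation(world=world, units=units, scalars=[0.0] * num_scalars)


obs = empty_observation(num_planes=2, num_units=3, num_scalars=10)
obs.validate()
```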


Actions combine up to three standard parts of the gameplay: unit selection (WHO should do something), ability selection (WHAT that unit should do) and target selection (TO WHOM the action should apply). Each raw action is subdivided into up to 7 arguments, which are chosen successively. These arguments are not independent of each other: once the function is selected and the agent has chosen its ability, the remaining arguments are computed one after another.

Fig 5. Each action is encoded in the StarCraft II API (pysc2) as a function that has up to 7 arguments. Each argument answers a question about the specific action, such as when or where exactly it should be performed. These arguments are computed sequentially after the function.

Below is an example of the action space in which a selected Zealot moves to a given point and more Zealots are trained at the Gateway.

Fig 6. Summary of an action: move the Zealot to a given position on the map. The action space is divided into general outputs: the scalars answer the question of WHAT shall be done, the units tell us WHO should do it, and the world tells us WHERE it should be done.
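The idea that arguments are chosen one after another, each seeing what was chosen before, can be sketched as follows. The function names, the argument schema and the toy chooser are all hypothetical; they do not reproduce the pysc2 action definitions.

```python
from dataclasses import dataclass, field

MAX_ARGUMENTS = 7  # from the text: up to 7 arguments per action

# Hypothetical schema: which arguments each function needs, in order.
ARGUMENTS_BY_FUNCTION = {
    "move": ["unit_selection", "target_location"],
    "train_zealot": ["unit_selection"],
    "no_op": [],
}


@dataclass
class Action:
    function: str                                  # WHAT should be done
    arguments: dict = field(default_factory=dict)  # WHO / WHERE, filled in order


def build_action(function, choose_argument):
    """Fill in the arguments successively, conditioning on earlier choices."""
    action = Action(function=function)
    for name in ARGUMENTS_BY_FUNCTION[function]:
        # Each choice sees the function and every argument chosen so far.
        chosen_so_far = dict(action.arguments)
        action.arguments[name] = choose_argument(name, function, chosen_so_far)
    return action


def toy_chooser(name, function, chosen_so_far):
    """Stand-in for the policy heads: returns fixed, made-up values."""
    if name == "unit_selection":
        return ["gateway_1"] if function == "train_zealot" else ["zealot_7"]
    if name == "target_location":
        return (40, 52)
    raise KeyError(name)


move = build_action("move", toy_chooser)
train = build_action("train_zealot", toy_chooser)
```

In a learned agent, `toy_chooser` would be replaced by network heads that sample each argument from a distribution conditioned on the earlier ones.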

Neural Network Architecture

The neural network architecture in this paper differs from that of the Nature AlphaStar paper: the AlphaStar agent uses an LSTM module, whereas the StarCraft II Unplugged paper shows that removing this memory module results in better performance. Against the very-hard bot, the LSTM-based architecture reached a 70% win rate versus 84% for the memory-less agent.

As we saw in the Observation and action space section, the pysc2 [4] API provides different kinds of observations that are encoded in three parts. The previous section described the observations and how they relate to StarCraft gameplay; now we focus on how these inputs are processed by the neural network architecture:

Inputs are divided into three levels that are processed independently and interact with each other at certain points. Vectors are 1-dimensional and encode global scalar values about the game. Units are a list of the units observed by the agent, both its own and the opponent's, with each unit represented by a vector of size 43. Feature planes are a 128x128 tensor that encodes world information. All this information is provided by the API.

Trainable modules are the parts that the model learns. They involve different neural network architectures such as MLPs (multilayer perceptrons), ConvNets or Transformers.

Fixed operations process information in a meaningful way, either to help a trainable module or to pass information to another part of the neural network architecture.

In this case, the logits operation produces the probability distribution over the possible values of a function argument.

Actions are a hierarchical set of up to 7 arguments that belong to an overall function.

Fig 6. Overall architecture used for all the reference agents. Reading from bottom to top: observations at three hierarchical levels (feature planes, units and vectors) are processed by trainable modules and fixed operations to produce actions. The action arguments are produced sequentially, from left to right. To produce them, a logits module outputs the categorical distribution from which each argument is selected.

Ultimately, the neural network weights θ are learned through the different agents and algorithmic approaches.
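To show what "logits giving a categorical distribution" means in practice, here is a small sketch that turns a vector of logits into probabilities with a softmax and samples one argument value from them. The logits are made-up numbers, not network outputs.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into probabilities that sum to one."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits for one action argument with four possible values.
logits = [2.0, 0.5, -1.0, 0.0]
probs = softmax(logits)
choice = sample_categorical(probs, random.Random(0))
```

The highest logit gets the highest probability, but sampling still allows the other values to be picked occasionally.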

Agents & Evaluation Metrics

The work presents 6 new agents in 3 general categories and compares them with the reference AlphaStar agents from the original paper, with some empirical insights into the policy improvement phase. The main difference between the original AlphaStar paper and the reference agents in this work is that the latter are trained to play all three races of StarCraft II, using replays with an MMR higher than 3500.

Among the highlights about the agents: traditional offline RL methods exploit rewards to perform policy improvement, and here each policy improvement strategy depends on the agent's algorithmic approach. Further gains come from fine-tuning with a better set of replays, plus the introduction of MuZero Supervised, which with MCTS ultimately outperforms all other agents. The table below describes the performance of the agents under the evaluation metrics.

Fig 7. Table presented by the paper's authors at the DRL Workshop at NeurIPS 2021. Agent performance in 3 categories. Yellow: three agents with a behaviour cloning approach, coming from AlphaStar, from imitation (BC) and from the value function approach (FT-BC). Green: the offline actor-critic approach in its plain (OAC) and emphatic (E-OAC) forms. Orange: the MuZero agents.

What is the difference between all these methods ?

At a very broad level, the conceptual difference between the agents is that the BC agents rely heavily on learning from human replays; the later iteration, FT-BC, additionally uses the value function Vπ and fine-tuning on a set of replays with MMR > 6200. These first two algorithms also serve as a baseline for the rest, since their neural network weights θ are used by the actor-critic and MuZero methods. The BC policy μ estimated from the data therefore acts as an initialization for the rest of the agents.
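Behaviour cloning in its simplest form is maximum-likelihood estimation of the policy that generated the data. The sketch below does this with a lookup table instead of a neural network with weights θ; the states, actions and counts are made up, and the paper's agents are of course far richer than this.

```python
import math
from collections import Counter, defaultdict


def fit_behaviour_policy(dataset):
    """Estimate mu(a | s) as the empirical action frequency in each state."""
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return {
        state: {action: n / sum(c.values()) for action, n in c.items()}
        for state, c in counts.items()
    }


def negative_log_likelihood(policy, dataset):
    """Average behaviour cloning loss: -log mu(a | s) over the dataset."""
    return -sum(math.log(policy[s][a]) for s, a in dataset) / len(dataset)


# Toy dataset of (state, action) pairs from hypothetical human replays.
dataset = (
    [("early_game", "build_worker")] * 3
    + [("early_game", "scout")]
    + [("late_game", "attack")] * 2
)

mu = fit_behaviour_policy(dataset)
loss = negative_log_likelihood(mu, dataset)
```

A neural behaviour cloning agent minimises the same kind of loss by gradient descent, and the resulting weights can then serve as a starting point for other methods.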

Sampled MuZero & MuZero Unplugged: Offline RL improved in the Online setting

With these fundamentals in mind, two of the most innovative approaches presented in this work build on two earlier works, with some empirical discoveries and lessons learnt:

  • Sampled MuZero, an extension of the MuZero algorithm that is able to learn in domains with complex action spaces by planning with action sampling [5], performing policy iteration over a subset of sampled actions.
  • MuZero Unplugged, which tackles offline RL optimization both by learning from data (offline) and by using value improvement operators (such as MCTS) when interacting with the environment (online). In one of its ratio settings, it is fully conceived for the offline setting.

Evaluation Metrics

Two main metrics are used to benchmark how good an agent is relative to the others: robustness and Elo. First, the outcome of a game is characterised by the probability that the agent p (player) wins against the other reference agents across all races, estimated by playing matches.

The evaluation first estimates the probability that agent p wins against each reference agent by playing matches over uniformly sampled maps and starting locations. Agent p then plays against all reference agents q to calculate its robustness. An Elo rating system is used to calculate the relative skill of the players. The robustness metric benchmarks against the best opponent: an agent p is more robust the higher its probability of winning against the strongest opponent.
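A rough sketch of the two metrics as described here. Robustness is computed as the win probability against the toughest reference agent, and Elo uses the standard expected-score formula; the paper's exact definitions (for example how draws are handled or how ratings are fitted) may differ, and the win rates and K-factor below are invented.

```python
def robustness(win_prob_vs_reference):
    """Worst-case win probability of agent p over all reference agents q."""
    return min(win_prob_vs_reference.values())


def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Move A's rating towards the observed score (1 win, 0.5 draw, 0 loss)."""
    return rating_a + k * (score_a - elo_expected_score(rating_a, rating_b))


# Hypothetical estimated win probabilities of agent p against reference agents.
win_probs = {"reference_q1": 0.80, "reference_q2": 0.55, "reference_q3": 0.70}

p_robustness = robustness(win_probs)
even_match = elo_expected_score(1500.0, 1500.0)
new_rating = elo_update(1500.0, 1500.0, score_a=1.0)
```

Robustness only looks at the hardest matchup, while Elo summarises results against everyone, so the two metrics can rank agents differently.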

— —

The StarCraft II Unplugged paper takes an interesting approach to this challenge from the offline RL perspective. Looking forward to digging more into it, and enjoy the open source release!

Art is based in StarCraft Scale

[1] StarCraft II Unplugged paper




