StarCraft II Unplugged: Offline Reinforcement Learning

  • Data source. The dataset must not be biased by data generated by RL agents. One of the challenges in offline reinforcement learning is ensuring that the dataset contains a diverse and rich set of gameplay strategies [3], which the paper refers to as coverage. The authors of [3] also describe this challenge as ensuring the presence of high-reward regions of the state space.
  • Large, hierarchical and structured action space. To play correctly, the agent must select an action, the units to apply that action to, and the place on the map where the action is performed. In addition, there are approximately 10^26 possible actions per step. From the coverage perspective, the challenge is to select adequate hierarchical actions, in sequence, so that the diversity of game strategies is preserved.
  • Stochastic environment. In a stochastic environment, the agent's next state cannot always be determined from its current state and action: taking an action does not guarantee that the agent ends up in the state it intended. Such environments may need more trajectories to reach high state-action coverage.
  • Partial observability. Because of the fog of war, the agent does not know what the opponent is doing unless it explores the environment. Information gathered while exploring may only become useful later in the game, so the agent may need memory to ensure coverage.
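To make the idea of state-action coverage a bit more concrete, here is a minimal sketch on a toy problem. The function name `state_action_coverage`, the state labels and the action labels are all hypothetical and chosen only for illustration. In StarCraft II the state and action spaces are far too large to enumerate like this, so treat it as an intuition aid rather than a usable metric.

```python
from itertools import product


def state_action_coverage(dataset, states, actions):
    """Fraction of all (state, action) pairs that appear at least once in the dataset."""
    seen = {(s, a) for s, a in dataset}
    all_pairs = set(product(states, actions))
    return len(seen & all_pairs) / len(all_pairs)


# Toy example (hypothetical labels): 3 states x 2 actions = 6 possible pairs.
states = ["s0", "s1", "s2"]
actions = ["attack", "retreat"]

# The dataset repeats one pair, so only 3 distinct pairs are covered.
dataset = [("s0", "attack"), ("s0", "attack"), ("s1", "retreat"), ("s2", "attack")]

coverage = state_action_coverage(dataset, states, actions)
print(coverage)  # 0.5
```

Adding more copies of the same pair does not raise the number: coverage only improves when the data visits new parts of the state-action space.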

Offline Reinforcement Learning Fundamentals

Fig 1: The main idea in offline RL is to train a policy from a dataset D that is not altered during training. In the training phase, the policy is learned from the dataset without any interaction with the environment. In the deployment phase, the learned policy is tested in the environment. This cycle can then be repeated to train new policies. This image is adapted from ref [2] to the StarCraft II environment.
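The training phase can be sketched with the simplest possible offline learner: a tabular behaviour cloning policy that copies the most frequent action in each state. This is only an illustration under strong assumptions (a tiny discrete problem and hypothetical state and action names), not the method used by the reference agents. The point is that the fixed dataset D is only read, and the environment is never queried during training.

```python
from collections import Counter, defaultdict


def behaviour_cloning(dataset):
    """Fit a tabular policy: for each state, pick the action seen most often in the data."""
    counts = defaultdict(Counter)
    for state, action in dataset:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


# Fixed dataset D of (state, action) pairs. It is a tuple, so training cannot modify it.
D = (
    ("early_game", "build_probe"),
    ("early_game", "build_probe"),
    ("early_game", "scout"),
    ("under_attack", "defend"),
)

# Training phase: no interaction with the environment, only reads from D.
policy = behaviour_cloning(D)

# Deployment phase would query the learned policy inside the game.
print(policy["early_game"])  # build_probe
```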

The Dataset

  • 1.4 million human games from players rated 3500 MMR or higher (the top 22% of players), giving 2.8 million episodes in the dataset and representing more than 30 years of game time in total.
  • The average two-player game lasts about 11 minutes.
  • Keeping only the frames that contain actions shortens each episode by a factor of about 12.
  • Some of the algorithms in the benchmark are fine-tuned on a separate, high-quality dataset that contains only winning games from players with MMR > 6200.
Fig 2. The dataset is a set of state-action tuples organized into rollouts: sequences of K consecutive timesteps, assembled into a minibatch of M independent rollouts. This follows the structure used by other offline RL initiatives such as D4RL. With K = 1, each rollout is a single timestep; with a larger K, each rollout covers a longer stretch of a trajectory. The minibatch size M and the rollout length K both influence the final behaviour cloning performance.
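A minimal sketch of how such minibatches could be assembled is shown below. The helper `sample_minibatch` and the toy episodes are hypothetical and are not taken from the benchmark's data pipeline. Each rollout is a window of K consecutive timesteps from one episode, and the M rollouts are drawn independently of each other.

```python
import random


def sample_minibatch(episodes, K, M, rng):
    """Sample M independent rollouts, each made of K consecutive timesteps."""
    eligible = [ep for ep in episodes if len(ep) >= K]
    if not eligible:
        raise ValueError("no episode is long enough for a rollout of length K")
    batch = []
    for _ in range(M):
        episode = rng.choice(eligible)                 # pick an episode at random
        start = rng.randrange(len(episode) - K + 1)    # pick a valid window start
        batch.append(episode[start:start + K])         # K consecutive timesteps
    return batch


# Toy data: 3 episodes of 10 (state, action) tuples each, labelled by episode and timestep.
episodes = [[(f"s{e}_{t}", f"a{e}_{t}") for t in range(10)] for e in range(3)]

batch = sample_minibatch(episodes, K=4, M=8, rng=random.Random(0))
print(len(batch), len(batch[0]))  # 8 4
```

Setting K = 1 reduces every rollout to a single state-action pair, while a larger K gives the model temporal context at the cost of fewer independent samples per batch element.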

Observation and Action Space


Fig 4. How the agent sees, or interprets, StarCraft: an example of the observations at timestep t, based on how the API processes the game interface. The figure shows three hierarchical levels: a global world tensor of 128 x 128 pixels that includes minimap information, referred to as the input feature planes; information about units, given as a nested list of up to 512 units with 43 features each; and scalar inputs. The scalar inputs, unit features and unit argument inputs are global information taken from the game interface. The feature maps have been simplified using a capture of the Simple64 map, but they still give an overview of how the agent would see the Nexus and the probes on the screen.
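As a rough sketch, the three observation levels could be held in a container like the one below. The class and field names are hypothetical and do not come from pysc2. The number of feature planes and scalars is left as a free parameter, and reading the figures 43 and 512 as "features per unit" and "maximum number of units" is an assumption.

```python
from dataclasses import dataclass
from typing import List

WORLD_SIZE = 128      # world tensor is 128 x 128
MAX_UNITS = 512       # assumed upper bound on the unit list
UNIT_FEATURES = 43    # assumed number of features per unit


@dataclass
class Observation:
    feature_planes: List[List[List[float]]]  # indexed as [plane][y][x]
    units: List[List[float]]                 # one feature vector per unit
    scalars: List[float]                     # global scalar inputs


def make_observation(num_planes, num_units, num_scalars):
    """Build a zero-filled observation with the shapes described in the caption."""
    if num_units > MAX_UNITS:
        raise ValueError("too many units for one observation")
    return Observation(
        feature_planes=[[[0.0] * WORLD_SIZE for _ in range(WORLD_SIZE)]
                        for _ in range(num_planes)],
        units=[[0.0] * UNIT_FEATURES for _ in range(num_units)],
        scalars=[0.0] * num_scalars,
    )


def shapes(obs):
    """Report the shape of each observation level."""
    planes = obs.feature_planes
    return {
        "feature_planes": (len(planes), len(planes[0]), len(planes[0][0])),
        "units": (len(obs.units), len(obs.units[0])),
        "scalars": (len(obs.scalars),),
    }


obs = make_observation(num_planes=2, num_units=12, num_scalars=5)
print(shapes(obs))
```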


Fig 5. Each action is encoded through the StarCraft II API (pysc2) as a function with 7 arguments. Each argument answers a question about the specific action, such as when or where exactly it should be performed. The arguments are computed sequentially, after the function has been selected.
Fig 6. Summary of one action: moving a Zealot to a given position on the map. The action space is divided into general outputs: the scalars answer the question of WHAT should be done, the units tell us WHO should do it, and the world tells us WHERE it should be done.
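The WHAT / WHO / WHERE split can be illustrated with a small record type. This is a simplified sketch: the class `Action`, its field names and the unit tag are hypothetical, only a subset of the arguments is modelled, and none of it is the actual pysc2 interface.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Action:
    # WHAT: scalar outputs (hypothetical field names)
    function: str
    queued: bool
    # WHO: the units that carry out the action
    unit_tags: Tuple[int, ...]
    # WHERE: target position in world coordinates, if the function needs one
    world: Optional[Tuple[int, int]] = None

    def describe(self):
        """Summarize the action as a WHAT / WHO / WHERE sentence."""
        mode = "queued" if self.queued else "immediate"
        where = f" at {self.world}" if self.world is not None else ""
        return f"{self.function} ({mode}) by units {list(self.unit_tags)}{where}"


# Move one Zealot (hypothetical tag 101) to position (42, 17) on the map.
move = Action(function="Move", queued=False, unit_tags=(101,), world=(42, 17))
print(move.describe())  # Move (immediate) by units [101] at (42, 17)
```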

Neural Network Architecture

Fig 6. Overall architecture used for all the reference agents. Reading from bottom to top: observations at three hierarchical levels (feature planes, units and vectors) are processed by trainable modules and by fixed operations to produce actions. The actions are executed sequentially, from left to right. To produce each action, a logit module outputs the categorical distribution from which the selected action is drawn.
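The role of a logit module can be sketched in a few lines: raw scores (logits) are turned into a categorical distribution with a softmax, and the action is then sampled from it. The functions below are a generic illustration with made-up logits, not the architecture's actual code.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1 (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_categorical(logits, rng):
    """Sample an index from the categorical distribution defined by the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]          # made-up scores for three candidate choices
probs = softmax(logits)
choice = sample_categorical(logits, rng)
print([round(p, 3) for p in probs], choice)
```

In an autoregressive action head, a step like this would be repeated once per argument, with each sampled value fed into the computation of the next one.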

Agents & Evaluation Metrics

Figure from the NeurIPS 2021 Deep RL Workshop

Sampled MuZero & MuZero Unplugged: Offline RL improved in the Online setting

  • Sampled MuZero, an extension of the MuZero algorithm that can learn in domains with complex action spaces by planning over sampled actions [5], performing policy iteration over a subset of sampled actions.
  • MuZero Unplugged, which tackles offline RL optimization both by learning from data (offline) and by using value improvement operators (such as MCTS) when interacting with the environment (online). In one of its ratio settings, it is designed entirely for the offline setting.
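The core idea of working on a sampled subset of actions can be shown with a deliberately simplified one-step example. This is not MCTS and not the Sampled MuZero algorithm itself: it only samples a few candidate actions from a prior policy and then picks the best of them according to a value estimate. The function name, the prior and the value table are all hypothetical.

```python
import random


def sampled_greedy_action(prior, q_value, num_samples, rng):
    """Sample candidate actions from the prior, then pick the best candidate by q_value."""
    actions = list(prior)
    weights = [prior[a] for a in actions]
    # Only the sampled subset is ever evaluated, never the full action space.
    sampled = set(rng.choices(actions, weights=weights, k=num_samples))
    best = max(sampled, key=q_value)
    return best, sampled


# Hypothetical prior policy and value estimates over four actions.
prior = {"attack": 0.4, "expand": 0.3, "scout": 0.2, "retreat": 0.1}
values = {"attack": 0.2, "expand": 0.9, "scout": 0.5, "retreat": -0.3}

best, sampled = sampled_greedy_action(prior, values.get, num_samples=3, rng=random.Random(0))
print(best, sorted(sampled))
```

With roughly 10^26 possible actions per step, enumerating every action is not an option, which is why restricting the improvement step to sampled candidates matters.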

Evaluation Metrics


