SC2LE Blue Moon mini-game and Tencent TStarBots paper ripped [Part I]

gema.parreno.piqueras
4 min read · Oct 23, 2018

In this article we will dive into a paper from Tencent AI Lab, the University of Rochester and Northwestern University, and into a new mini-game developed by myself (Blue Moon) that proposes a learning environment for Tech Tree development.

A significant part of the TStarBots paper (Tencent AI Lab + University of Rochester + Northwestern University) comes down to proposing a hierarchical model of actions inside the huge action space of the StarCraft II Learning Environment.

The human thinking model in the game can be summarized in several levels: macro or global strategy, map control, and battle execution or micro. The macro strategy then unrolls into micro actions and into a certain type of macro-game evolution and micro-management.

This unroll that leads to a certain game evolution is known as the Tech Tree. The Blue Moon mini-game, developed a few months before the Tencent paper was released, is actually based on exploring a scaled-up micro while taking into account the tech development of the adversarial race. The objective of the agent in this learning environment is to maximize the sum of rewards earned through a sequence of adversarial tech development. A good agent will develop tech that responds to and is able to beat the opponent. The mini-game is designed to support development of a "tech-tree" learning algorithm.
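As a purely illustrative sketch of that objective, the return could be computed roughly as below, assuming a hypothetical counter table and per-step observations that the actual mini-game does not necessarily expose:

```python
# Illustrative only: a hypothetical counter table for "tech that beats
# the opponent's latest tech"; not the actual Blue Moon reward.
COUNTERS = {
    "roach_warren": "immortal",
    "spire": "phoenix",
    "hydralisk_den": "colossus",
}

def step_reward(enemy_tech, own_units):
    """+1 when the agent fields a unit that counters the opponent's latest tech."""
    return 1.0 if COUNTERS.get(enemy_tech) in own_units else 0.0

def episode_return(trajectory, gamma=0.99):
    """Discounted sum of rewards over a sequence of (enemy_tech, own_units) steps."""
    return sum((gamma ** t) * step_reward(e, u) for t, (e, u) in enumerate(trajectory))
```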

Protoss tech tree used to develop the Blue Moon mini-game vs. the Zerg tech tree.

Another compelling idea is the optimization achieved by identifying trivial factors: despite the tremendous decision space of SCII, not all decisions matter, so the paper differentiates and isolates the questions.
When to build is classified as a non-trivial question, directly related to the sequential decision-making nature of Reinforcement Learning. Which probe is going to build and where it is going to build are classified as trivial decisions.

This, however, might not work for micro battle management, as the control groups or types of units that attack (which unit attacks which adversary) and where the army is positioned on the map might be decisive. The paper, however, proposes a solution to that issue by making more than 100 macro combat actions tractable through a matrix in which each part of the map can attack another part. Tracking the problem with this structure is what justifies the trivial classification.
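A minimal sketch of that matrix idea, assuming the map is discretised into a small number of regions; the grid size and the action encoding are my own assumptions, not the paper's exact design:

```python
import numpy as np

# Assumed discretisation: the map is split into N_REGIONS regions.
N_REGIONS = 12

# attack[i, j] == 1 means "units grouped in region i attack region j"
attack = np.zeros((N_REGIONS, N_REGIONS), dtype=np.int8)

def macro_attack(src_region, dst_region):
    """Encode one macro combat action as a single cell of the matrix."""
    attack[src_region, dst_region] = 1
    return ("ATTACK", src_region, dst_region)

# Example: everything gathered in region 3 attacks region 7.
print(macro_attack(3, 7))
```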

The Macro-action design space

In this step the research team simplifies the action space by encoding some rules and hiding trivial decisions from the pool of actions, grouping a whole macro-building sequence into one action (Build Roach Warren = Move Camera + Random worker selector + … + Build Roach Warren).
One of the things that draws attention is that the high-level strategy actions add up to 65, while the combat actions amount to more than 100.
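Roughly, one such macro action could look like the sketch below, where scripted rules answer the trivial "which worker / where" questions and only the decision that matters is exposed to the policy. Function and field names are illustrative, not the real pysc2 API:

```python
import random

def build_roach_warren(obs):
    """Macro action = camera move + worker pick + placement + build command."""
    steps = []
    steps.append(("move_camera", obs["base_location"]))       # trivial
    worker = random.choice(obs["available_workers"])          # trivial: any worker will do
    steps.append(("select_unit", worker))
    spot = obs["free_building_spots"][0]                      # trivial: first valid spot
    steps.append(("build", "RoachWarren", spot))              # the part the agent cares about
    return steps
```

The RL policy then only decides when to emit `build_roach_warren`, not which worker performs it or where the structure goes.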

Learning Algorithms and Neural Network Architectures

TStarBot1 is trained with Dueling Double Deep Q-learning (DDQN) (you can implement your own with a mini-game in the pysc2 Codelab) and Proximal Policy Optimization (PPO), with a relatively simple neural network architecture and a distributed rollout infrastructure.
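For reference, a minimal dueling Q-network head looks like this in PyTorch; layer sizes are arbitrary assumptions, not the architecture the Tencent team used:

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantages A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=-1, keepdim=True)
```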

TStarBot2 adopts a multi-agent-style methodology with different controllers for macro actions (which represent high-level tactics) and micro actions. Each one has a policy conditioned on its parent macro controller, and each controller sees only a local action and observation set, which allows the action structure to be captured better and irrelevant information to be dismissed.
These controllers or modules have been scripted in order to be tested, but are not currently built on Reinforcement Learning principles.

The paper seems to suggest that each controller is treated as a module, mimicking other structures like UAlbertaBot: the combat strategy and the production strategy are related to macro actions, while they branch into Combat, Scout, Resource and Building modules. Each module, whether macro or micro, is embedded into a DataContext in order to share a data structure, communicate, and exchange information for making better decisions.

Each module might be solved with a different machine learning approach.
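A rough sketch of what that modular structure could look like, with a shared DataContext acting as a blackboard; class and attribute names are my own reading of the paper, not Tencent's code:

```python
from dataclasses import dataclass, field

@dataclass
class DataContext:
    """Shared structure the modules read from and write to every step."""
    observation: dict = field(default_factory=dict)
    commands: list = field(default_factory=list)

class Module:
    def step(self, dc: DataContext):
        raise NotImplementedError

class CombatStrategy(Module):   # macro: picks Rush / Economy first / Timing attack...
    def step(self, dc):
        dc.commands.append(("attack_region", 7))

class Combat(Module):           # micro: turns macro commands into unit-level orders
    def step(self, dc):
        for cmd in dc.commands:
            pass  # issue hit-and-run, cover-attack, etc. per unit

def agent_step(modules, dc):
    for m in modules:           # macro modules run before micro ones
        m.step(dc)
```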

Combat strategy

This module groups the army into squads and chooses among several well-known strategies for attacking the enemy: Rush (launch a quick attack), Economy first (launch the attack after accumulating a large number of squads or a more evolved tech), Timing attack (build a specific type of army and then attack), Reform, and Harass.
Keywords for the Combat strategy module: macro, strategy.
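A toy selector for those strategies might look as follows; the trigger conditions are guesses for illustration, not the rules used in the paper:

```python
from enum import Enum, auto

class Strategy(Enum):
    RUSH = auto()
    ECONOMY_FIRST = auto()
    TIMING_ATTACK = auto()
    REFORM = auto()
    HARASS = auto()

def choose_strategy(game_time_s, n_squads, key_army_ready):
    if game_time_s < 240:
        return Strategy.RUSH            # quick early attack
    if key_army_ready:
        return Strategy.TIMING_ATTACK   # hit once the specific army is out
    if n_squads >= 8:
        return Strategy.ECONOMY_FIRST   # attack after massing squads
    return Strategy.REFORM              # otherwise regroup (or harass)
```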

Combat

This module focuses on unit-level manipulation to effectively let each unit fight against the enemy. It implements some basic human-like micro-management tactics, such as hit-and-run, cover-attack, etc. These micro actions could certainly be learnt in the near future by RL agents with purpose-built mini-games.
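A minimal hit-and-run rule could be written like this; the unit attributes and return format are assumptions for illustration:

```python
def hit_and_run(unit, enemy, retreat_point):
    """Attack while the weapon is ready, kite away while it is on cooldown."""
    if unit.weapon_cooldown == 0 and unit.in_range(enemy):
        return ("attack", enemy.tag)       # fire while the weapon is ready
    if enemy.in_range(unit):
        return ("move", retreat_point)     # back off while reloading
    return ("move", enemy.position)        # otherwise close the distance
```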

Production Strategy

This module manages tech upgrading, resource harvesting and building/unit production. Different choices regarding the Tech Tree might lead to different unit productions, and these are correlated with the resource-harvesting commands.
In this part of the paper, the Tencent team talks about an opening build order and a goal-planning function.

From this module derive the Building and the Resource modules, which follow a mineral-first or gas-first approach through a high-level command called "resource type priority" that tells which resource should be collected first. Besides, there is a module that handles expansion, going for more resources at the nearest bases.
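A small sketch of how such a "resource type priority" command could steer worker assignment; the goal format and the decision rule are assumptions, not values taken from the paper:

```python
def resource_priority(goal):
    """Return which resource to prioritise given the next production goal."""
    gas_heavy = goal.get("vespene_cost", 0) > goal.get("mineral_cost", 0)
    return "gas_first" if gas_heavy else "mineral_first"

def assign_workers(workers, priority, geysers, mineral_patches):
    """Spread workers over the prioritised resource type."""
    targets = geysers if priority == "gas_first" else mineral_patches
    return [(w, targets[i % len(targets)]) for i, w in enumerate(workers)]
```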

The Tencent innovation comes more from hierarchy management than from a new Reinforcement Learning perspective.

Stay tuned for Part II, in which we will dig into the experiments and results of the research.
