A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning

gema.parreno.piqueras
4 min read · Feb 13, 2019

Today we will dig into the paper A Unified Game-Theoretic Approach to Multi-agent Reinforcement Learning, one of the core ideas used in the development of #AlphaStar. There are several concepts in AlphaStar that won't be treated here. The aim is to dig into what has been called the “Nash League”, how it works conceptually, and how game theory came to mix with reinforcement learning.

At the end of this article you should have a notion of the Double Oracle algorithm, Deep Cognitive Hierarchies and Policy-Space Response Oracles.

For this post you should be familiar with some game theory concepts, such as the setup of a strategic game in the form of a payoff matrix, Nash equilibria and best responses. You can visit some conceptual implementations and a numpy Python implementation here.

Why a Multi-agent Context?

In #AlphaStar, a multi-agent setup was developed to improve the Agent's strategic performance through self-play: the strategy improves within a population of agents trained with a reinforcement learning approach. MARL (multi-agent reinforcement learning) bases its iterative improvement on approximate best responses to mixtures of policies generated using deep reinforcement learning.

To dig into multi-agent RL we also need the notation of a normal-form game from game theory: a tuple (π, U, n), where n is the number of players, U the payoff (utility) functions and π the policies.
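To make that tuple concrete, here is a toy two-player example (matching pennies, not taken from the paper) written directly as numpy payoff matrices; the variable names are just for illustration.

import numpy as np

# Toy normal-form game (matching pennies), to make the (π, U, n) tuple concrete.
# n: number of players, U: one payoff matrix per player, π: a mixed strategy per player.
n = 2
U = [
    np.array([[+1, -1], [-1, +1]]),   # payoffs for player 0
    np.array([[-1, +1], [+1, -1]]),   # payoffs for player 1 (zero-sum)
]
pi = [np.array([0.5, 0.5]), np.array([0.5, 0.5])]   # uniform mixed strategies

# Expected utility of player 0 under the joint mixed strategy.
print(pi[0] @ U[0] @ pi[1])   # 0.0 at the uniform (Nash) strategies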

Double Oracle Algorithm

The DO algorithm and its generalization are one of the core ideas for understanding the reinforcement learning process of multi-agent systems.

The Double Oracle algorithm computes the payoff matrix of a subgame iteratively, using deep neural networks as function approximators. The algorithm works on an iterative sequence of subgames Gt, each presented as a payoff matrix. At each timestep t, an equilibrium strategy (σ) of Gt is calculated, together with a best response to it in the form of a policy (π), which is added to form Gt+1.

Visual interpretation of the DO algorithm. Authors of the DO algorithm: Bosansky, Lisy, Cermak, Vitek, Pechoucek

As a significant detail, it should be noted that an approximate best response is calculated here instead of the exact best response, because it is computationally feasible and gives good enough results.
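To make the loop above concrete, here is a minimal sketch of double oracle on a small zero-sum matrix game. It is not the paper's implementation: the equilibrium of the subgame is approximated with fictitious play as a simple stand-in for a proper solver, and the exact best responses are precisely the step the paper replaces with approximate best responses learned via deep RL.

import numpy as np

def solve_restricted_nash(sub, iters=2000):
    """Crude stand-in for an equilibrium solver on the restricted zero-sum
    subgame G_t: fictitious play, whose empirical frequencies approximate a
    Nash equilibrium. A real implementation would use an LP or a meta-solver."""
    m, k = sub.shape
    row_counts, col_counts = np.ones(m), np.ones(k)
    for _ in range(iters):
        row_counts[np.argmax(sub @ (col_counts / col_counts.sum()))] += 1
        col_counts[np.argmin((row_counts / row_counts.sum()) @ sub)] += 1
    return row_counts / row_counts.sum(), col_counts / col_counts.sum()

def double_oracle(payoff, iters=50):
    """Sketch of the double oracle loop on a zero-sum matrix game, where
    `payoff` holds the row player's utilities."""
    rows, cols = [0], [0]                      # start from arbitrary pure strategies
    for _ in range(iters):
        sub = payoff[np.ix_(rows, cols)]       # current subgame G_t as a payoff matrix
        sigma_row, sigma_col = solve_restricted_nash(sub)   # equilibrium sigma of G_t
        # Lift the restricted equilibrium back to mixed strategies over the full game.
        full_row = np.zeros(payoff.shape[0]); full_row[rows] = sigma_row
        full_col = np.zeros(payoff.shape[1]); full_col[cols] = sigma_col
        # Best responses to sigma, added to form G_{t+1}.
        br_row = int(np.argmax(payoff @ full_col))
        br_col = int(np.argmin(full_row @ payoff))
        if br_row in rows and br_col in cols:  # nothing new to add: converged
            break
        rows, cols = sorted(set(rows + [br_row])), sorted(set(cols + [br_col]))
    return full_row, full_col

# Example: rock-paper-scissors; the returned strategies approach the uniform equilibrium.
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
print(double_oracle(rps))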

For GENERALIZATION and SCALE, two improvements are used: Policy-Space Response Oracles (PSRO) and Deep Cognitive Hierarchies (DCH).

Policy-Space Response Oracles (PSRO)

The algorithm is a generalization of Double Oracle in which the meta-game's choices are policies rather than actions. In this case we sample an initial policy, compute the utility function, and initialize the meta-strategy, a distribution over policies.

The meta-game starts with a single policy and grows at each epoch by adding policies (“oracles”) that approximate best responses to the meta-strategies of the other players.

Note that this algorithm suffers from scalability problems: within each epoch, for each player and each episode, a new policy π and a new meta-strategy are computed for every player. Instead of computing the exact best response, an approximate best response is calculated using RL.

Normal-form game described by a tuple of (Policies, Utility functions, number of players). INPUT → Policies from players. OUTPUT → Solution strategy for each player

At a high level, the algorithm samples multiple policies from multiple players and applies the Double Oracle algorithm within a reinforcement learning process. The output is a distribution over the different policies.
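A pseudocode-style sketch of that loop is below. The helpers oracle_rl_best_response, estimate_payoffs and meta_solver are hypothetical names standing in for the deep RL oracle training, the empirical payoff estimation of the meta-game, and the meta-strategy solver respectively; they are not defined here.

def psro(initial_policies, num_players, epochs):
    """Rough sketch of Policy-Space Response Oracles under assumed helpers.
    policies[i] is the growing list of policies for player i."""
    policies = [[p] for p in initial_policies]
    meta_payoffs = estimate_payoffs(policies)       # empirical payoffs of the meta-game
    meta_strategy = meta_solver(meta_payoffs)       # distribution over each player's policies
    for epoch in range(epochs):
        for player in range(num_players):
            # Train a new oracle policy for this player against opponents sampled
            # from the current meta-strategy (approximate best response via RL).
            new_policy = oracle_rl_best_response(player, policies, meta_strategy)
            policies[player].append(new_policy)
        meta_payoffs = estimate_payoffs(policies)   # the meta-game grows with the new policies
        meta_strategy = meta_solver(meta_payoffs)   # re-solve the expanded meta-game
    return policies, meta_strategy                  # output: a distribution over policies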

Deep Cognitive Hierarchies (DCH)

DCH approximates PSRO, trading accuracy for scalability. The reinforcement learning step can take a long time to converge to a good result, so a parallel form of PSRO is proposed: instead of epochs, K levels are chosen in advance; each one trains a single oracle policy, and the processes run in parallel, saving their results to disk periodically. Each process also runs the reinforcement learning step explained above.
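A very rough sketch of one such process is shown below, assuming hypothetical helpers (init_policy, init_meta_strategy, load_policies_below, rl_update, update_meta_strategy, save); the point is only the structure: each of the K levels trains a single oracle policy in parallel, periodically reloading the policies saved by the levels below it and saving its own to disk.

def dch_level_worker(player, level, checkpoint_dir, steps, save_every=1000):
    """One DCH process: trains a single oracle policy for (player, level),
    responding to the policies periodically saved by the lower levels.
    All helper functions used here are hypothetical placeholders."""
    policy = init_policy(player)
    meta_strategy = init_meta_strategy(level)
    for step in range(steps):
        lower = load_policies_below(checkpoint_dir, level)   # reload lower-level oracles
        policy = rl_update(policy, opponents=(lower, meta_strategy))
        meta_strategy = update_meta_strategy(meta_strategy, policy, lower)
        if step % save_every == 0:
            save(checkpoint_dir, player, level, policy, meta_strategy)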

To measure overfitting to co-trained partners in the learning process, joint policy correlation (JPC) matrices are calculated, using different seeds to initialize the random number generators.
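A small sketch of how such a matrix could be computed for a two-player game: train several independent instances with different seeds, then cross-evaluate player 1 from instance i against player 2 from instance j. Here train_instance and evaluate_pair are hypothetical helpers for the training run and the evaluation episodes.

import numpy as np

def joint_policy_correlation(num_instances, episodes=100):
    """Sketch of a JPC matrix: entry (i, j) is the mean return when player 1
    comes from training instance i and player 2 from training instance j."""
    instances = [train_instance(seed=s) for s in range(num_instances)]   # (p1, p2) per seed
    jpc = np.zeros((num_instances, num_instances))
    for i, (p1, _) in enumerate(instances):
        for j, (_, p2) in enumerate(instances):
            jpc[i, j] = evaluate_pair(p1, p2, episodes)   # average return over episodes
    # Diagonal entries: partners trained together; off-diagonal: partners from different runs.
    # A large gap between the two indicates overfitting to the co-trained partner.
    return jpc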

Conclusions and further reading:

Thanks for reaching this point! By now, you should be familiar with the Double Oracle algorithm, Policy-Space Response Oracles and Deep Cognitive Hierarchies. Some concepts, such as meta-solvers, were left out of this article, which does not mean they are not worth studying, as they reduce the algorithm's computational complexity.

Other readings, such as A Generalized Method for Empirical Game Theoretic Analysis and Asymmetric Multi-agent Reinforcement Learning, might help bring more ideas about asymmetric solutions in pure strategies.
