QMIX paper ripped: Monotonic Value Function Factorization for Deep Multi-agent Reinforcement Learning in StarCraft II

gema.parreno.piqueras
May 11, 2019
‘I knew you would find your way here… eventually’ — Queen of Blades to Zeratul

StarCraft has been present as a machine learning research environment since Brood War. A couple of years ago, DeepMind released pysc2, a research environment for StarCraft II, and later, in 2019, the Whiteson Research Lab at Oxford open-sourced SMAC, a multi-agent environment built on top of pysc2 with a cooperative setup, meaning that multiple agents cooperate towards a common goal. This environment differs from pysc2 itself: it is not conceived to address the challenge of the full ladder game, but to provide a new, bounded set of multi-agent problems in a cooperative setup, focusing on the micromanagement challenge.

Today we will dive into the overall QMIX algorithm and architecture.

Multi-agent Reinforcement Learning (MARL) deals with challenges such as the curse of dimensionality in the action space, the multi-agent credit assignment problem, and modelling the information state of other agents. Much of the innovation in MARL lies in how to calculate, represent and use the action-value function that most RL methods learn.

This multi-agent setup changes the paradigm with respect to MDPs, which naturally evolve into multi-agent MDPs: several agents interact with the same environment, which makes the environment non-stationary, meaning that the way the environment evolves also depends on the behaviour of the other agents. Besides, being cooperative means there is a shared team reward, so ultimately the focus is on a coordination problem.

The QMIX paper explores a hybrid value-based multi-agent reinforcement learning method, adding a constraint and a mixing network structure in order to make learning stable, faster and ultimately better in a controlled setup.

Decentralised execution means that each agent makes individual decisions based on information observed locally (and, in some settings, on messages received from its neighbours). Centralised training refers to the learning method, which can exploit information that is only available during training. So we define a joint action-value function Qtot (learned centrally) and an individual action-value function Qa for each one of the agents.

QMIX: in between COMA and VDN

QMIX is a hybrid approach that can represent a richer class of action-value functions. QMIX takes ideas from COMA in order to address the multi-agent credit assignment problem, and proposes constraints that overcome VDN's limitations in order to deal with the curse of dimensionality in actions.
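For context, VDN factors the joint action-value function as a plain sum of per-agent utilities, a much more restrictive structure than the one QMIX can represent:

\[
Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a)
\]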

One of the first main ideas is to enforce a constraint on the relationship between the global action-value function and the action-value function of each one of the agents: monotonicity, meaning here that Qtot is non-decreasing in each Qa (its partial derivative with respect to each agent's value is non-negative). This allows a richer factorisation than the purely additive one used by VDN.
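In the paper's notation, the constraint reads:

\[
\frac{\partial Q_{tot}}{\partial Q_a} \ge 0, \quad \forall a \in \{1, \dots, n\}
\]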

This constraint allows each agent to participate in decentralised execution by simply choosing greedy actions with respect to its own value function Qa.
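The reason is that, under the monotonicity constraint, a global argmax performed on Qtot yields the same joint action as a set of individual argmax operations performed on each Qa:

\[
\arg\max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) =
\begin{pmatrix}
\arg\max_{u^1} Q_1(\tau^1, u^1) \\
\vdots \\
\arg\max_{u^n} Q_n(\tau^n, u^n)
\end{pmatrix}
\]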

The overall QMIX architecture.

The overall architecture of QMIX shows three differentiated parts:

Qa represents the agent network, computed for each one of the agents that act cooperatively
  • Agent Networks: For each agent a, there is an agent network that represents its individual value function Qa. At each time step it receives the current observation and the last action as input and returns a Q-value for each action. The network topology belongs to the DRQN family and makes use of a GRU, which facilitates learning over longer timescales and probably helps it converge faster.

Meaning that if we are dealing with an environment with, for example, four agents, we might have four RNN agents. The implementation is defined in PyMARL, in PyTorch:

import torch.nn as nn
import torch.nn.functional as F

class RNNAgent(nn.Module):
    def __init__(self, input_shape, args):
        super(RNNAgent, self).__init__()
        self.args = args
        self.fc1 = nn.Linear(input_shape, args.rnn_hidden_dim)
        self.rnn = nn.GRUCell(args.rnn_hidden_dim, args.rnn_hidden_dim)
        self.fc2 = nn.Linear(args.rnn_hidden_dim, args.n_actions)

    def init_hidden(self):
        # make hidden states on the same device as the model
        return self.fc1.weight.new(1, self.args.rnn_hidden_dim).zero_()

    def forward(self, inputs, hidden_state):
        # one fully connected layer with ReLU, one GRU step, then per-action Q-values
        x = F.relu(self.fc1(inputs))
        h_in = hidden_state.reshape(-1, self.args.rnn_hidden_dim)
        h = self.rnn(x, h_in)
        q = self.fc2(h)
        return q, h
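As a minimal usage sketch (the args namespace, the observation size of 30 and the four-agent count below are hypothetical, just to make the snippet runnable):

import torch
from types import SimpleNamespace

# Hypothetical hyperparameters, for illustration only
args = SimpleNamespace(rnn_hidden_dim=64, n_actions=10)
obs_dim = 30  # size of the local observation plus last-action encoding (environment dependent)

agents = [RNNAgent(obs_dim, args) for _ in range(4)]   # e.g. four cooperating agents
hidden = [agent.init_hidden() for agent in agents]

obs = torch.randn(1, obs_dim)                          # dummy input for agent 0 at one time step
q_values, hidden[0] = agents[0](obs, hidden[0])        # per-action Q-values and the new hidden state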

The learning phase is coded in the PyMARL framework as well.
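Concretely, the whole architecture is trained end to end with a DQN-style TD loss on Qtot, as given in the paper, where θ⁻ are the parameters of a target network and b is the batch of episodes sampled from the replay buffer:

\[
\mathcal{L}(\theta) = \sum_{i=1}^{b} \left( y_i^{tot} - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s; \theta) \right)^2,
\qquad
y^{tot} = r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}', s'; \theta^-)
\]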

  • Mixing Network: A feedforward network that takes the agents' outputs (the Qa action-value functions of each one of the agents) and mixes them into the total action-value function Qtot.

Here comes the interesting part: the weights of the mixing network are produced by separate hypernetworks, meaning that there is a neural network that generates the weights for another network. Each hypernetwork takes the global state as input and outputs a vector, which is reshaped into a weight matrix of the appropriate size. The use of hypernetworks (with outputs forced to be positive) makes it possible to condition the weights of the monotonic mixing network on the state.

Hypernetwork that calculates the weights of the Mixing Network, forced to be positive

In order to enforce the constraint described above, the mixing network's weights are restricted to be non-negative (an absolute value is applied to the hypernetwork outputs).

In red are the hypernetworks that calculate the weights (forced to be positive) and the biases (not forced to be positive) of the Mixing Network. This idea of a hypernetwork combined with the constraint defined above might be the key to the overall better performance in a controlled setup.
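Before looking at the real implementation, here is a toy sketch of the hypernetwork idea (all sizes are made up for illustration): a network's output vector is reshaped into the weight matrix of another layer, and an absolute value keeps those generated weights non-negative:

import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only
state_dim, n_agents, embed_dim = 48, 4, 32

# The hypernetwork: maps the global state to a flat vector of weights
hyper_w = nn.Linear(state_dim, n_agents * embed_dim)

s = torch.randn(1, state_dim)                         # dummy global state
w = torch.abs(hyper_w(s)).view(n_agents, embed_dim)   # abs() enforces non-negative weights
agent_qs = torch.randn(1, n_agents)                   # Q-values coming from the agent networks
hidden = agent_qs @ w                                 # the generated matrix acts as a layer's weights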

The implementation of the Mixing Network, QMixer, comes below:

import numpy as np
import torch as th
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    def __init__(self, args):
        super(QMixer, self).__init__()
        self.args = args
        self.n_agents = args.n_agents
        self.state_dim = int(np.prod(args.state_shape))
        self.embed_dim = args.mixing_embed_dim
        # Hypernetworks that generate the mixing network weights from the state
        self.hyper_w_1 = nn.Linear(self.state_dim, self.embed_dim * self.n_agents)
        self.hyper_w_final = nn.Linear(self.state_dim, self.embed_dim)
        # State-dependent bias for the hidden layer
        self.hyper_b_1 = nn.Linear(self.state_dim, self.embed_dim)
        # V(s) instead of a bias for the last layer
        self.V = nn.Sequential(nn.Linear(self.state_dim, self.embed_dim),
                               nn.ReLU(),
                               nn.Linear(self.embed_dim, 1))
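The __init__ above is only half of the module; a sketch of the corresponding forward pass (following the structure of the PyMARL implementation, using the th and F imports above) shows how the hypernetwork outputs are reshaped, forced to be positive with an absolute value, and applied to the agents' Q-values:

    def forward(self, agent_qs, states):
        bs = agent_qs.size(0)
        states = states.reshape(-1, self.state_dim)
        agent_qs = agent_qs.view(-1, 1, self.n_agents)
        # First mixing layer: weights and bias are generated from the global state
        w1 = th.abs(self.hyper_w_1(states)).view(-1, self.n_agents, self.embed_dim)
        b1 = self.hyper_b_1(states).view(-1, 1, self.embed_dim)
        hidden = F.elu(th.bmm(agent_qs, w1) + b1)
        # Final mixing layer, plus a state-dependent V(s) instead of a bias
        w_final = th.abs(self.hyper_w_final(states)).view(-1, self.embed_dim, 1)
        v = self.V(states).view(-1, 1, 1)
        # Monotonic mix of the agents' Q-values into Qtot
        y = th.bmm(hidden, w_final) + v
        return y.view(bs, -1, 1)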

The overall architecture can be seen below: each agent network's action-value function is an input to the mixing network, which produces the monotonic joint value function Qtot, the main idea behind this paper.

Overall architecture of QMIX, in which each DRQN agent feeds into a Mixing Network that uses constrained hypernetworks to calculate the Global Action-Value function
‘We adapt’— Queen of Blades

Thanks for reaching this point and for learning more about the cooperative multi-agent setup and QMIX. Don't hesitate to read the paper to learn more about it.

I would also encourage you to visit SMAC and the implementations in the PyMARL framework!

gl&hf!

Want to get started with the StarCraft II Learning Environment? Visit this codelab. Here you can learn more about what I work on and where.
