About communication in Multi-Agent Reinforcement Learning

gema.parreno.piqueras
Jun 4, 2019

Communication is one of the components of MARL and an active area of research in itself, as it can influence the final performance of agents and directly affects coordination and negotiation. Effective communication is essential for agents to interact successfully, solving the challenges of cooperation, coordination, and negotiation between several agents.

Most research in multi-agent systems has tried to address the communication needs of an agent: what information to send, when, and to whom, resulting in strategies that are optimized for the specific application for which they are adopted. Known communication protocols such as cheap talk can be seen as "doing by talking", in which talk precedes action. Others can be seen as "talking by doing", in which one of the agents has incomplete information and actions speak louder than words.

We will go through how different approaches for learning communication protocols with deep neural networks can help, along with some novel ideas, in the rip of three different papers: one as a baseline and the other two taking part in ICML 2019:

1 ) Learning to Communicate with Deep Multi-Agent Reinforcement Learning paper: identifies messages as communication protocols and places them in a Q-learning scenario in which they are trained and influence action selection. Introduces the DRU, a unit that enriches the training signal and makes communication shareable and trainable among agents.

2 ) Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning paper: introduces the innovation of decentralized learning (previously we had only seen decentralized execution), along with new reward functions, metrics, and topologies to approach the communication challenge.

3 ) TarMAC: Targeted Multi-Agent Communication paper, which explores the benefits of targeted as well as multi-stage communication. The goal is to offer a possible solution to complex collaboration strategies with a custom attention mechanism.

One of the key concepts here is the centralized/decentralized tradeoff between learning and execution that appears across the different papers. The baseline is centralized learning (parameter sharing) with decentralized execution (each agent executes its outputs independently), but this changes depending on the paper and experiment.

Let's get started with the Learning to Communicate with Deep Multi-Agent RL paper from 2016, which offers a first baseline for understanding the use of neural networks in MARL.

1 ) Learning to Communicate with Deep Multi-Agent Reinforcement Learning

Learning to Communicate with Deep Multi-agent Reinforcement Learning takes a step towards agents using machine learning to automatically discover communication protocols in a cooperative setup, and shows what deep learning can offer here: deep neural networks are used to learn communication protocols in multi-agent systems with partial observability, using two different approaches, Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL). The main difference between them is how gradients flow in the learning loop, taking a step towards differentiable communication: while RIAL is end-to-end trainable within each agent, DIAL is end-to-end trainable across agents.

RIAL vs DIAL: trainable per agent vs trainable across all agents with the DRU (discretize/regularize unit) in a multi-agent setup. RIAL: at timestep t, the environment gives an observation o_t to Agent 1, together with another input m_{t-1}, the communication message from the previous timestep. The Q-net values are fed into the action selector, which selects both the environment and communication actions, sending a message m_t to Agent 2 and an action (or set of actions) to the environment. At timestep t+1, Agent 2 receives the message from Agent 1 and its observation from the environment, runs its own action selection, and returns its action to the environment.
  • RIAL. This communication and action selection protocol involves several agents across timesteps t and t+1, in which the messages are trained and used for the action selection process. RIAL is RL-based communication that combines Deep Recurrent Q-Networks with independent Q-learning for action and communication selection, meaning that the network Q^a is split into Q^a_u (for environment actions) and Q^a_m (for communication actions) in order to reduce action selection complexity.

The whole learning loop involves timesteps t and t+1, and gradients flow only through the Q-network of a single agent. RIAL can be configured for parameter sharing, reducing the number of training parameters. However, in RIAL the agents do not give each other feedback about their communication actions.
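To make the split concrete, here is a minimal PyTorch sketch of a RIAL-style recurrent agent with two Q-heads, one over environment actions and one over communication actions, each selected independently (for example with epsilon-greedy). Class and parameter names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class RialAgent(nn.Module):
    """Sketch of a RIAL-style agent: one recurrent Q-network with two heads."""
    def __init__(self, obs_dim, n_env_actions, n_msg_actions, hidden=128):
        super().__init__()
        # The observation is concatenated with the (one-hot) message from t-1.
        self.encoder = nn.Linear(obs_dim + n_msg_actions, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.q_env = nn.Linear(hidden, n_env_actions)  # Q^a_u: environment actions
        self.q_msg = nn.Linear(hidden, n_msg_actions)  # Q^a_m: communication actions

    def forward(self, obs, prev_msg, h):
        x = torch.relu(self.encoder(torch.cat([obs, prev_msg], dim=-1)))
        h = self.rnn(x, h)
        return self.q_env(h), self.q_msg(h), h

def epsilon_greedy(q_values, eps=0.05):
    # Each head's action is picked independently from its own Q-values.
    if torch.rand(()) < eps:
        return torch.randint(q_values.shape[-1], (1,)).item()
    return q_values.argmax(dim=-1).item()
```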

  • DIAL shares the same idea as RIAL but changes how the gradient flow is calculated and executed: gradients are pushed from one agent to another through the communication channel.
The discretize/regularize unit (DRU) is the main difference between RIAL and DIAL; it creates a connection between the messages flowing through the different agents' neural networks.

At a timestep t, the C-net of Agent 1 outputs both the Q-values for the environment actions and the message m^a. Here comes the difference: instead of being fed into the action selector, the message is fed into the discretize/regularize unit DRU(m^a), which regularizes it during centralized learning and discretizes it during decentralized execution: a situation in which several agents are able to learn simultaneously and yet execute their own actions independently.
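A minimal sketch of the DRU as described in the paper: during centralized learning the message is regularized (Gaussian noise plus a sigmoid, so it stays differentiable and gradients can flow between agents through the channel), and during decentralized execution it is discretized. The noise scale used here is just an illustrative value.

```python
import torch

def dru(message, sigma=2.0, training=True):
    """Discretize/Regularize Unit (sketch)."""
    if training:
        # Centralized learning: noisy, sigmoid-squashed, differentiable message.
        return torch.sigmoid(message + sigma * torch.randn_like(message))
    # Decentralized execution: hard, discrete message.
    return (message > 0).float()
```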

2 ) Social Influence as Intrinsic Motivation for Multi-Agent Reinforcement Learning

Introduces the innovation of decentralized learning, where previously we had only seen decentralized execution. This method gives agents an intrinsic reward for having an influence on other agents' actions, creating possible alternatives to events that have already occurred: actions that could have been taken and, had they shown a better result, are rewarded. So in this case communication has a direct influence on MA-MDPs. Ultimately, and at a higher level of abstraction, this paper addresses how influence can have an impact on coordination and communication.

The idea of this approach goes beyond the classic literature of "doing by talking" or "talking by doing"; at a high level, it aims for something like "guessing by observing what the other is doing or might do" or, ultimately, "guessing by observing what the other might have done".

In this case, each agent is equipped with a trained neural network that represents a Model of Other Agents (MOA), in a competitive or cooperative setup. The actions of all agents are combined, and each agent receives its own reward, which may depend on the actions of other agents.

The paper is organized into three different experiments: Basic Influence, Influential Communication, and Modeling Other Agents. This separation gives several different experiments working with two different environments: Cleanup and Harvest.

2A ) Basic Social Influence

In a first baseline influence experiment, an A3C agent is compared to a pruned version of the influence setup, which shows promising results in the long tail. A composed reward (influence reward plus environmental reward) is set up. In this case, a new probability distribution over actions is calculated by sampling counterfactual actions, using centralized training and assuming that influence is unidirectional.

In the figure we can see the purple influencer/speaker agent vs the yellow influenced/listener agent: the influencer, trained with the social influence reward, doesn't move unless there are apples, and the yellow agent is highly influenced by its behavior.

KEY: Basic Influence introduces a reward that combines the extrinsic (environmental) reward with a causal influence reward.
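As a rough sketch of how such a causal influence reward could be computed for one influencer/listener pair (shapes and names are my own assumptions, not the paper's code): the listener's policy conditioned on the influencer's actual action is compared, via KL divergence, against its marginal policy obtained by averaging over counterfactual influencer actions.

```python
import numpy as np

def influence_reward(cond_probs, counterfactual_probs, counterfactual_prior):
    """Sketch of a causal influence reward for one influencer/listener pair.

    cond_probs:           p(a_listener | actual influencer action), shape (A,)
    counterfactual_probs: p(a_listener | counterfactual action) for each of the
                          K counterfactual influencer actions, shape (K, A)
    counterfactual_prior: influencer's own probability of taking each of those
                          counterfactual actions, shape (K,)
    """
    # Marginalize the influencer's action out of the listener's policy.
    marginal = counterfactual_prior @ counterfactual_probs   # shape (A,)
    eps = 1e-8
    # KL divergence between the conditional and marginal listener policies.
    return float(np.sum(cond_probs * np.log((cond_probs + eps) / (marginal + eps))))
```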

2B ) Influential Communication

After the results of the baseline social influence experiment, the message (a discrete communication symbol) is trained alongside the policies. This influential communication protocol works at various levels:

On one hand, two heads are trained with two different policies and value functions: one for the environment and the other as a speculative policy for emitting communication symbols.

The topology of Influential communication, in which both the environment and the discrete message vector of all agents are trained sequentially

In influential communication, the state is fed into a convolutional layer and two fully connected layers. A final LSTM is also fed with the communication message from the previous timestep.

V_m and π_m are trained with a modified immediate reward: a weighted sum of e, the environmental reward, and c, the causal influence reward.

KEY: In influential communication, two different policies are trained, one for the environment and the other for the speculative/reflective communication protocol.
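A rough PyTorch sketch of the two-headed agent described above: a convolutional encoder, two fully connected layers, an LSTM that also consumes the previous-timestep messages, and separate policy/value heads for the environment and for the communication symbols. Layer sizes, input channels, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CommAgent(nn.Module):
    """Sketch of an influential-communication agent with two policy heads."""
    def __init__(self, n_actions, n_symbols, msg_dim, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(3, 6, kernel_size=3), nn.ReLU(), nn.Flatten())
        self.fc = nn.Sequential(nn.LazyLinear(hidden), nn.ReLU(),
                                nn.Linear(hidden, hidden), nn.ReLU())
        # The LSTM also receives the previous-timestep messages of the other agents.
        self.lstm = nn.LSTMCell(hidden + msg_dim, hidden)
        self.pi_env, self.v_env = nn.Linear(hidden, n_actions), nn.Linear(hidden, 1)
        self.pi_msg, self.v_msg = nn.Linear(hidden, n_symbols), nn.Linear(hidden, 1)

    def forward(self, obs, prev_messages, state):
        x = self.fc(self.conv(obs))
        h, c = self.lstm(torch.cat([x, prev_messages], dim=-1), state)
        return (self.pi_env(h), self.v_env(h),   # environment policy and value
                self.pi_msg(h), self.v_msg(h),   # communication policy and value
                (h, c))
```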

A first approach to measuring effective communication might suggest that we should simply measure better performance in terms of task reward, which is fundamentally true at a high level. However, the paper introduces new cognitive metrics for influential communication in order to analyze communication behavior and measure its quality:

  • Speaker consistency [0,1]: consistency or trust in a speaker agent emitting a particular symbol when it takes a particular action. The goal of the metric is to measure how much of a 1:1 correspondence exists between a speaker's action and the speaker's communication message. More precisely, it evaluates the entropy of the probabilities of actions given messages and of messages given actions.
  • Instantaneous coordination (IC) measures how well agents coordinate through communication (see the sketch after this list). It works at two levels:
  • Symbol/action IC measures the mutual information between the influencer's message and the influenced agent's next action. Influence through communication occurs when an agent decides to change its action based on the other agent's message, and in those moments this metric is high.
  • Action/action IC measures the mutual information between the influencer's action and the influenced agent's next action.
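Both IC variants boil down to estimating the mutual information between two discrete variables logged along a trajectory. A minimal sketch of such an estimate (the function and variable names are mine):

```python
import numpy as np
from collections import Counter

def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats from paired samples of two discrete variables.

    For symbol/action IC, xs would be the influencer's messages at time t and
    ys the influenced agent's actions at t+1; for action/action IC, xs would
    be the influencer's actions instead.
    """
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in p_xy.items():
        pxy = count / n
        mi += pxy * np.log(pxy / ((p_x[x] / n) * (p_y[y] / n)))
    return mi
```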

Here are some bullet points and lessons learned from this:

  • Influence is sparse in time
  • Listeners selectively listen to a speaker only when it's beneficial
  • Agents that are the most influenced also achieve a higher individual environmental reward

Measuring these metrics across experiments, the results show that listeners selectively listen to a speaker only when it's beneficial, and that the agents that are most influenced also achieve a higher individual environmental reward. Besides, the communication message m_t should contain information that helps the listener maximize its own environmental reward.

2C ) Influential Communication vs Model of Other Agents

MOA introduces a new topology. Here comes the innovation: achieving independent training by equipping each agent with its own internal Model of Other Agents, removing centralized learning. The MOA is a set of layers that comes after the convolution and predicts all other agents' next actions given their previous ones.

Once trained, it can be used to compute the social influence reward.

MOA learns both a policy and a supervised model that predicts the actions of other agents in the next timestep

As the environment becomes more complex and the communication message changes, the topology of the neural networks might differ, although in this case it is the same.

KEY: Two neural networks compute the environmental policy and the model of the probability of other agents' actions.
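A minimal sketch of what such a MOA head could look like: given features from the shared convolutional encoder plus all agents' previous (one-hot) actions, it predicts every other agent's next action and is trained as a supervised classifier with cross-entropy. Shapes and names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MOAHead(nn.Module):
    """Sketch of a Model of Other Agents head."""
    def __init__(self, feat_dim, n_other_agents, n_actions, hidden=128):
        super().__init__()
        self.n_other, self.n_actions = n_other_agents, n_actions
        self.net = nn.Sequential(
            nn.Linear(feat_dim + (n_other_agents + 1) * n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_other_agents * n_actions))

    def forward(self, conv_features, prev_actions_onehot):
        # prev_actions_onehot: previous actions of all agents (self included), flattened.
        logits = self.net(torch.cat([conv_features, prev_actions_onehot], dim=-1))
        return logits.view(-1, self.n_other, self.n_actions)

def moa_loss(logits, next_actions):
    # next_actions: integer actions the other agents actually took, shape (batch, n_other).
    return nn.functional.cross_entropy(logits.flatten(0, 1), next_actions.flatten())
```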

This paper is great, not only for the novel ideas it brings, but also for how it communicates the evolution of the experiments, which lets us learn more about the research approach. The results show better long-tail performance for the influential, model-based communication protocol.

3 ) TarMAC: Targeted Multi-Agent Communication

In this paper, a cooperative multi-agent setting is set up in which an effective communication protocol is key. Focusing on targeted communication with deep reinforcement learning, the agents learn targeted interactions (what messages to send and whom to send them to), enabling more flexible collaboration strategies in complex environments.

By targeted communication, the paper means directing certain messages to specific recipients: agents learn both what messages to send and whom to send them to. This communication is learned implicitly as a result of end-to-end training with a task-specific team reward. The difference from the previous paper is that agents communicate via continuous vectors rather than discrete symbols.

At every timestep, each agent receives an input in the form of an observation (w_t) and an aggregated continuous message (c_t), and predicts an environment action and a targeted communication message (m_t). The messages of the different agents are aggregated into a single incoming message per agent.

Targeted, Multi-Stage Communication

The multi-stage communication protocol proposes an attention mechanism. Each agent's message consists of two parts: a signature k, used to encode agent-specific information, and a value v, which contains the actual message. Besides, a query vector q is predicted from the receiver's hidden state.

The signatures and queries are combined to obtain an attention weight for each value vector. The resulting aggregated message is processed by the receiver.
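A minimal sketch of this targeted aggregation step, assuming scaled dot-product attention over the other agents' signatures (function and variable names are my own):

```python
import torch
import torch.nn.functional as F

def aggregate_messages(queries, keys, values):
    """Sketch of targeted message aggregation via soft attention.

    queries: (n_agents, d_k) query predicted from each receiver's hidden state
    keys:    (n_agents, d_k) signature attached to each sender's message
    values:  (n_agents, d_v) the actual message content from each sender
    """
    d_k = queries.shape[-1]
    # Each receiver scores every sender's signature against its own query...
    scores = queries @ keys.T / d_k ** 0.5    # (n_receivers, n_senders)
    attention = F.softmax(scores, dim=-1)
    # ...and takes the attention-weighted sum of the values as its incoming message c_t.
    return attention @ values                 # (n_receivers, d_v)
```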

Thanks for reaching this point!

This Medium article has been written as part of my ICML 2019 papers rip list. If you want to add something, or if you are attending and want to discuss MARL, ping me on Twitter! :)
