Mempathy: Benchmarking Imitation and Reinforcement Learning for NPC players in serious video games with Unity ML-Agents

gema.parreno.piqueras
8 min read · Nov 9, 2020

Mempathy is a video game narrative experience that transforms the relationship with anxiety. The video game's goal is to offer a reflective experience, and the winning state is defined by a feeling of advancement and companionship towards anxiety. The idea of progress is supported artistically by the watercolour progression and by the discovery of a personalized conversation across the different chapters of the game.

Try Mempathy Demo here

Game Design

The gameplay is developed according to the following structure: first, the player unlocks a conversation through clickable spheres (StarObjects) across a series of blue watercolour scenes, making several choices that correspond to different constellation drawings. Second, the NPC acts as a companion and responds to the player depending on the player's choice, using mechanics similar to the player's.

The NPC is built on two principles that guide the development of the character through the game and its interaction with the player. The first principle is personhood, defined as the overall impression that the NPC is an independent entity. It is reflected in the video game by the NPC having its own motivations towards the player (offering encouragement, acceptance, and empathy), by the presence of animated eyes inside the game, and by using the same gameplay as the player for guiding the conversation. The second principle is bonding: as shared experiences build a deep sense of connection, one of the main objectives of the game is to create a bond between the player and the NPC. One of the key challenges here is to overcome factors that could lead to weaker bonding, such as superficial or incoherent writing and repetitive dialogue. The right choice of machine learning technique has therefore been key, as reinforcement learning is oriented towards a specific goal that, from the game design perspective, serves as a motivation for the NPC.

NPC design fundamentals: personhood and companionship

Reinforcement Learning Environment

Reinforcement learning is a family of machine learning methods whose objective is to maximize the expected discounted cumulative reward, and in which the agent learns the optimal policy by trial and error [3]. Consider a discounted episodic Markov decision process (MDP) defined as a tuple (S, A, γ, P, r), where S is the state space, A is the action space, and γ is the discount rate (the present value applied to future rewards). The agent chooses an action a_t according to the policy π(a_t|s_t) at state s_t. The environment receives the action, produces a reward r_{t+1} = R(s_t, a_t, s_{t+1}), and transitions to the next state s_{t+1} according to the transition probability P(s_{t+1}|s_t, a_t).
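
To make the discounted return concrete, here is a minimal sketch of the quantity the agent maximizes; the reward values and discount rate in the example are illustrative only and not taken from the game.

```python
# Minimal sketch: discounted cumulative reward G_t = sum_k gamma^k * r_{t+k+1}.
# Reward values and gamma below are illustrative, not Mempathy's actual numbers.

def discounted_return(rewards, gamma=0.99):
    """Compute the discounted return of a single episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: a short episode with three rewards.
print(discounted_return([5.0, -0.5, 5.0], gamma=0.99))
```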

Environments are simulated worlds in which the agent takes actions to reach a specific goal. In Mempathy, the objective is to select words based on the player's previous word choices in order to maximize the reduction of the player's anxiety. The observation is based on the n-gram structure used for choosing the word and on the grammatical structure of the word. The action consists of choosing the grammatical structure of the word, making the n-gram prediction, and revealing the word to the player.

Mempathy is a discrete, partially observable environment: at each episode, the agent can click on a series of game objects represented as spheres called StarObjects. Each StarObject has a property attached to the game object corresponding to the word's grammatical structure, and each grammatical structure is connected to a database that contains a list of words. The episode terminates when the agent has clicked on all the stars.
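
The sketch below models this episode structure in a simplified way: each StarObject carries a grammatical tag that indexes into a word database, and the episode ends once every star has been clicked. The class names, word lists and fields are illustrative assumptions; the real environment is implemented in Unity with ML-Agents.

```python
# Hypothetical, simplified model of a Mempathy scene. Names and word lists
# are placeholders, not the project's actual data.
from dataclasses import dataclass, field
from typing import Optional

WORD_DATABASE = {
    "Noun": ["calm", "breath", "night"],
    "Verb": ["rise", "stay", "breathe"],
    "Adverb": ["softly", "slowly"],
}

@dataclass
class StarObject:
    grammatical_tag: str          # e.g. "Noun", "Verb", "Adverb"
    clicked: bool = False
    word: Optional[str] = None    # revealed to the player once clicked

@dataclass
class Scene:
    stars: list = field(default_factory=list)

    def episode_done(self) -> bool:
        # The episode terminates when the agent has clicked on all the stars.
        return all(star.clicked for star in self.stars)
```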

Mempathy Reinforcement Learning Environment

At each timestep t, the agent receives the observation matrix. Each row corresponds to a StarObject, and each column is based on the n-gram structure of choosing the word (Phase 1) and the grammatical structure of the word (Phase 2) for each StarObject. These two phases create the final observation matrix that the agent will process. This entails that if the agent wants to predict the word W_i, then W_{i−(n−1)} has been predicted at timestep t−1 based on W_{i−(n−2)}, …, W_{i−1}, or in probability terms P(W_{i−(n−1)} | W_{i−(n−2)}, …, W_{i−1}), where n corresponds to the number of words and StarObjects (denoted in figure 2 and created by the LookPreviousWord() function). In Phase 2, the observation matrix is completed according to the position each StarObject must hold inside a 1x5 vector, depending on whether it corresponds to a Noun, Verb, Adjective, Preposition or Adverb. The agent then takes an action based on the grammatical structure of the word and the n-gram structure: it selects the corresponding StarObject and predicts W_i based on W_{i−(n−1)}, …, W_{i−1}, or in probability terms P(W_i | W_{i−(n−1)}, …, W_{i−1}), during Phase 1, and then clicks on the StarObject, revealing the word to the player, in Phase 2. For the prototype, W_1 has been set deterministically to an opening adverb or noun.
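
As a rough illustration of the two-phase observation described above, the sketch below builds one row per StarObject by concatenating a simplified n-gram feature (a stand-in for the LookPreviousWord() function) with the 1x5 grammatical one-hot vector. The exact encodings and dimensions are assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of the two-phase observation matrix.
import numpy as np

GRAMMAR = ["Noun", "Verb", "Adjective", "Preposition", "Adverb"]

def grammar_one_hot(tag: str) -> np.ndarray:
    """Phase 2: grammatical structure of the word as a 1x5 one-hot vector."""
    vec = np.zeros(len(GRAMMAR))
    vec[GRAMMAR.index(tag)] = 1.0
    return vec

def build_observation(star_tags, previous_word_index):
    """One row per StarObject: [n-gram feature | 1x5 grammar one-hot]."""
    rows = []
    for i, tag in enumerate(star_tags):
        # Phase 1: toy stand-in for LookPreviousWord() — flags whether this star
        # directly follows the previously predicted word in the n-gram chain.
        ngram_feature = np.array([1.0 if i == previous_word_index + 1 else 0.0])
        rows.append(np.concatenate([ngram_feature, grammar_one_hot(tag)]))
    return np.stack(rows)

obs = build_observation(["Adverb", "Noun", "Verb", "Adverb", "Noun", "Verb"], 0)
print(obs.shape)  # (6, 6) in this toy encoding
```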

The Reinforcement Learning experiment

Training setup with Unity ML agents

PPO is a policy gradient method for Reinforcement Learning which alternates between sampling data from the environment and optimizing a surrogate objective function using stochastic gradient ascent. The method's main innovation is to perform the gradient update over multiple epochs of minibatch updates, which improves the computational performance and learning stability of Reinforcement Learning implementations.
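
For reference, this is the core of PPO's clipped surrogate objective as introduced by Schulman et al.; it is a generic sketch of the formula, not the ML-Agents implementation.

```python
# Minimal sketch of PPO's clipped surrogate objective.
import numpy as np

def ppo_clipped_objective(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """L_clip = mean( min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) )."""
    ratio = np.exp(log_probs_new - log_probs_old)   # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))
```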

Reward design is one of the most challenging and fascinating areas of Reinforcement Learning, as it defines the goal and ultimately shapes the agent's behavior. The reward has been designed with a two-fold structure: on the one hand, it rewards the agent for producing sentences with a coherent grammatical structure (e.g. noun, verb, adverb, noun, verb, adverb). On the other hand, a higher reward is given if, within that grammatical structure, the agent chooses certain words corresponding to types of sentences aligned with the NPC's motivations and emotional states across the scenes, acting from the game perspective as a sort of emotion-driven reward response.

As the reward signal given to the agent in this experiment, a value of 5.0 has been given every time the agent clicked on a coherent continuation of the grammatical structure when creating a sequence, and 5.0 more when the agent ends the episode correctly. Besides, if the agent uses certain kinds of words in certain scenes, another reward of 5.0 is given at the end of the episode. A penalty of 0.5 was given every time the agent clicked on the wrong StarObject. This entails that, for the experiment scenes of 6 words, the maximum cumulative reward per episode is 40.0.
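
A condensed sketch of this reward scheme is shown below; the function name and boolean inputs are illustrative, while the reward magnitudes follow the values just described.

```python
# Hypothetical sketch of the PPO experiment's reward scheme.
def step_reward(clicked_correct_structure: bool,
                episode_ended_correctly: bool,
                used_scene_aligned_words: bool) -> float:
    reward = 0.0
    if clicked_correct_structure:
        reward += 5.0          # coherent grammatical step
    else:
        reward -= 0.5          # wrong StarObject clicked
    if episode_ended_correctly:
        reward += 5.0          # episode completed with a coherent sentence
        if used_scene_aligned_words:
            reward += 5.0      # emotion-driven bonus for scene-aligned words
    return reward

# A 6-word scene therefore tops out at 6 * 5.0 + 5.0 + 5.0 = 40.0 per episode.
```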

To find the optimal agent, 23 experiments with 2M max steps were trained, tuning batch sizes on the order of tens and buffer sizes on the order of hundreds, until finding one with a batch size of 64 and a buffer size of 640 that showed relatively good behaviour. Another 8 experiments with PPO with memory were trained, using sequence lengths of 8 and 16 for saving experience into memory; once the memory holds this number of experiences, the agent updates its networks using all of the experience for 3 epochs. As the neural network architecture, 2 layers with 128 hidden units per layer were used for the experiment.
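
For illustration, the best-performing PPO settings reported above could be written out as follows; the keys mirror an ML-Agents-style trainer configuration but this dict is only a sketch, not the project's actual configuration file.

```python
# Illustrative ML-Agents-style trainer settings for the reported PPO run.
ppo_trainer_settings = {
    "trainer_type": "ppo",
    "max_steps": 2_000_000,
    "hyperparameters": {
        "batch_size": 64,
        "buffer_size": 640,
        "num_epoch": 3,
    },
    "network_settings": {
        "num_layers": 2,
        "hidden_units": 128,
        # The "PPO with memory" variants used sequence lengths of 8 or 16.
        "memory": {"sequence_length": 16, "memory_size": 128},
    },
}
```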

The Imitation Learning experiment

Imitation Learning VS Reinforcement Learning environment

Imitation Learning is based on learning from demonstrations. It uses a system based on the interaction between a teacher that performs the task and a student that imitates the teacher. In the case of Unity and Unity ML-Agents [7], the software used for the experiments, a demonstration recorder is offered in which the human acts as the teacher, providing examples through the recorder. Some variants of Imitation Learning, like behavioral cloning, do not use a reward; in this case, GAIL [8], the variant of Inverse Reinforcement Learning chosen for the experiment, does. In Imitation Learning, the set of experiences regarding words has a significant weight in the results; therefore, only coherent sentences across the two levels of observations were used for training. Two hundred demonstrations per scene were recorded.
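
To give an intuition of how GAIL turns demonstrations into a reward, the sketch below follows the Ho & Ermon formulation: a discriminator D(s, a) is trained to tell demonstration pairs apart from the agent's own pairs, and the agent is rewarded for fooling it. The convention used here (D as the probability that the pair comes from the demonstrations) and the exact reward shape are assumptions; the ML-Agents implementation may differ in details.

```python
# Minimal sketch of a GAIL-style surrogate reward.
import numpy as np

def gail_reward(discriminator_output: float, eps: float = 1e-8) -> float:
    """r(s, a) = -log(1 - D(s, a)); larger when the discriminator believes the
    agent's behaviour matches the recorded demonstrations."""
    return -np.log(1.0 - discriminator_output + eps)
```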

As the reward given to the agent in this experiment, a reward signal of 5.0 has been given every time the agent clicked on a coherent continuation of the grammatical structure when creating a sequence, and a penalty of 0.5 was given every time the agent clicked on the wrong StarObject. This entails that, for the experiment scenes of 6 words, the maximum cumulative reward per episode is 30.0.

To find the optimal agent, 12 experiments with 2M max steps were trained, tuning batch sizes on the order of hundreds and buffer sizes on the order of thousands, until finding one with a batch size of 128 and a buffer size of 2048. As the neural network architecture, 2 layers with 512 hidden units per layer were used for the experiment.
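
As an illustrative counterpart to the PPO settings sketched earlier, the reported GAIL run could be summarized as follows; again this is only a sketch, and the demo path is a placeholder rather than the project's actual file.

```python
# Illustrative ML-Agents-style settings for the reported GAIL run.
gail_trainer_settings = {
    "max_steps": 2_000_000,
    "hyperparameters": {"batch_size": 128, "buffer_size": 2048},
    "network_settings": {"num_layers": 2, "hidden_units": 512},
    "reward_signals": {
        # demo_path is a placeholder for the recorded demonstrations.
        "gail": {"strength": 1.0, "demo_path": "Demos/MempathyScene.demo"},
    },
}
```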

Results and training

Reward Design PPO (Blue) VS GAIL (Red)
Episode Length PPO (Blue) VS GAIL (Red)
Entropy PPO (Blue) VS GAIL (Red)

The best agent is not necessarily defined by solving the episode faster, but by showing an optimal behaviour aligned with the objective of the NPC and the game. In this case, the RL agent showed a certain pause before clicking and revealing certain words (adjectives and adverbs), which could be interesting from the game design perspective. Both agents solved the tasks with timing in accordance with the gameplay conceived for Mempathy.

If the output aims to be deterministic, training an Imitation Learning agent has proven better in terms of giving faster results, so this agent might be recommended under those circumstances. However, the Reinforcement Learning agent showed an interesting behaviour in revealing the adjectives and adverbs, which might offer a point of view that is more suitable in the long term for thinking about NPC behaviour.

If Stars rise… so can you.

Thanks for reaching this point. You can try out the Mempathy demo here!
