Interactive Narrative Control: Safety and Alignment of Language Agents

This blog post presents the Mempathy video game as a Safety and Alignment opportunity, together with the results and lessons learnt from implementing controlled language generation with Plug and Play Language Models (PPLM) for NPC design. The results show that safe and aligned conversation in narrative games goes beyond controlling language models and requires active design and human supervision, which makes Mempathy a worked example of a video game for Alignment. It may be useful because it offers an example of designing companionship in NPCs from the game design perspective and presents results from implementing machine learning techniques in NPC players.

Fig 1. In Mempathy, the player guides the conversation with an NPC. Large Language Models with PPLM give consistency and fluency to the conversation; Mempathy’s gameplay keeps the conversation aligned and safe.

Index Terms — Narrative and Interactive entertainment, Game Design, Alignment, Interaction, Safety.

Mempathy [1] is a Safety and pluralistic alignment experiment via Social Choice Theory [2][3] in the form of a video game narrative experience, in which a human player builds a conversation with an agent that helps the player change their relationship with anxiety and overcome unrealistic standards of perfection. The winning state and metagame reward [4] are defined by a feeling of advancement and companionship around this mental health topic [5][6]. The idea of progress is supported in the art by watercolours in ascending shades of blue, and in the gameplay by discovering a personalized conversation, unique each time, across the chapters of the game, which follow different skeletons for intelligent interactive narrative generation.

AI safety [7] can be divided into three main areas: Assurance, which monitors and controls system activity; Specification, which defines the purpose of the system; and Robustness, which designs the system with principles and techniques to withstand perturbations. In Mempathy, Assurance, Specification and Robustness are aligned with NPC and gameplay design. Robustness is addressed in the Large Language Model (LLM) implementation, which takes the NPC design into account for the language model. Therefore, all these AI Safety technical principles are sustained by game design principles, and two of them are also supported by LLM control.

Assurance: Game Design

From the game design perspective, the high-level goal of the game is to align the conversation between a human and an NPC around a specific topic. The game proposes a symbolic representation of language [8], with different skeletons and conversational strategies acting as an interactive mechanic inside the video game. Unrestricted dialogue models remain hard to evaluate [9], so this structure is conceived as a process of integration and consensus-seeking for AI alignment [10], capturing the cooperative interactivity [11] between the human and the agent.

Fig 2. Skeletons that define the narrative mechanics studied in the experiments: Skeleton A. In Scene 1, the player unlocks the conversation by clicking on the StarObjects, choosing either to open up to or accept the statement (Op) or to deny or not accept it (Cp), which leads to Scene 2 or Scene 3. The NPC shows several responses and the player clicks on the constellation that feels most aligned with the direction in which they want to take the conversation. This form is bound to sentences of 8–12 words.

The gameplay follows this structure. First, the player unlocks a conversation through clickable objects across a series of blue watercolour scenes, making several choices that correspond to the conversation’s model, drawn as constellations. Second, the NPC acts as a companion and responds to the player following different conversational strategies and skeletons (Fig. 2): confirming or denying the narrative, responding to the player’s sentences, and offering an overall conclusion to the conversation at hand. Finally, several NPC narrative outputs are shown and the player selects the one most aligned with the direction they wish to follow.
The conversational strategies give diversity to the conversation, and the generated output acts as input for the future conversation generated by the LLM.
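As a rough illustration of this loop, the sketch below encodes Skeleton A as a small transition table. The scene names, the mapping of Op to Scene 2 and Cp to Scene 3, and all helper names are assumptions made for the example; they are not taken from the game's source code.

```python
from typing import Dict, List

# Hypothetical encoding of Skeleton A. Assumption: Op leads to Scene 2 and
# Cp leads to Scene 3; the original figure may map them differently.
SKELETON_A: Dict[str, Dict[str, str]] = {
    "scene_1": {"Op": "scene_2", "Cp": "scene_3"},
    "scene_2": {},  # treated as terminal in this sketch
    "scene_3": {},
}

MIN_WORDS, MAX_WORDS = 8, 12  # sentence length bound stated for Skeleton A


def next_scene(scene: str, choice: str) -> str:
    """Return the scene reached from `scene` when the player picks `choice`."""
    return SKELETON_A[scene][choice]


def within_length_bound(sentence: str) -> bool:
    """Check that a candidate NPC response has between 8 and 12 words."""
    return MIN_WORDS <= len(sentence.split()) <= MAX_WORDS


def build_next_prefix(history: List[str], selected_response: str) -> str:
    """The response the player selects is appended to form the next LLM prefix."""
    return " ".join(history + [selected_response])
```

The point of the table is that the player's choice, not the language model, decides which branch of the narrative is followed; the model only proposes candidates within each scene.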

Specification: Designing Companionship for language-oriented games

The NPC’s overall purpose is defined by an empathetic response towards anxiety and by helping the player overcome unrealistic standards of perfection. Therefore, all generated narratives that reinforce or lead to the opposite are considered misaligned under the minimalist approach to intent alignment with human values, which focuses on avoiding catastrophic outcomes [12] [13].

The NPC develops under two principles [14] that support the development of the character through the game and its interaction with the player. They offer a methodology, from the game design perspective, for the behaviour alignment problem: how do we create agents that behave in accordance with the designer’s intention? The first principle is personhood, defined as the overall impression that the NPC is an independent entity inside the game. In the video game it is reflected by the NPC having its own motivations towards the player (offering encouragement, acceptance and empathy, and beneficially pushing towards growth), by the presence of animated eyes inside the game, and by the NPC using the same representation of the world as the player does to guide the conversation.

The second principle is bonding: as shared experiences build a deep sense of connection, one of the game’s main objectives is to create a bond between the player and the NPC. A key challenge here is to overcome factors that could weaken that bond, such as superficial and incoherent responses, repetitive dialogue, or unappealing AI behaviour. In previous works [15] [16], the controlled generated texts showed some diversity, but the output remained either limited in diversity or hard to scale. LLMs can offer stochasticity and nurture the NPC’s personality, offering diverse gameplay and a way to experiment with how they evolve towards a specific conversation. However, given the nature of the conversation and the constraints of the game itself, consistency, fluency, Alignment and Safety are key for the narrative creation of the NPC.

Plug and Play Language Models, from now on PPLM [17], combine a large, pre-trained Language Model and an Attribute Model. From the game design perspective, the key is to define the attribute model as a bag of words (from now on, BoW) that is aligned, consistent and robust with respect to the NPC’s motivations. The words that encode those motivations become the NPC’s vocabulary, which offers a method to control the topic and sentiment of the generated language.
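To make the BoW attribute model concrete, here is a minimal sketch of how such a model can score a next-token distribution: the attribute log-likelihood is the log of the probability mass that falls on BoW tokens, which is the BoW formulation used in the PPLM paper. The toy vocabulary, the BoW words and the probabilities are invented for illustration and are not Mempathy's actual word list.

```python
import numpy as np


def bow_log_likelihood(next_token_probs, bow_ids):
    """log p(a | x) for a BoW attribute: log of the probability mass on BoW tokens."""
    probs = np.asarray(next_token_probs, dtype=float)
    return float(np.log(probs[list(bow_ids)].sum()))


# Toy vocabulary and BoW (illustrative assumption, not the game's vocabulary).
vocab = ["you", "are", "calm", "enough", "fail", "grow"]
bow_ids = [vocab.index(w) for w in ("calm", "enough", "grow")]

# A made-up next-token distribution over the toy vocabulary.
probs = np.array([0.30, 0.25, 0.10, 0.05, 0.20, 0.10])
score = bow_log_likelihood(probs, bow_ids)  # log(0.10 + 0.05 + 0.10)
```

A higher score means the language model is already likely to pick a word that encodes the NPC's motivations; PPLM nudges generation so that this score goes up.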

Fig 3. NPC and GPT-2 PPLM model. Illustration by Leyre Granero, inspired by PPLM’s.

Robustness: What are Plug and Play Models and why use them for Mempathy?

PPLM combines a large, pre-trained LLM and an attribute model, an easy-to-train discriminator, that guides text generation without any further training of the LLM, allowing flexible controlled text generation while maintaining fluency.

The key advantage of PPLM is that very small, custom attribute models, P(a | x), may be combined with powerful, general pre-trained language models, P(x), to create cheap but still powerful conditional generative models, P(x | a). PPLM uses the cached history matrix H_t coming from its Transformer architecture [18] to generate x_{t+1}, the next word in the sentence under the probabilistic model, given x_t. The innovation of PPLM consists in shifting the history H_t in the direction of the sum of two gradients: one towards a higher likelihood of the attribute under the attribute model, and one towards a higher likelihood under the unmodified Language Model. This method allows control of the generated text without modifying the model architecture. In Mempathy, attribute models are defined by a Bag of Words (BoW) that helps shape the direction of these gradients. PPLM has been used for Mempathy because the attribute model design can be aligned with the NPC’s motivations, resulting in a controlled narrative aligned with the overall NPC purpose.
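The shift of the history can be illustrated with a toy example. The sketch below is a strong simplification and not the PPLM implementation: it replaces the Transformer with a single linear output layer `W`, uses one hidden vector `h` in place of the cached history, and follows only the attribute gradient (the language-model term is omitted). All names and sizes are assumptions; `step_size` plays the role of the strength factor.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def bow_log_prob(h, W, bow_ids):
    """log p(a | x): log of the probability mass on BoW tokens given state h."""
    p = softmax(W @ h)
    return np.log(p[bow_ids].sum())


def attribute_gradient(h, W, bow_ids):
    """Analytic gradient of log p(a | x) with respect to the hidden state h."""
    p = softmax(W @ h)
    q = np.zeros_like(p)
    q[bow_ids] = p[bow_ids] / p[bow_ids].sum()  # p renormalised over the BoW
    return W.T @ (q - p)


def shift_hidden_state(h, W, bow_ids, step_size=0.03, num_steps=3):
    """Take a few normalised gradient-ascent steps on the attribute likelihood."""
    h = h.copy()
    for _ in range(num_steps):
        g = attribute_gradient(h, W, bow_ids)
        h += step_size * g / (np.linalg.norm(g) + 1e-10)
    return h


rng = np.random.default_rng(0)
vocab_size, hidden_size = 50, 16                 # toy sizes (assumption)
W = rng.normal(scale=0.3, size=(vocab_size, hidden_size))
h = rng.normal(size=hidden_size)
bow_ids = np.array([3, 7, 11, 19, 23])           # hypothetical BoW token ids

before = bow_log_prob(h, W, bow_ids)
after = bow_log_prob(shift_hidden_state(h, W, bow_ids), W, bow_ids)
```

After the shift, the BoW tokens receive more probability mass than before, while the weights `W` are never modified. That is the sense in which the text is controlled without retraining or changing the model.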

Illustration of how the attribute model works inside the transformer architecture, as proposed in the PPLM paper, in three phases, with results from the Mempathy experiments. In Step 1, a forward pass is performed through the language model to compute the likelihood of a desired attribute using an attribute model that predicts p(a | x). In Step 2, a backward pass updates the internal latent representations of the LM, using gradients from the attribute model, to increase the likelihood of the passage having the desired attribute. In Step 3, a new distribution over the vocabulary is generated from the updated latents and the current token.


Some results of the Mempathy attribute model that correspond to the NPC specification principles of Skeleton A. The controlled attribute model and the prefix are colored and bracketed, and words in the BoW that are directly optimized for are highlighted brightly.

All experiments use a GPT-2 model fine-tuned with the Hugging Face transformers library on the CMU Book Summary Dataset [19], which contains 16559 book summaries. The model attained a loss of 2.46 and a perplexity of 11.70. For each experiment, 300 samples were generated to assess diversity, consistency and alignment with respect to the NPC’s motivations. In the first set of experiments (Tables I and II), the strength factor and the BoW length were evaluated, following the generative evaluation benchmarked in the PPLM paper. A total of 24 experiments with different prefixes, following the skeletons in Fig. 2, were run, for a total of 7200 generated sentences. They led to the conclusion that the optimal strength factors are 0.04 and 0.03, and that the best BoW length for the attribute model is 2000 words rather than the shorter one of 143 words, using as criteria generation diversity and a more diverse and consistent use of the BoW. Note that even though consistency and fluency are key, the evaluations also took into account the automatic sentiment analysis from transformers and human evaluation regarding alignment principles.
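The reported loss and perplexity are consistent with each other, and so are the sample counts. The short check below assumes the loss is the usual per-token cross-entropy in nats, so that perplexity is its exponential; the variable names are only illustrative.

```python
import math

eval_loss = 2.46                   # reported loss, assumed to be in nats per token
perplexity = math.exp(eval_loss)   # perplexity is the exponential of the loss

num_experiments = 24
samples_per_experiment = 300
total_samples = num_experiments * samples_per_experiment

print(f"perplexity: {perplexity:.2f}")                # 11.70
print(f"total generated sentences: {total_samples}")  # 7200
```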

The PPLM paper finds that increasing the probability of generating the words in the bag also increases the probability of generating related topical words not in the BoW. However, experiments with strength factors from 0.05 to 0.1 produce incoherent sentences with significant sequential repetition of words. Experiments also show that using a larger BoW increases the probability of topics such as career, wealth and relationships, which raises the question…

Might generated content on topics such as career, wealth, relationships and religion be considered manipulative?

In Table I, the No of words column shows the words from the BoW that the attribute model was able to produce.

Note that as the strength factor decreases, so does the number of words from the BoW that the attribute model is able to generate. In addition, the use of negative or positive prefixes influences the positive or negative tone of the sentences, shown in the columns No positive and No negative.
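A count like the one in the No of words column could be computed along these lines. This is a sketch of one possible procedure, not the script used in the experiments; the tokenisation rule, the BoW and the sample sentence are all invented for the example.

```python
import re


def bow_words_used(sentence, bow):
    """Return the distinct BoW words that appear in a generated sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sorted(set(tokens) & set(bow))


# Hypothetical BoW and generated sample, for illustration only.
bow = {"calm", "growth", "acceptance", "enough"}
sample = "You are enough, and calm growth takes time."

used = bow_words_used(sample, bow)
print(len(used), used)
```

Counting distinct words rather than total occurrences avoids rewarding the sequential repetition that appears at higher strength factors.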

In Table II, human evaluation takes into account the behavioural issues described in [19]: incoherent sentences show excessive repetition of words defined in the attribute model, harmful sentences are those that are counterproductive from the mental health perspective, and deceptive sentences are those that explore topics unrelated to mental health. The results show that 7.33 percent of the generated sentences fall into these categories. Incoherent sentences can be pruned automatically. However, the 1.5 percent of harmful sentences remains a challenge that the attribute model did not resolve. Some recent early experiments used the Perspective API to try to detect toxicity in these sentences, without success.
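One simple way to prune incoherent sentences automatically is to reject any sample in which the same word is repeated too many times in a row. The heuristic and its threshold below are assumptions chosen for illustration; the post does not specify which pruning rule was used.

```python
def max_consecutive_repeat(sentence):
    """Length of the longest run of the same word repeated back to back."""
    words = sentence.lower().split()
    longest = current = 1 if words else 0
    for prev, word in zip(words, words[1:]):
        current = current + 1 if word == prev else 1
        longest = max(longest, current)
    return longest


def is_incoherent(sentence, max_run=2):
    """Flag a sentence whose longest repeated run exceeds `max_run` (hypothetical threshold)."""
    return max_consecutive_repeat(sentence) > max_run


# Invented samples: the first shows the sequential repetition described above.
samples = [
    "You are calm calm calm calm today",
    "You are allowed to rest and grow at your pace",
]
kept = [s for s in samples if not is_incoherent(s)]
```

A rule like this only catches repetition. Harmful sentences are fluent and on-topic, which is why they cannot be filtered the same way and still need human review.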


Video games can be more than entertainment tools: they can become AI technical Safety experiments and implement AI Alignment principles in their game design. In Mempathy, Specification and Assurance are given purely by game design principles, and Robustness is given by Large Language Model control and by the NPC design motivation embedded into the PPLM attribute model. Including humans in game mechanics to produce aligned and safe conversations about key topics might drive progress in testing misspecification.

PPLM is presented as a technical tool for Intent Alignment with respect to the NPC’s motivations. It offers diversity for narrative creation and an approach that has attained consistency and fluency in text generation. As a tool for automated narrative creation, the empirical implementation might involve fine-tuning large language models on the domain area to produce a consistent and fluent narrative, with attribute models that align with the NPC’s motivations. The parameter search over strength factor and BoW length was key to finding a suitable model for the Mempathy proposal, showing empirically that 0.04 and 0.03 are consistent strength factors and that 2000 words is a good BoW length for an NPC. Future developments might include implementing more Safety and Alignment techniques and language control techniques to tackle the 1.5 percent of misaligned and harmful sentences generated, as well as other ideas that might come from discussion with the AI and gaming community.


The author would like to thank the OpenAI and Uber AI teams for releasing GPT and PPLM and for shaping the tools that made this project possible.

Thanks to Sumanth Dathathri, one of the PPLM authors, for the guidance and the references that helped evolve this work and tackle the misaligned results shown in the experiments.

A special mention goes to the 80,000 Hours program and Richard Ngo for the mentorship regarding career path and the AI research community, and to the organizers of the NeurIPS 2020 Workshop ’When Language Meets Games’ for the discussions that led to this work. Thanks to Jorge Barroso Carmona and Beatriz Alonso Carvajales for the support and discussions about Mempathy, and to Leyre Granero for the drawings shown in Fig. 1 and Fig. 3.


[1] G. Parreno, ’Mempathy prototype demo using Imitation Learning’, presented at the NeurIPS 2020 Workshop ’When Language Meets Games’, 2020.

[2] A. Dafoe et al., ’Open Problems in Cooperative AI’, 2020.

[3] I. Gabriel, ’Artificial Intelligence, Values and Alignment’. Social Choice Theory. 2018.

[4] G. Skaff Elias, R. Garfield, and K. Robert Gutschera, Characteristics of Games. Chapter 7, Superstructure: metagame and metagame rewards. 2012.

[5] A. J. Stapleton, ’Serious Games: Serious Opportunities’, 2004.
[6] Young and Well Cooperative Research Centre Gaming Research Group, ’Videogames and Wellbeing: A Comprehensive Review’, 2013.

[7] P. A. Ortega, V. Maini, ’Building safe artificial intelligence: specification, robustness and assurance’, 2018.
[8] J. Orkin, ’Symbolic Representation of Game World State: Toward Real-Time Planning in Games’, 2004.
[9] R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, and J. Pineau, ’Towards an automatic Turing test: Learning to evaluate dialogue responses’, 2017.

[10] G. Skaff Elias, R. Garfield, and K. Robert Gutschera, Characteristics of Games. Chapter 2, Multiplayer Games, pp. 65–66, 2012.

[11] P. Christiano, ’Clarifying AI alignment’, AI Alignment Forum, 2018.
[19] I. Gabriel, ’Artificial intelligence, values and alignment’, 2020.

[12] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, G. Irving, ’Alignment of Language Agents’, 2020.

[13] Z. Hiwiller, F. Sail, ’Group Report: Designing Feelings of Companionship with Non-Player Characters’. Game Design Think Tank Project Horseshoe, 2018.
[14] G. Parreno, ’Benchmarking Imitation and Reinforcement Learning for NPC players in casual video games’. Experimental AI in Games, an AIIDE 2020 Workshop, 2020.
[16] G. Parreno, ’Benchmarking Imitation and Reinforcement Learning for serious language-oriented video games’. Wordplay: When Language Meets Games, NeurIPS, 2020.
[17] S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, R. Liu, ’Plug and Play Language Models: A Simple Approach to Controlled Text Generation’, ICLR 2020.
[18] A. Vaswani et al., ’Attention Is All You Need’, 2017.


