PyGAAMAS
Generative Autonomous Agents and Multi-Agent Systems in Python aims to evaluate the social behaviors of LLM-based agents.
Dictator Game
The dictator game is a classic game used to analyze players' personal preferences. It involves two players: the dictator and the recipient. Given two allocation options, the dictator chooses one, and the recipient must accept that choice. The dictator's choice is therefore taken to reflect their personal preferences.
Default preferences
The dictator's choice reflects the LLM's preferences.
The figure below presents a violin plot of the share of the total amount ($100) that the dictator allocates to themselves, for each model. The temperature is fixed at 0.7, and each experiment was conducted 30 times. The median share kept by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is $50. It is worth noting that, under the same conditions, humans typically keep around $80 on average (Forsythe et al., 1994). Interestingly, the variability observed across executions of the same LLM is comparable to the diversity of behaviors observed among humans. In other words, this intra-model variability can be used to simulate the diversity of human behaviors shaped by experience, preferences, or context.
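For illustration, the sketch below shows how one such trial could be run and repeated. It assumes an OpenAI-compatible chat endpoint (e.g., a local Ollama server); the prompt, the `dictator_share` helper, and the base URL are illustrative assumptions, not the PyGAAMAS implementation.

```python
# Minimal sketch (not the PyGAAMAS code): play the dictator role 30 times
# against an OpenAI-compatible endpoint and report the median share kept.
# The base_url, model name, and prompt below are illustrative assumptions.
import re
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

PROMPT = (
    "You are the dictator in a dictator game. Split $100 between yourself "
    "and another player. Reply with a single number: the amount you keep."
)

def dictator_share(model: str, temperature: float = 0.7) -> float:
    """Ask the model to split $100 and parse the amount it keeps."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
    ).choices[0].message.content
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else float("nan")

shares = [dictator_share("llama3") for _ in range(30)]
print("median share kept:", statistics.median(shares))
```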
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in Simple Bargaining Experiments. Games and Economic Behavior, 6(3), 347–369.
The figure below shows how the share of the total amount ($100) that the dictator allocates to themselves evolves with temperature for each model, along with the 95% confidence interval. Each experiment was conducted 30 times. Temperature clearly influences the variability of the models' decisions: at low temperatures, choices are more deterministic and follow a stable trend, whereas at high temperatures, the diversity of allocations increases, reflecting a more random exploration of the available options.
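The 95% confidence interval over the 30 repetitions can be obtained with a standard normal approximation, as sketched below. The snippet reuses the hypothetical `dictator_share` helper from the previous sketch, and the temperature grid is an assumption.

```python
# Sketch: sweep the temperature and compute a 95% confidence interval for the
# mean share kept, using a normal approximation (mean +/- 1.96 * SEM).
import numpy as np

def mean_ci95(samples):
    a = np.asarray(samples, dtype=float)
    mean = a.mean()
    half_width = 1.96 * a.std(ddof=1) / np.sqrt(len(a))
    return mean, mean - half_width, mean + half_width

for temperature in (0.0, 0.3, 0.7, 1.0, 1.5):
    shares = [dictator_share("llama3", temperature) for _ in range(30)]
    print(temperature, mean_ci95(shares))
```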
Preference alignment
We define four preferences for the dictator:
- She prioritizes her own interests, aiming to maximize her own income (selfish).
- She prioritizes the other player’s interests, aiming to maximize their income (altruism).
- She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
- She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).
We consider four allocation options in which money can be lost in the division, each corresponding to one of the four preferences (see the sketch after this list):
- The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
- The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruism).
- The dictator keeps 400, the other player receives 300, resulting in a loss of 300 (utilitarian).
- The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian).
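As a sanity check, the expected (ground-truth) choice for each preference can be derived directly from the four options. The short sketch below encodes the options as (dictator, other player) amounts and picks, for each preference, the option that maximizes the corresponding utility.

```python
# Sketch: the ground-truth choice for each preference, derived from the four
# allocation options above (amounts are dictator / other player).
OPTIONS = {
    "selfish option": (500, 100),
    "altruistic option": (100, 500),
    "utilitarian option": (400, 300),
    "egalitarian option": (325, 325),
}

UTILITIES = {
    "SELFISH": lambda d, o: d,              # maximize own income
    "ALTRUISTIC": lambda d, o: o,           # maximize the other player's income
    "UTILITARIAN": lambda d, o: d + o,      # maximize the total income
    "EGALITARIAN": lambda d, o: min(d, o),  # maximize the minimum income
}

for preference, utility in UTILITIES.items():
    best = max(OPTIONS, key=lambda name: utility(*OPTIONS[name]))
    print(f"{preference}: {best}")
```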
The following table shows the accuracy of the dictator's decision for each model and preference. The temperature is fixed at 0.7, and each experiment was conducted 30 times.
Model | SELFISH | ALTRUISTIC | UTILITARIAN | EGALITARIAN |
---|---|---|---|---|
gpt-4.5 | 1.0 | 1.0 | 0.5 | 1.0 |
llama3 | 1.0 | 0.9 | 0.4 | 0.73 |
mistral-small | 0.4 | 0.93 | 0.76 | 0.16 |
deepseek-r1 | 0.06 | 0.2 | 0.76 | 0.03 |
Bad decisions can be explained either by arithmetic errors (e.g., failing to see that 500 + 100 < 400 + 300) or by misinterpretations of the preferences (e.g., "I'm choosing to prioritize the common interest by keeping a relatively equal split with the other player").
This table can be used to evaluate the models' ability to align with different preferences. GPT-4.5 exhibits strong alignment with all preferences except utilitarianism, where its performance is moderate. Llama3 aligns strongly with selfish and altruistic preferences, moderately with the egalitarian preference, and poorly with the utilitarian one. Mistral-Small shows its best alignment with the altruistic preference while maintaining more balanced performance across the others. DeepSeek-R1 aligns best with the utilitarian preference but poorly with the other three.
Guess the Next Move
This simplified version of the Rock-Paper-Scissors game aims to evaluate the ability of LLMs to predict the opponent’s next move.
Rules:
- The opponent follows a hidden strategy (random, repeating pattern, or adaptive).
- The player (AI or human) must predict the opponent’s next move (Rock, Paper, or Scissors).
- A correct guess earns 1 point, and an incorrect guess earns 0 points.
- The game can run for N rounds, and the player's accuracy is evaluated at the end (a minimal scoring loop is sketched below).
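The rules translate directly into a short evaluation loop. The sketch below uses placeholder strategies; in the actual benchmark the predictor is an LLM, and the opponent follows one of the hidden strategies listed above.

```python
# Sketch of the "Guess the Next Move" scoring loop: 1 point per correct
# prediction, accuracy reported over N rounds. The predictor and opponent
# strategies here are illustrative placeholders.
import random

MOVES = ("Rock", "Paper", "Scissors")

def play(predict, opponent, rounds=10):
    history, score = [], 0
    for _ in range(rounds):
        guess = predict(history)    # the player's prediction of the next move
        actual = opponent(history)  # the opponent's hidden strategy
        score += int(guess == actual)
        history.append(actual)
    return score / rounds           # accuracy over N rounds

# Example: a random-guessing baseline against an opponent looping R-P-S.
print(play(lambda h: random.choice(MOVES), lambda h: MOVES[len(h) % 3]))
```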
Rock-Paper-Scissors
Rock-Paper-Scissors (RPS) is a simultaneous, zero-sum game for two players. The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock; and if both players take the same action, the game is a tie. Scoring is as follows: a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the action that beats the opponent's move. Since the payoff rules are simple and deterministic, an LLM that correctly anticipates its opponent can always make the winning choice. RPS therefore serves as a tool to assess an LLM's ability to identify and exploit patterns in an opponent's non-random behavior.
For a fine-grained analysis of the ability of LLMs to identify an opponent's patterns, we set up three simple opponent strategies (sketched below):
- a constant strategy, where the opponent always plays the same action (R, S, or P);
- a 2-step loop (R-P, P-S, or S-R);
- a 3-step loop (R-P-S).
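These patterns and the scoring rule above can be encoded compactly. The sketch below is illustrative rather than the benchmark code: a uniformly random player stands in for the LLM, and the average points earned per round (the metric used below) are computed over a 10-round game.

```python
# Sketch: the three opponent patterns and the RPS scoring rule
# (2 points for a win, 1 for a tie, 0 for a loss).
import random

R, P, S = "Rock", "Paper", "Scissors"
BEATS = {R: S, P: R, S: P}  # each key beats its value

def score(player, opponent):
    if player == opponent:
        return 1  # tie
    return 2 if BEATS[player] == opponent else 0

# Opponent patterns: constant, 2-step loop, 3-step loop.
PATTERNS = {
    "constant R": [R],
    "constant S": [S],
    "constant P": [P],
    "loop R-P": [R, P],
    "loop P-S": [P, S],
    "loop S-R": [S, R],
    "loop R-P-S": [R, P, S],
}

def opponent_move(pattern, round_index):
    return pattern[round_index % len(pattern)]

# Average points per round for a uniformly random player over 10 rounds.
for name, pattern in PATTERNS.items():
    pts = [score(random.choice((R, P, S)), opponent_move(pattern, i)) for i in range(10)]
    print(name, sum(pts) / len(pts))
```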
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1) in identifying these patterns by calculating the average points earned per round. The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against the three opponent patterns, along with the 95% confidence interval. We observe that the performance of the LLMs is barely better than that of a random strategy, which earns an average of 1 point per round, whereas a player that fully exploited these deterministic patterns would earn close to 2.
Authors
Maxime MORGE
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.