PyGAAMAS
Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.
Dictator Game
The dictator game is a classic game used to analyze players' personal preferences. In this game, there are two players: the dictator and the recipient. Given two allocation options, the dictator chooses one allocation, and the recipient must accept the option chosen by the dictator. The dictator's choice is therefore considered to reflect her personal preferences.
Default preferences
The dictator's choice reflects the LLM's preferences.
The figure below presents a violin plot depicting the share of the total amount ($100) that the dictator allocates to themselves for each model. The temperature is fixed at 0.7, and each experiment was conducted 30 times. The median share kept by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is 50. When we prompt the models to generate a strategy in the form of an algorithm implemented in the Python programming language, rather than generating an action, all models divide the amount fairly except for GPT-4.5, which takes approximately 70% of the total amount for itself. It is worth noting that, under these standard conditions, humans typically keep an average of around $80 (Forsythe et al. 1994). Interestingly, the variability observed across executions of the same LLM is comparable to the diversity of behaviors observed in humans. In other words, this intra-model variability can be used to simulate the diversity of human behaviors arising from different experiences, preferences, or contexts.
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in Simple Bargaining Experiments. Games and Economic Behavior, 6(3), 347-369.
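For illustration, a strategy generated in this setting that divides the amount fairly might look like the following minimal sketch (a hypothetical example, not the actual output of any of the evaluated models).

```python
def dictator_strategy(total_amount: int = 100) -> dict:
    """Hypothetical generated strategy: split the amount evenly."""
    my_share = total_amount // 2            # the dictator keeps half
    other_share = total_amount - my_share   # the recipient gets the remainder
    return {"dictator": my_share, "recipient": other_share}

# dictator_strategy() -> {"dictator": 50, "recipient": 50}
```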
The figure below represents the evolution of the share of the total amount ($100) that the dictator allocates to themselves as a function of temperature for each model, along with the 95% confidence interval. Each experiment was conducted 30 times. It can be observed that temperature influences the variability of the models' decisions. At low temperatures, choices are more deterministic and follow a stable trend, whereas at high temperatures, the diversity of allocations increases, reflecting a more random exploration of the available options.
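The mean share and its 95% confidence interval at each temperature can be computed from the 30 repetitions in the usual way. The sketch below is a minimal version, assuming the per-run shares are already collected in a list and using a normal approximation for the interval.

```python
import statistics

def mean_and_ci95(shares: list[float]) -> tuple[float, float]:
    """Return the mean share and the half-width of its 95% confidence interval."""
    mean = statistics.mean(shares)
    sem = statistics.stdev(shares) / len(shares) ** 0.5  # standard error of the mean
    return mean, 1.96 * sem                              # normal approximation

# Example: mean, half_width = mean_and_ci95(shares_at_temperature_07)
```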
Preference alignment
We define four preferences for the dictator:
- She prioritizes her own interests, aiming to maximize her own income (selfish).
- She prioritizes the other player's interests, aiming to maximize their income (altruistic).
- She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
- She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).
We consider four allocation options in which money can be lost in the division, each corresponding to one of the four preferences:
- The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
- The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruistic).
- The dictator keeps 400, the other player receives 300, resulting in a loss of 300 (utilitarian).
- The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian).
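The mapping between allocation options and preferences can be checked mechanically: each preference corresponds to a utility function, and the associated option is the one that maximizes it. The sketch below hard-codes the four options listed above.

```python
# Allocation options: (dictator's share, other player's share)
OPTIONS = {
    "selfish":     (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# Utility of an allocation (d, o) under each preference
UTILITIES = {
    "selfish":     lambda d, o: d,           # maximize own income
    "altruistic":  lambda d, o: o,           # maximize the other's income
    "utilitarian": lambda d, o: d + o,       # maximize total income
    "egalitarian": lambda d, o: min(d, o),   # maximize the minimum income
}

for preference, utility in UTILITIES.items():
    best = max(OPTIONS, key=lambda name: utility(*OPTIONS[name]))
    print(preference, "->", best)  # each preference selects its namesake option
```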
The following table shows the accuracy of the dictator's decision for each model and preference. The temperature is fixed at 0.7, and each experiment was conducted 30 times.
Model | SELFISH | ALTRUISTIC | UTILITARIAN | EGALITARIAN |
---|---|---|---|---|
gpt-4.5 | 1.0 | 1.0 | 0.5 | 1.0 |
llama3 | 1.0 | 0.9 | 0.4 | 0.73 |
mistral-small | 0.4 | 0.93 | 0.76 | 0.16 |
deepseek-r1 | 0.06 | 0.2 | 0.76 | 0.03 |
Bad decisions can be explained either by arithmetic errors (e.g., wrongly concluding that 500 + 100 > 400 + 300) or by misinterpretations of the preferences (e.g., 'I'm choosing to prioritize the common interest by keeping a relatively equal split with the other player').
This table can be used to evaluate the models based on their ability to align with different preferences. GPT-4.5 exhibits strong alignment across all preferences except for utilitarianism, where its performance is moderate. Llama3 demonstrates a strong ability to align with selfish and altruistic preferences, with moderate alignment for egalitarian preferences and lower alignment for utilitarian preferences. Mistral-Small shows the best alignment with altruistic preferences, while maintaining a more balanced performance across the other preferences. DeepSeek-R1 is most capable of aligning with utilitarian preferences, but performs poorly in aligning with the other preferences.
Ring-network game
A player is rational if she plays a best response to her beliefs. She satisfies second-order rationality if she is rational and also believes that others are rational. In other words, a second-order rational agent not only considers the best course of action for herself but also anticipates how others make their decisions.
The experiments conducted by Kneeland (2015) demonstrate that 93% of the subjects are rational, while 71% exhibit second-order rationality.
Kneeland, T. (2015). Identifying Higher-Order Rationality. Econometrica, 83(5), 2065-2079. DOI: 10.3982/ECTA11983
Ring games are designed to isolate the behavioral implications of different levels of rationality. To assess players’ first- and second-order rationality, we consider a simplified version of the ring-network game. This game features two players, each with two available strategies, where both players aim to maximize their own payoff. The corresponding payoff matrix is shown below:
Player 1 \ Player 2 | Strategy A | Strategy B |
---|---|---|
Strategy X | (15,10) | (5,5) |
Strategy Y | (0,5) | (10,0) |
If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any belief Player 2 may hold). If Player 1 is rational, she can choose either X or Y: X is the best response if she believes Player 2 will play A, and Y is the best response if she believes Player 2 will play B. If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X. This is because Player 1, believing that Player 2 is rational, must also believe that Player 2 will play A, and since X is the best response to A, Player 1 will choose X.
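This reasoning can be verified mechanically from the payoff matrix shown above. The sketch below checks that Strategy B is strictly dominated for Player 2 and that X is Player 1's best response to A.

```python
# Payoff matrix: PAYOFFS[(p1_action, p2_action)] = (p1_payoff, p2_payoff)
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

def p2_strictly_dominated(action: str, by: str) -> bool:
    """True if `action` yields Player 2 a strictly lower payoff than `by`
    against every Player 1 action."""
    return all(PAYOFFS[(a1, action)][1] < PAYOFFS[(a1, by)][1] for a1 in ("X", "Y"))

def p1_best_response(p2_action: str) -> str:
    """Player 1's best response to a believed Player 2 action."""
    return max(("X", "Y"), key=lambda a1: PAYOFFS[(a1, p2_action)][0])

assert p2_strictly_dominated("B", by="A")   # a rational Player 2 plays A
assert p1_best_response("A") == "X"         # so a second-order rational Player 1 plays X
```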
We establish three types of belief:
- implicit belief: the optimal action must be inferred from the natural language description of the payoff matrix;
- explicit belief: the prompt additionally points out that Strategy B is strictly dominated by Strategy A for Player 2;
- given belief: the optimal action is explicitly stated in the prompt.
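The exact prompts used in the experiments are not reproduced here; the sketch below only illustrates, with hypothetical wording, how the three belief conditions differ in what the prompt reveals to the evaluated player.

```python
# Hypothetical prompt fragments illustrating the three belief conditions.
# The wording is illustrative only, not the prompts actually used in the experiments.
PAYOFF_DESCRIPTION = (
    "If you play X and the other player plays A, you get 15 and they get 10; ..."
)

BELIEF_PROMPTS = {
    # implicit: only the payoff matrix in natural language.
    "implicit": PAYOFF_DESCRIPTION,
    # explicit: the dominance argument about the opponent is spelled out.
    "explicit": PAYOFF_DESCRIPTION
    + " Note that Strategy B is strictly dominated by Strategy A.",
    # given: the optimal action is stated directly.
    "given": PAYOFF_DESCRIPTION + " Your optimal action is X.",
}
```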
Player 2
The models evaluated are GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1. The results below indicate how well each model performs under each belief type.
Model | Given | Explicit | Implicit |
---|---|---|---|
gpt-4.5 | 1.00 | 1.00 | 1.00 |
mistral-small | 1.00 | 1.00 | 0.87 |
llama3 | 1.00 | 0.90 | 0.17 |
deepseek-r1 | 0.83 | 0.57 | 0.60 |
GPT-4.5 achieves a perfect score across all belief types, demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition. Mistral-Small consistently outperforms the other open-weight models across all belief types. Its strong performance with implicit belief indicates that it can effectively deduce the optimal action from the payoff matrix description. Llama3 performs well with a given belief but significantly underperforms with an implicit belief, suggesting it may struggle to infer optimal actions solely from natural language descriptions. DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs, indicating it may not be as good a candidate for simulating rationality as the other models.
Player 1
In order to adjust the difficulty of taking the optimal action, we consider four versions of the player's payoff matrix:
- a. the original setup;
- b. the difference in payoffs is reduced;
- c. the expected payoff for the incorrect choice Y is increased;
- d. the expected payoff for the correct choice X is decreased.
Action \ Opponent action (version) | A(a) | B(a) | A(b) | B(b) | A(c) | B(c) | A(d) | B(d) |
---|---|---|---|---|---|---|---|---|
X | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
Y | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
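Assuming Player 2's payoffs are unchanged (so A still dominates B), X remains Player 1's best response to A in every version, and a second-order rational Player 1 should therefore play X throughout. The sketch below encodes Player 1's payoffs from the table above and checks this.

```python
# Player 1's payoffs in the four variants: VARIANTS[version][p1_action][p2_action]
VARIANTS = {
    "a": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 10}},
    "b": {"X": {"A": 8,  "B": 7}, "Y": {"A": 7, "B": 8}},
    "c": {"X": {"A": 6,  "B": 5}, "Y": {"A": 0, "B": 10}},
    "d": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 40}},
}

for version, payoff in VARIANTS.items():
    # A rational Player 2 still plays A, so a second-order rational Player 1
    # plays the best response to A.
    best = max(("X", "Y"), key=lambda a1: payoff[a1]["A"])
    assert best == "X", version
```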
Model | Given (a) | Explicit (a) | Implicit (a) | Given (b) | Explicit (b) | Implicit (b) | Given (c) | Explicit (c) | Implicit (c) | Given (d) | Explicit (d) | Implicit (d) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
gpt-4.5 | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
llama3 | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
mistral-small | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
deepseek-r1 | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
GPT-4.5 achieves perfect performance in the original setup (a) but struggles significantly with implicit belief when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward, it is confused by the altered payoffs. Llama3 demonstrates the most consistent and robust performance, adapting to the different belief types and adjusted payoff matrices. Mistral-Small, while performing well with given and explicit beliefs, faces challenges with implicit belief, particularly in version (d). DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.
Guess the Next Move
In order to evaluate the ability of LLMs to predict the opponent’s next move, we consider a simplified version of the Rock-Paper-Scissors game.
Rules:
- The opponent follows a hidden strategy (repeating pattern).
- The player must predict the opponent’s next move (Rock, Paper, or Scissors).
- A correct guess earns 1 point, and an incorrect guess earns 0 points.
- The game runs for N rounds, and the player's accuracy is evaluated at each round.
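The evaluation loop can be sketched as follows, with a hypothetical `predict` callable standing in for the LLM (the function name and signature are assumptions for illustration, not the repository's actual interface).

```python
import random

MOVES = ("Rock", "Paper", "Scissors")

def play_guessing_game(pattern: list[str], predict, n_rounds: int = 10) -> float:
    """Play one game against an opponent cycling through `pattern`;
    return the average points per round (1 for a correct guess, 0 otherwise)."""
    history, points = [], 0
    for round_index in range(n_rounds):
        opponent_move = pattern[round_index % len(pattern)]
        guess = predict(history)              # hypothetical LLM-backed predictor
        points += int(guess == opponent_move)
        history.append(opponent_move)
    return points / n_rounds

def random_guesser(history):
    """Random baseline: expected average score around 1/3."""
    return random.choice(MOVES)
```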
We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1) in identifying these patterns by calculating the average points earned per round. The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against the three opponent patterns, along with the 95% confidence interval. We observe that the performance of the LLMs, whether proprietary or open-weight, is barely better than that of a random strategy.
Rock-Paper-Scissors
To evaluate the ability of LLMs to predict not only the opponent’s next move but also to act rationally based on their prediction, we consider the Rock-Paper-Scissors (RPS) game.
RPS is a simultaneous, zero-sum game for two players. The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock; and if both players take the same action, the game is a tie. Scoring is as follows: a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the optimal action based on the opponent's move. Since the rules are simple and deterministic, an LLM that correctly predicts the opponent's move can always make the winning choice. RPS therefore serves as a tool to assess an LLM's ability to identify and capitalize on patterns in an opponent's non-random behavior.
For a fine-grained analysis of the ability of LLMs to identify the opponent's patterns, we set up three simple patterns for the opponent:
- the opponent's action remains constant (always R, S, or P, respectively);
- the opponent's actions loop in a 2-step pattern (R-P, P-S, or S-R);
- the opponent's actions loop in a 3-step pattern (R-P-S).
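A minimal sketch of the scoring rule and the three opponent patterns is given below (full move names are used for readability; this encoding is an assumption, not the repository's actual representation).

```python
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}

def score(player_move: str, opponent_move: str) -> int:
    """2 points for a win, 1 for a tie, 0 for a loss."""
    if player_move == opponent_move:
        return 1
    return 2 if BEATS[player_move] == opponent_move else 0

# The three opponent patterns (R = Rock, P = Paper, S = Scissors):
PATTERNS = {
    "constant":    ["Rock"],                        # likewise ["Paper"] or ["Scissors"]
    "2-step loop": ["Rock", "Paper"],               # likewise P-S or S-R
    "3-step loop": ["Rock", "Paper", "Scissors"],
}
```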
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1) in identifying these patterns by calculating the average points earned per round. The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against the three opponent patterns, along with the 95% confidence interval. We observe that the performance of the LLMs is barely better than that of a random strategy.
Authors
Maxime MORGE
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.