
PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems (PyGAAMAS) aims to evaluate the social behaviours of LLM-based agents.

This prototype makes it possible to analyse the potential of Large Language Models (LLMs) for social simulation by assessing their ability to: (a) make decisions aligned with explicit preferences; (b) adhere to principles of rationality; and (c) refine their beliefs to anticipate the actions of other agents. Through game-theoretic experiments, we show that certain models, such as GPT-4.5 and Mistral-Small, exhibit consistent behaviours in simple contexts but struggle with more complex scenarios requiring anticipation of other agents' behaviour. Our study outlines research directions to overcome the current limitations of LLMs.

Consistency

To evaluate the decision-making consistency of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game’s parameters.

In the game, an investor allocates a basket \((p_t^A, p_t^B)\) of 100 points between two assets: Asset A and Asset B. The value of these points depends on two random parameters \((a_t, b_t)\), which determine the monetary return per allocated point.

For example, if \(a_t = 0.8\) and \(b_t = 0.5\), each point assigned to Asset A is worth $0.8, while each point allocated to Asset B yields $0.5. The game is played 25 times to assess the consistency of the investor's decisions.

To evaluate the rationality of the decisions, we use the Critical Cost Efficiency Index (CCEI), a widely used measure in experimental economics and behavioral sciences. The CCEI assesses whether choices adhere to the Generalized Axiom of Revealed Preference (GARP), a fundamental principle of rational decision-making.

If an individual violates rational choice consistency, the CCEI determines the minimal budget adjustment required to make their decisions align with rationality. Mathematically, the budget for each basket is calculated as:

\[ I_t = p_t^A \times a_t + p_t^B \times b_t \]

The CCEI is derived from observed decisions by solving a linear optimization problem that finds the largest \(\lambda\) (where \(0 \leq \lambda \leq 1\)) such that for every observation, the adjusted decisions satisfy the rationality constraint:

\[ a_t \, p_s^A + b_t \, p_s^B \leq \lambda I_t \]

This means that if we slightly reduce the budget (multiplying it by \(\lambda\)), the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory. A low CCEI suggests irrational or inconsistent decision-making.
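
In practice, the largest \(\lambda\) can be found numerically, for instance by a binary search over \(\lambda\) combined with a GARP check at each candidate value. The sketch below is an illustrative implementation of this approach (not the repository's code); it treats the per-point returns \((a_t, b_t)\) as prices and the allocations \((p_t^A, p_t^B)\) as the chosen bundles.

```python
import numpy as np

def satisfies_garp(prices, bundles, efficiency):
    """Check whether the data satisfy GARP at a given efficiency level (lambda)."""
    n = len(prices)
    expenditure = prices @ bundles.T          # expenditure[t, s] = p_t . x_s
    budget = np.diag(expenditure)             # budget[t] = p_t . x_t = I_t
    # Direct revealed preference at efficiency level lambda:
    # x_t is directly revealed preferred to x_s iff lambda * I_t >= p_t . x_s
    direct = efficiency * budget[:, None] >= expenditure - 1e-9
    # Transitive closure of the relation (Floyd-Warshall on booleans)
    closure = direct.copy()
    for k in range(n):
        closure |= closure[:, [k]] & closure[[k], :]
    # GARP: if x_t is revealed preferred to x_s, then x_s must not be
    # strictly cheaper than x_t at the prices of observation s
    for t in range(n):
        for s in range(n):
            if closure[t, s] and efficiency * budget[s] > expenditure[s, t] + 1e-9:
                return False
    return True

def ccei(prices, bundles, tol=1e-4):
    """Largest lambda in [0, 1] such that the scaled budgets satisfy GARP."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Example: prices are the per-point returns (a_t, b_t), bundles are the
# allocations (p_t^A, p_t^B) chosen by the investor at each round.
prices  = np.array([[0.8, 0.5], [0.3, 0.9]])
bundles = np.array([[50.0, 50.0], [50.0, 50.0]])
print(ccei(prices, bundles))   # 1.0: equal splits satisfy GARP
```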

To ensure response consistency, each model undergoes 30 iterations of the game with a fixed temperature of 0.0.

The results indicate significant differences in decision-making consistency among the evaluated models. Mistral-Small demonstrates the highest level of rationality, with CCEI values consistently above 0.75. Llama3 performs moderately well, with CCEI values ranging between 0.2 and 0.74. DeepSeek-R1 exhibits inconsistent behaviour, with CCEI scores varying widely between 0.15 and 0.83.

CCEI Distribution per model

Preferences

To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role.

First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we give them explicit instructions incorporating preferences, to assess their ability to take these preferences into account in their decisions.

Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM was asked to directly produce a one-shot action in the dictator game. Additionally, we also asked the models to generate a strategy in the form of an algorithm implemented in the Python language. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to 0.7.
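
As an illustration of what such a generated strategy can look like, the sketch below shows a hypothetical equal-split strategy (an illustrative example, not an actual model output).

```python
def dictator_strategy(total: int = 100) -> tuple[int, int]:
    """Hypothetical example of a generated dictator strategy:
    split the endowment equally between dictator and recipient."""
    my_share = total // 2
    return my_share, total - my_share

print(dictator_strategy())   # (50, 50)
```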

The figure below presents a violin plot illustrating the share of the total amount (100) that the dictator allocates to themselves for each model. The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 through one-shot decisions is 50.

Violin Plot of My Share for Each Model

When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except GPT-4.5, which retains about 70 % of the total amount. Interestingly, under these standard conditions, humans typically keep about 80 % of the total on average.

Forsythe, R., Horowitz, J.L., Savin, N.E., Sefton, M.: Fairness in Simple Bargaining Experiments. Games and Economic Behavior 6(3), 347–369 (1994). https://doi.org/10.1006/game.1994.1021

When the role assigned to the model is that of a human rather than an assistant agent, only Llama3 deviates with a median share of $60.

Unlike the deterministic strategies generated by LLMs, the intra-model variability in generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts.

The figure below illustrates the evolution of the dictator's share as a function of temperature, with a 95 % confidence interval, when we ask each model to generate decisions.

My Share vs Temperature with Confidence Interval

Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of available options.

Preference alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

  1. Egoism maximizes the dictator’s income.
  2. Altruism maximizes the recipient’s income.
  3. Utilitarianism maximizes total income.
  4. Egalitarianism maximizes the minimum income between the players.

We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences:

  • The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (egoistic).
  • The dictator keeps $100, the recipient receives $500, and $400 is lost (altruistic).
  • The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (utilitarian).
  • The dictator keeps $325, the other player receives $325, and $350 is lost (egalitarian).
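
As a point of reference, the sketch below (illustrative code, not taken from the repository) computes which of the four options maximizes each social-welfare function; a perfectly aligned dictator would pick the matching option.

```python
# The four allocation options, as (dictator's share, recipient's share).
OPTIONS = {
    "egoistic":    (500, 100),   # $400 lost
    "altruistic":  (100, 500),   # $400 lost
    "utilitarian": (400, 300),   # $300 lost
    "egalitarian": (325, 325),   # $350 lost
}

# The four social-welfare functions assigned to the dictator.
WELFARE = {
    "egoism":         lambda d, r: d,
    "altruism":       lambda d, r: r,
    "utilitarianism": lambda d, r: d + r,
    "egalitarianism": lambda d, r: min(d, r),
}

for preference, welfare in WELFARE.items():
    best = max(OPTIONS, key=lambda name: welfare(*OPTIONS[name]))
    print(f"{preference:15s} -> {best}")
# egoism -> egoistic, altruism -> altruistic,
# utilitarianism -> utilitarian, egalitarianism -> egalitarian
```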

The table below evaluates the ability of the models to align with the different preferences.

  • When generating strategies, the models align perfectly with preferences, except for DeepSeek-R1, which does not generate valid code.
  • When generating actions, GPT-4.5 aligns well with preferences but struggles with utilitarianism.
  • Llama3 aligns well with egoistic and altruistic preferences but shows lower adherence to utilitarian and egalitarian choices.
  • Mistral-Small aligns better with altruistic preferences and performs moderately on utilitarianism but struggles with egoistic and egalitarian preferences.
  • DeepSeek-R1 primarily aligns with utilitarianism but has low accuracy in other preferences.

| Model | Generation | Egoistic | Altruistic | Utilitarian | Egalitarian |
|---|---|---|---|---|---|
| GPT-4.5 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
| Llama3 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
| Mistral-Small | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
| DeepSeek-R1 | Strategy | - | - | - | - |
| GPT-4.5 | Actions | 1.00 | 1.00 | 0.50 | 1.00 |
| Llama3 | Actions | 1.00 | 0.90 | 0.40 | 0.73 |
| Mistral-Small | Actions | 0.40 | 0.93 | 0.76 | 0.16 |
| DeepSeek-R1 | Actions | 0.06 | 0.20 | 0.76 | 0.03 |

Errors in action selection may stem from either arithmetic miscalculations (e.g., the model incorrectly assumes that \(500 + 100 > 400 + 300\)) or misinterpretations of preferences. For example, the model DeepSeek-R1, adopting utilitarian preferences, justifies its choice by stating, "I think fairness is key here".

In summary, our results indicate that the models GPT-4.5, Llama3, and Mistral-Small generally align well with preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, DeepSeek-R1 does not generate valid strategies and performs poorly when generating specific actions.

Rationality

An autonomous agent is rational if it chooses the optimal action based on its beliefs. This agent satisfies second-order rationality if it is rational and believes that other agents are rational. In other words, a second-order rational agent does not only consider the best choice for itself but also anticipates how others make their decisions. Experimental game theory studies show that 93 % of human subjects are rational, while 71 % exhibit second-order rationality.


To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below.

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---|---|---|
| Strategy X | (15, 10) | (5, 5) |
| Strategy Y | (0, 5) | (10, 0) |

If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X. To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.
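
This reasoning can be checked mechanically. The sketch below (an illustrative check, not the repository's code) encodes the payoff matrix above and verifies that A is strictly dominant for Player 2 and that X is Player 1's best response once Player 2's rationality is anticipated.

```python
# Payoff matrix of the ring-network game: (Player 1 payoff, Player 2 payoff).
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

def dominant_action_p2():
    """Return Player 2's strictly dominant action, if any."""
    payoff = lambda a1, a2: PAYOFFS[(a1, a2)][1]
    if all(payoff(a1, "A") > payoff(a1, "B") for a1 in ("X", "Y")):
        return "A"
    if all(payoff(a1, "B") > payoff(a1, "A") for a1 in ("X", "Y")):
        return "B"
    return None

def best_response_p1(belief_about_p2):
    """Player 1's best response given a belief about Player 2's action."""
    return max(("X", "Y"), key=lambda a1: PAYOFFS[(a1, belief_about_p2)][0])

# First-order rationality: A strictly dominates B for Player 2.
assert dominant_action_p2() == "A"
# Second-order rationality: anticipating A, Player 1 must play X.
assert best_response_p1(dominant_action_p2()) == "X"
```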

We consider three types of beliefs:

  • an implicit belief, where the optimal action must be deduced from the natural language description of the payoff matrix;
  • an explicit belief, based on the analysis of Player 2's actions: the fact that B is strictly dominated by A is provided in the prompt;
  • a given belief, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.

First Order Rationality

The table below evaluates the models' ability to generate rational behaviour for Player 2.

| Model | Generation | Given | Explicit | Implicit |
|---|---|---|---|---|
| GPT-4.5 | Strategy | 1.00 | 1.00 | 1.00 |
| Mistral-Small | Strategy | 1.00 | 1.00 | 1.00 |
| Llama3 | Strategy | 0.50 | 0.50 | 0.50 |
| DeepSeek-R1 | Strategy | - | - | - |
| GPT-4.5 | Actions | 1.00 | 1.00 | 1.00 |
| Mistral-Small | Actions | 1.00 | 1.00 | 0.87 |
| Llama3 | Actions | 1.00 | 0.90 | 0.17 |
| DeepSeek-R1 | Actions | 0.83 | 0.57 | 0.60 |

When generating strategies, GPT-4.5 and Mistral-Small exhibit rational behaviour, whereas Llama3 adopts a random strategy. DeepSeek-R1 fails to generate valid output. When generating actions, GPT-4.5 demonstrates its ability to make rational decisions, even with implicit beliefs. Mistral-Small outperforms other open-weight models. Llama3 struggles to infer optimal actions based solely on implicit beliefs. DeepSeek-R1 is not a good candidate for simulating rationality.

Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for Player 1 in the table below: (a) the original configuration, (b) the reduction of the gap between the gains, (c) the decrease in the gain for the good choice X, and (d) the increase in the gain for the bad choice Y.

| Player 1 \ Player 2 | (a) A | (a) B | (b) A | (b) B | (c) A | (c) B | (d) A | (d) B |
|---|---|---|---|---|---|---|---|---|
| X | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
| Y | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
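
Because the variants only modify Player 1's payoffs, a rational Player 2 still plays A in every case; the short check below (illustrative sketch) confirms that X remains the second-order rational choice in all four variants.

```python
# Player 1's payoffs in the four variants (rows X/Y, columns A/B of Player 2).
VARIANTS = {
    "a": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 10}},
    "b": {"X": {"A": 8,  "B": 7}, "Y": {"A": 7, "B": 8}},
    "c": {"X": {"A": 6,  "B": 5}, "Y": {"A": 0, "B": 10}},
    "d": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 40}},
}

# A second-order rational Player 1 anticipates that a rational Player 2
# plays A (B is strictly dominated) and best-responds to A.
for version, payoff in VARIANTS.items():
    best = max(("X", "Y"), key=lambda a1: payoff[a1]["A"])
    print(version, "->", best)   # X in every variant
```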

The table below evaluates the models' ability to generate second-order rational behaviour for Player 1.

When the models generate strategies, GPT-4.5 exhibits second-order rational behaviour in configurations (a), (c), and (d), but fails in configuration (b) to distinguish the optimal action from a nearly optimal one. Llama3 makes its decision randomly. Mistral-Small shows strong capabilities in generating second-order rational behaviour. DeepSeek-R1 does not produce valid responses.

When generating actions, Llama3 adapts to different types of beliefs and adjustments in the payoff matrix. GPT-4.5 performs well in the initial configuration (a), but encounters significant difficulties when the payoff structure changes (b, c, d), particularly with implicit beliefs. Although Mistral-Small works well with given or explicit beliefs, it faces difficulties with implicit beliefs, especially in variant (d). DeepSeek-R1 does not appear to be a good candidate for simulating second-order rationality.

| Model | Generation | (a) Given | (a) Explicit | (a) Implicit | (b) Given | (b) Explicit | (b) Implicit | (c) Given | (c) Explicit | (c) Implicit | (d) Given | (d) Explicit | (d) Implicit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.5 | Strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| Llama3 | Strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| Mistral-Small | Strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| DeepSeek-R1 | Strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| GPT-4.5 | Actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| Llama3 | Actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| Mistral-Small | Actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| DeepSeek-R1 | Actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |

Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the Mistral-Small model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."

In summary, the results indicate that GPT-4.5 and Mistral-Small generally adopt first- and second-order rational behaviours. However, GPT-4.5 struggles to distinguish an optimal action from a nearly optimal one, while Mistral-Small encounters difficulties with implicit beliefs. Llama3 generates strategies randomly but adapts better when producing specific actions. In contrast, DeepSeek-R1 fails to provide valid strategies and generates irrational actions.

Beliefs

Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.

To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:

  • the opponent follows a hidden strategy, i.e., a repetition model;
  • the player must predict the opponent's next move (Rock, Paper, or Scissors);
  • a correct prediction earns 1 point, while an incorrect one earns 0 points;
  • the game can be played for \(N\) rounds, and the player's accuracy is evaluated at each round.

For our experiments, we consider three simple models for the opponent where:

  • the opponent's actions remain constant (always R, S, or P);
  • the opponent's actions follow a two-step loop (R-P, P-S, or S-R);
  • the opponent's actions follow a three-step loop (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round.
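
As an illustration of this setup (a sketch under the rules above, not the repository's code), the snippet below encodes the three opponent behaviour models and the per-round prediction score.

```python
from itertools import cycle, islice

# Illustrative generators for the three opponent behaviour models.
OPPONENTS = {
    "constant": cycle(["R"]),            # always the same move
    "2-loop":   cycle(["R", "P"]),       # two-step repetition (R-P)
    "3-loop":   cycle(["R", "P", "S"]),  # three-step repetition (R-P-S)
}

def prediction_score(predictions, opponent_moves):
    """1 point per correct prediction, 0 otherwise, averaged per round."""
    hits = sum(p == m for p, m in zip(predictions, opponent_moves))
    return hits / len(predictions)

# Example: a player who always predicts "R" against the 2-loop opponent.
moves = list(islice(OPPONENTS["2-loop"], 10))
print(prediction_score(["R"] * 10, moves))   # 0.5
```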

The figures below present the average points earned per round and the 95 % confidence interval for each LLM against the three opponent behaviour models in the simplified version of the RPS game, whether the LLM generates a strategy or one-shot actions. We observe that the performance of LLMs in action generation, except for Mistral-Small when facing a constant strategy, is barely better than a random strategy. The strategies generated by GPT-4.5 and Mistral-Small predict the opponent's next move based on previous rounds by identifying the most frequently played move. While these strategies are effective against an opponent with constant behaviour, they fail to predict the opponent's next move when the latter adopts a more complex model. Neither Llama3 nor DeepSeek-R1 was able to generate a valid strategy.

Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)

Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)

Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)
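
The frequency-based prediction strategy described above can be sketched as follows (an illustrative reconstruction, not an actual generated strategy); it is near-perfect against a constant opponent but degrades against the loop patterns.

```python
from collections import Counter
from itertools import cycle, islice
import random

MOVES = ("R", "P", "S")

def frequency_predictor(history):
    """Predict the opponent's most frequent past move (random on round 1)."""
    if not history:
        return random.choice(MOVES)
    return Counter(history).most_common(1)[0][0]

def average_points(opponent_moves):
    """Average prediction points per round (1 if correct, 0 otherwise)."""
    history, points = [], 0
    for move in opponent_moves:
        points += frequency_predictor(history) == move
        history.append(move)
    return points / len(opponent_moves)

for name, pattern in [("constant", ["R"]), ("2-loop", ["R", "P"]), ("3-loop", ["R", "P", "S"])]:
    moves = list(islice(cycle(pattern), 30))
    print(f"{name}: {average_points(moves):.2f}")
# The frequency heuristic is near-perfect against the constant opponent
# but drops to roughly 1/2 and 1/3 against the 2-loop and 3-loop patterns.
```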

To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.

The figures below illustrate the average points earned per round, along with the 95 % confidence interval, for each LLM against the three opponent behaviour models, whether the model generates a full strategy or one-shot actions. The results show that LLMs' performance in action generation against a constant strategy is only marginally better than a random strategy. While Mistral-Small can accurately predict its opponent's move, it fails to integrate this belief into its decision-making process.

Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)

Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)

Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)
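
For reference, the scoring of the full game and the best reply to a predicted move can be sketched as follows (illustrative code, not the repository's implementation).

```python
BEATS = {"R": "S", "P": "R", "S": "P"}                        # each key beats its value
COUNTER_MOVE = {loser: winner for winner, loser in BEATS.items()}

def score(player_move, opponent_move):
    """Scoring of the full RPS game described above: win = 2, draw = 1, loss = 0."""
    if player_move == opponent_move:
        return 1
    return 2 if BEATS[player_move] == opponent_move else 0

def best_reply(predicted_move):
    """Integrating a belief into the decision: play the move that beats the prediction."""
    return COUNTER_MOVE[predicted_move]

assert best_reply("R") == "P" and score("P", "R") == 2
```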

In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents’ actions into their decision-making.

Synthesis

Our results show that Mistral-Small exhibits the highest level of economic rationality, while Llama3 shows moderate consistency and DeepSeek-R1 remains highly inconsistent. GPT-4.5, Llama3, and Mistral-Small generally respect preferences but encounter more difficulties in generating one-shot actions than in producing strategies in the form of algorithms. GPT-4.5 and Mistral-Small generally adopt first- and second-order rational behaviours, whereas Llama3, despite generating random strategies, adapts better when producing one-shot actions. In contrast, DeepSeek-R1 fails to develop valid strategies and performs poorly in generating actions that align with preferences or rationality principles. More critically, all the LLMs we evaluated struggle both to anticipate other agents' actions and to integrate them effectively into their decision-making process.

Authors

Maxime MORGE

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.