PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.

This prototype explores the potential of homo silicus for social simulation. We examine the behaviour exhibited by intelligent machines, particularly how generative agents deviate from the principles of rationality. To assess their responses to simple human-like strategies, we employ a series of tightly controlled and theoretically well-understood games. Through behavioral game theory, we evaluate the ability of GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 to make coherent one-shot decisions, generate algorithmic strategies based on explicit preferences, adhere to first- and second-order rationality principles, and refine their beliefs in response to other agents' behaviours.

Economic Rationality

To evaluate the economic rationality of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game’s parameters.

In this game, an investor allocates a basket $x_t = (x^A_t, x^B_t)$ of 100 points between two assets: Asset A and Asset B. The value of these points depends on random prices $p_t = (p^A_t, p^B_t)$, which determine the monetary return per allocated point. For example, if $p^A_t = 0.8$ and $p^B_t = 0.5$, each point assigned to Asset A is worth $0.8, while each point allocated to Asset B yields $0.5. The game is played 25 times to assess the consistency of the investor's decisions.

To evaluate the rationality of the decisions, we use Afriat's critical cost efficiency index (CCEI), a widely used measure in experimental economics. The CCEI assesses whether choices adhere to the generalized axiom of revealed preference (GARP), a fundamental principle of rational decision-making. If an individual violates rational choice consistency, the CCEI determines the minimal budget adjustment required to make their decisions align with rationality. Mathematically, the budget for each basket is calculated as $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is derived from the observed decisions by solving an optimization problem that finds the largest $\lambda$, with $0 \leq \lambda \leq 1$, such that for every observation the adjusted decisions satisfy the rationality constraint $p_t \cdot x_t \leq \lambda I_t$. In other words, if we slightly reduce the budget by multiplying it by $\lambda$, the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory; a low CCEI suggests irrational or inconsistent decision-making. In their 2007 study on portfolio choices, Choi et al. found that participants exhibited a high degree of rationality, with average CCEI values around 0.95: Choi, S., Fisman, R., Gale, D., & Kariv, S. (2007). Consistency and heterogeneity of individual behavior under uncertainty. American Economic Review, 97(5), 1921–1938.
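In practice, the CCEI can be computed from the game logs by a binary search over $\lambda$ combined with a GARP check on the revealed-preference relation. The sketch below is an illustrative reimplementation under that approach (it assumes NumPy arrays of prices and allocations and is not the repository's exact code).

```python
import numpy as np

def satisfies_garp(prices, bundles, e):
    """Check whether choices satisfy GARP when every budget is scaled by e."""
    expenditure = np.einsum("ti,ti->t", prices, bundles)   # I_t = p_t . x_t
    cross = np.einsum("ti,si->ts", prices, bundles)        # cross[t, s] = p_t . x_s
    # x_t is directly revealed preferred to x_s if x_s was affordable: e * I_t >= p_t . x_s
    relation = e * expenditure[:, None] >= cross
    closure = relation.copy()
    for k in range(len(prices)):                           # boolean Floyd-Warshall transitive closure
        closure |= closure[:, [k]] & closure[[k], :]
    strictly = e * expenditure[:, None] > cross            # strictly[s, t]: e * I_s > p_s . x_t
    # GARP violation: x_t (indirectly) preferred to x_s while x_s strictly preferred to x_t
    return not np.any(closure & strictly.T)

def ccei(prices, bundles, tol=1e-4):
    """Largest e in [0, 1] such that the e-adjusted budgets satisfy GARP (binary search)."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy run: three consistent allocations of 100 points under different prices.
prices = np.array([[0.8, 0.5], [0.5, 0.8], [1.0, 1.0]])
bundles = np.array([[30.0, 70.0], [70.0, 30.0], [50.0, 50.0]])
print(ccei(prices, bundles))   # 1.0
```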

To ensure response consistency, each model undergoes 30 iterations of the game with a fixed temperature of 0.0. The results shown in the figure below highlight significant differences in decision-making consistency among the evaluated models. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b stand out with a perfect CCEI score of 1.0, indicating flawless rationality in decision-making. Qwen3, Mistral-Small, and Mixtral:8x7b demonstrate the next highest level of rationality. Llama3 performs moderately well, with CCEI values ranging between 0.2 and 0.74. DeepSeek-R1 exhibits inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83.

CCEI Distribution per model

Preferences

To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role.

First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we subject them to specific instructions incorporating preferences to assess their ability to consider them in their decisions.

Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM is asked to directly produce a one-shot action in the dictator game. Additionally, we also ask the models to generate a strategy in the form of an algorithm implemented in the Python language. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to 0.7.
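Concretely, a single elicitation run amounts to repeatedly querying the model at temperature 0.7 and parsing the split it proposes. The sketch below assumes the open-weight models are served locally through the Ollama Python client (`ollama.chat`); the prompt wording and the `parse_share` helper are illustrative, not the repository's exact code, and GPT-4.5 would instead be queried through the OpenAI API.

```python
import re
import ollama  # assumes a local Ollama server with the model already pulled

PROMPT = (
    "You are the dictator in a dictator game. You must split $100 between "
    "yourself and a passive recipient. Reply only with the amount you keep."
)

def parse_share(text):
    """Extract the first number from the model's reply (illustrative helper)."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def elicit_one_shot(model="llama3", n_runs=30, temperature=0.7):
    """Repeat the one-shot dictator decision n_runs times and collect the kept shares."""
    shares = []
    for _ in range(n_runs):
        reply = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            options={"temperature": temperature},
        )
        shares.append(parse_share(reply["message"]["content"]))
    return shares

print(elicit_one_shot())
```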

Figure below presents a violin plot illustrating the share of the total amount ($100) that the dictator allocates to themselves for each model. Notably, human participants under similar conditions typically keep around $80 on average: Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6(3), 347–369. https://doi.org/10.1006/game.1994.1021

The median share taken by GPT-4.5, Llama3, Mistral-Small, DeepSeek-R1, and Qwen3 through one-shot decisions is $50, likely due to corpus-based biases such as term frequency. The median share taken by Mixtral:8x7b and Llama3.3:latest is $60. When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except GPT-4.5, which retains about 70% of the total amount. Interestingly, these shares fall short of the $80 that humans typically keep under these conditions. When the role assigned to the model is that of a human rather than an assistant agent, only Llama3 deviates, with a median share of $60. Unlike the deterministic strategies generated by LLMs, the intra-model variability in generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts.

Violin Plot of My Share for Each Model

Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of available options.

My Share vs Temperature with Confidence Interval

Preference alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

  1. Egoism maximizes the dictator’s income.
  2. Altruism maximizes the recipient’s income.
  3. Utilitarianism maximizes total income.
  4. Egalitarianism maximizes the minimum income between the players.

We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences:

  • The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (egoistic).
  • The dictator keeps $100, the recipient receives $500, and $400 is lost (altruistic).
  • The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (utilitarian).
  • The dictator keeps $325, the other player receives $325, and $350 is lost (egalitarian).
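A strategy that is perfectly aligned with each preference amounts to a short maximization over these four options. The sketch below is a minimal illustration of what such a generated strategy would do; it is not the models' verbatim output.

```python
# The four allocation options above, as (dictator, recipient) amounts.
OPTIONS = {
    "egoistic":    (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# One social-welfare function per preference.
WELFARE = {
    "egoistic":    lambda d, r: d,          # dictator's income
    "altruistic":  lambda d, r: r,          # recipient's income
    "utilitarian": lambda d, r: d + r,      # total income
    "egalitarian": lambda d, r: min(d, r),  # minimum income
}

def dictator_choice(preference):
    """Pick the allocation that maximizes the welfare function of the given preference."""
    return max(OPTIONS.values(), key=lambda alloc: WELFARE[preference](*alloc))

for pref in WELFARE:
    print(pref, dictator_choice(pref))
# egoistic (500, 100), altruistic (100, 500), utilitarian (400, 300), egalitarian (325, 325)
```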

Table below evaluates the ability of the models to align with different preferences.

  • When generating strategies, the models align perfectly with preferences, except for
    • DeepSeek-R1 and Mixtral:8x7b, which do not generate valid code;
    • Qwen3, which fails to adopt egoistic or altruistic strategies but adheres to utilitarian and egalitarian preferences.
  • When generating actions,
    • GPT-4.5 aligns well with preferences but struggles with utilitarianism.
    • Llama3 aligns well with egoistic and altruistic preferences but shows lower adherence to utilitarian and egalitarian choices.
    • Mistral-Small aligns better with altruistic preferences and performs moderately on utilitarianism but struggles with egoistic and egalitarian preferences.
    • DeepSeek-R1 primarily aligns with utilitarianism but has low accuracy in other preferences.
    • Qwen3 strongly aligns with utilitarian preferences and moderately with altruistic ones (0.80), but fails to exhibit egoistic behavior and shows weak alignment with egalitarianism.

While a larger LLM typically aligns better with preferences, a model like Mixtral-8x7B may occasionally underperform compared to its smaller counterpart, Mistral-Small, due to its architectural complexity. Mixture-of-Experts (MoE) models, like Mixtral, dynamically activate only a subset of their parameters. If the routing mechanism is not well tuned, it may select suboptimal experts, leading to degraded performance.

Model Generation Egoistic Altruistic Utilitarian Egalitarian
GPT-4.5 Strategy 1.00 1.00 1.00 1.00
Llama3.3:latest Strategy 1.00 1.00 1.00 1.00
Llama3 Strategy 1.00 1.00 1.00 1.00
Mixtral:8x7b Strategy - - - -
Mistral-Small Strategy 1.00 1.00 1.00 1.00
DeepSeek-R1:7b Strategy 1.00 1.00 1.00 1.00
DeepSeek-R1 Strategy - - - -
Qwen3 Strategy 0.00 0.00 1.00 1.00
GPT-4.5 Actions 1.00 1.00 0.50 1.00
Llama3.3:latest Actions 1.00 1.00 0.43 0.96
Llama3 Actions 1.00 0.90 0.40 0.73
Mixtral:8x7b Actions 0.00 0.00 0.30 1.00
Mistral-Small Actions 0.40 0.94 0.76 0.16
DeepSeek-R1:7b Actions 0.46 0.56 0.66 0.90
DeepSeek-R1 Actions 0.06 0.20 0.76 0.03
Qwen3 Actions 0.00 0.80 0.93 0.36

Errors in action selection may stem from either arithmetic miscalculations (e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or misinterpretations of preferences. For example, DeepSeek-R1, asked to adopt utilitarian preferences, justifies its choice by stating, "I think fairness is key here".

In summary, our results indicate that the models GPT-4.5, Llama3, and Mistral-Small generally align well with preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, DeepSeek-R1 does not generate valid strategies and performs poorly when generating specific actions.

Social preference

To analyze the behavior of generative agents based on their preferences under strategic interaction, we rely on the ultimatum game. In this game, the proposer (analogous to the dictator) is tasked with deciding how to divide an endowment (e.g., a sum of money) between themselves and a second player, the responder. However, unlike in the dictator game, the responder plays an active role: they can either accept or reject the proposed allocation. If the offer is rejected, both players receive nothing.

Firstly, we evaluate the choices made by LLMs when playing the role of the proposer, interpreting these decisions as a reflection of their implicit social norms or strategic preferences, especially when anticipating potential rejection by the responder. Oosterbeek et al. find that on average the proposer offers 40% of the pie to the responder. Oosterbeek, H., Sloof, R., & Van De Kuilen, G. (2004). Cultural differences in ultimatum game experiments: Evidence from a meta-analysis. Experimental Economics, 7, 171–188. https://doi.org/10.1023/B:EXEC.0000026978.14316.74

The figure below presents a violin plot illustrating the share of the total amount ($100) that the proposer allocates to themselves for each model. The share selected by strategies generated by Llama3, Mistral-Small, and Qwen3 aligns with the median share chosen by actions generated by the models Mistral-Small, Mixtral:8x7B, and DeepSeek-R1:7B, around $50 — likely reflecting corpus-based biases, such as term frequency. The share selected by strategies generated by Llama3.3 and DeepSeek-R1:7B resembles the median share in the actions generated by GPT-4.5 and Llama3, around $60, which is consistent with what human participants typically choose under similar conditions. While the shares selected by strategies from GPT-4.5 and Mixtral:8x7B are respectively overestimated and underestimated, the actions generated by DeepSeek-R1:7B and Qwen3 can be considered irrational.

Violin Plot of My Share for Each Model

Secondly, we analyze the behavior of LLMs when assuming the role of the responder, focusing on whether their acceptance or rejection of offers reveals a human-like sensitivity to unfairness. The meta-analysis by Oosterbeek et al. (2004) reports that human participants reject 16% of offers that amount to 40% of the total stake. This finding suggests that factors beyond purely economic self-interest—such as fairness concerns or the desire to punish perceived injustice—significantly influence decision-making.

The figure below presents a violin plot illustrating the acceptance rate of the responder for each model when offered $40 out of $100. While GPT-4.5, Llama3, Llama3.3, Mixtral:8x7B, Deepseek-R1:7B, and Qwen3 exhibit a rational median acceptance rate of 1.0, Mistral-Small and Deepseek-R1 display an irrational median acceptance rate of 0.0.

It is worth noting that these results are not necessarily compliant with the strategies generated by the models. For instance, GPT-4.5 accepts offers as low as 20%, interpreting them as minimally fair, while Mistral-Small employs a tiered strategy that only consistently accepts offers of 50% or more, and randomly accepts those between 25% and 49%. Models like Llama3, Deepseek-R1, and Qwen3 exhibit rigid fairness thresholds, rejecting any offer below 50%. Llama3.3 uses a slightly more permissive threshold of 30%, leading to greater acceptance at lower offers. These results suggest that most LLMs do not capture the influence of perceived injustice that shapes human decision-making in the ultimatum game.

Violin Plot of Acceptance Rate for Each Model

Strategic Rationality

An autonomous agent acts strategically, considering not only its own preferences but also the potential actions and preferences of others. It is strategically rational if it chooses the optimal action based on its beliefs. This agent satisfies second-order rationality if it is rational and believes that other agents are rational. In other words, a second-order rational agent does not only consider the best choice for itself but also anticipates how others make their decisions. Experimental game theory studies show that 93% of human subjects are rational, while 71% exhibit second-order rationality.

Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6(3), 347–369. https://doi.org/10.1006/game.1994.1021

To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below:

Player 1 \ Player 2 Strategy A Strategy B
Strategy X (15,10) (5,5)
Strategy Y (0,5) (10,0)

If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X. To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.
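The dominance and best-response reasoning above can be checked mechanically. The sketch below encodes the payoff matrix with hypothetical Python helpers (`dominated_action_for_player2`, `best_response_player1`) and verifies both rationality conditions; it is an illustration of the game's logic, not the evaluation harness itself.

```python
# Payoff matrix of the simplified ring-network game:
# (Player 1 action, Player 2 action) -> (payoff of Player 1, payoff of Player 2)
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

def dominated_action_for_player2():
    """Return Player 2's strictly dominated action, if any."""
    for a in ("A", "B"):
        for b in ("A", "B"):
            if a != b and all(PAYOFFS[(x, a)][1] < PAYOFFS[(x, b)][1] for x in ("X", "Y")):
                return a
    return None

def best_response_player1(belief_about_player2):
    """Player 1's best response given a belief about Player 2's action."""
    return max(("X", "Y"), key=lambda x: PAYOFFS[(x, belief_about_player2)][0])

# First-order rationality: Player 2 never plays the strictly dominated action B.
assert dominated_action_for_player2() == "B"
# Second-order rationality: believing Player 2 is rational (plays A), Player 1 plays X.
assert best_response_player1("A") == "X"
```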

We consider three types of beliefs:

  • an implicit belief, where the optimal action must be deduced from
    the natural language description of the payoff matrix;
  • an explicit belief, based on the analysis of player 2's actions, meaning that the fact that B is strictly dominated by A is provided in the prompt;
  • a given belief, where the optimal action for player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.

First-Order Rationality

Table below evaluates the models’ ability to generate rational behaviour for Player 2.

Model Generation Given Explicit Implicit
GPT-4.5 strategy 1.00 1.00 1.00
Mixtral:8x7b strategy 1.00 1.00 1.00
Mistral-Small strategy 1.00 1.00 1.00
Llama3.3:latest strategy 1.00 1.00 0.50
Llama3 strategy 0.50 0.50 0.50
Deepseek-R1:7b strategy - - -
Deepseek-R1 strategy - - -
Qwen3 strategy 0.00 0.00 0.00
GPT-4.5 actions 1.00 1.00 1.00
Mixtral:8x7b actions 1.00 1.00 1.00
Mistral-Small actions 1.00 1.00 0.87
Llama3.3:latest actions 1.00 1.00 1.00
Llama3 actions 1.00 0.90 0.17
Deepseek-R1:7b actions 1.00 1.00 1.00
Deepseek-R1 actions 0.83 0.57 0.60
Qwen3 actions 1.00 0.93 0.50

When generating strategies, GPT-4.5, Mixtral-8x7B, and Mistral-Small exhibit rational behavior, whereas Llama3 chooses at random and Qwen3 is irrational. Llama3.3:latest behaves randomly only with implicit beliefs. DeepSeek-R1:7b and DeepSeek-R1 fail to generate valid strategies. When generating actions, GPT-4.5, Mixtral-8x7B, DeepSeek-R1:7b, and Llama3.3:latest demonstrate strong rational decision-making, even with implicit beliefs. Mistral-Small and Qwen3 perform well but lag in handling implicit reasoning. Llama3 struggles with implicit reasoning, while DeepSeek-R1 shows inconsistent performance. Overall, GPT-4.5 and Mixtral-8x7B are the most reliable models for generating rational behavior.

Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for player 1 in Table below: (a) the original configuration, (b) the reduction of the gap between the gains, (c) the increase in the gain for the bad choice Y, and (d) the decrease in the gain for the good choice X.

Version a b c d
Player 1 \ Player 2 A B A B A B A B
X 15 5 8 7 6 5 15 5
Y 0 10 7 8 0 10 0 40

We introduce a prompt engineering method that incorporates Conditional Reasoning (CR), prompting the model to evaluate an opponent’s optimal response to each of its own possible actions to encourage strategic foresight and informed decision-making.
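For illustration, a CR instruction can be appended to the game description as in the hypothetical template below; the exact wording used in the experiments is not reproduced here.

```python
# Hypothetical conditional-reasoning (CR) prompt template; the experiments' exact wording may differ.
CR_TEMPLATE = (
    "You are Player 1 in the one-shot game described below.\n"
    "{payoff_description}\n"
    "Before answering, reason conditionally: for each action you could take, "
    "work out Player 2's best response, then choose the action that maximizes "
    "your own payoff given that response.\n"
    "Reply with a single action name."
)

prompt = CR_TEMPLATE.format(
    payoff_description="If you play X and Player 2 plays A, you earn 15 and they earn 10; ..."
)
print(prompt)
```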

Table below evaluates the models' ability to generate second-order rational behaviour for player 1. The configurations where CR improves second-order rationality are in bold, and those where CR degrades this rationality are in italics.

When generating strategies, GPT-4.5 consistently exhibits second-order rational behavior in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, showing no strong pattern of rational behavior. In contrast, Mistral-Small and Mixtral-8x7B demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior. Llama3.3:latest performs well with given and explicit beliefs but struggles with implicit beliefs. Qwen3 generates irrational strategies. DeepSeek-R1 does not produce valid responses in strategy generation.

When generating actions, Llama3.3:latest adapts well to different types of beliefs and adjustments in the payoff matrix but struggles with implicit beliefs, particularly in configuration (d). GPT-4.5 performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. Mixtral-8x7B generally performs well but shows reduced accuracy for implicit beliefs in configurations (b) and (d). Mistral-Small performs well with given or explicit beliefs but struggles with implicit beliefs, particularly in configuration (d). DeepSeek-R1:7b, in contrast to its smallest version, performs well across most belief structures but exhibits a slight decline in implicit beliefs, especially in (d). Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for implicit beliefs. Qwen3 performs robustly across most belief types, especially in configurations (a) and (b), maintaining strong scores on both explicit and implicit conditions. However, like other models, it experiences a noticeable drop in accuracy under implicit beliefs in configuration (d), suggesting sensitivity to deeper inferential reasoning.

It is worth noting that CR is not universally beneficial: while it notably improves reasoning in smaller models (like Mistral-Small, Deepseek-R1, and Qwen3), especially under implicit and explicit conditions, it often harms performance in larger models (e.g., GPT-4.5, Llama3.3, or Mixtral:8x7b), where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively — particularly in less confident or under-specified contexts.

Version a b c d
Model Generation Given Explicit Implicit Given Explicit Implicit Given Explicit Implicit Given Explicit Implicit
GPT-4.5 strategy 1.00 1.00 1.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00
Llama3.3:latest strategy 1.00 1.00 0.50 1.00 1.00 0.50 1.00 1.00 0.50 1.00 1.00 0.50
Llama3 strategy 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
Mixtral:8x7b strategy 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Mistral-Small strategy 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Deepseek-R1:7b strategy - - - - - - - - - - - -
Deepseek-R1 strategy - - - - - - - - - - - -
Qwen3 strategy 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GPT-4.5 actions 1.00 1.00 1.00 1.00 0.67 0.00 0.86 0.83 0.00 0.50 0.90 0.00
actions + CR 1.00 1.00 1.00 0.10 0.20 0.66 0.23 0.96 0.86 0.03 0.00 0.16
Llama3.3:latest actions 1.00 1.00 1.00 1.00 1.00 0.50 1.00 1.00 0.20 1.00 1.00 0.00
actions + CR 1.00 1.00 0.96 0.96 1.00 0.96 1.00 1.00 0.80 1.00 1.00 0.90
Llama3 actions 0.97 1.00 1.00 0.77 0.80 0.60 0.97 0.90 0.93 0.83 0.90 0.60
actions + CR 0.90 0.90 0.86 0.50 0.50 0.50 0.76 0.96 0.70 0.67 0.83 0.67
Mixtral:8x7b actions 1.00 1.00 1.00 1.00 1.00 0.50 1.00 1.00 1.00 1.00 1.00 0.73
actions + CR 1.00 0.96 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.28
Mistral-Small actions 0.93 0.97 1.00 0.87 0.77 0.60 0.77 0.60 0.70 0.73 0.57 0.37
actions + CR 1.00 0.93 1.00 0.95 0.96 0.90 0.90 0.76 0.43 0.67 0.40 0.37
Deepseek-R1:7b actions 1.00 0.96 1.00 1.00 1.00 0.93 0.96 1.00 0.92 0.96 1.00 0.79
actions + CR 1.00 1.00 1.00 1.00 1.00 1.00 0.90 1.00 1.00 1.00 1.00 1.00
Deepseek-R1 actions 0.80 0.53 0.56 0.67 0.60 0.53 0.67 0.63 0.47 0.70 0.50 0.57
actions + CR 0.80 0.63 0.60 0.67 0.63 0.70 0.67 0.70 0.50 0.63 0.76 0.70
Qwen3 actions 1.00 1.00 1.00 0.90 0.96 1.00 1.00 0.96 0.70 1.00 0.96 0.46
actions + CR 1.00 1.00 1.00 1.00 1.00 1.00 0.96 1.00 1.00 0.96 0.96 0.83

Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the Mistral-Small model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."

In summary, Mixtral-8x7B and GPT-4.5 demonstrate the strongest performance in both first- and second-order rationality, though GPT-4.5 struggles with near-optimal decisions and Mixtral-8x7B has reduced accuracy with implicit beliefs. Mistral-Small also performs well but faces difficulties with implicit beliefs, particularly in second-order reasoning. Llama3.3:latest succeeds with explicit or given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex decision-making. DeepSeek-R1:7b shows strong first-order rationality but its performance declines with implicit beliefs, especially in second-order rationality tasks. In contrast, DeepSeek-R1 and Llama3 exhibit inconsistent and often irrational decision-making, failing to generate valid strategies in many cases. Qwen3 struggles to generate valid strategies, reflecting limited high-level planning. However, it shows strong first-order rationality when producing actions, especially under explicit or guided conditions, and benefits from conditional reasoning. Its performance declines with implicit beliefs, highlighting limitations in deeper inference.

Beliefs - MP

Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.

Refine beliefs

To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider the matching pennies game, played between two players: an agent and an opponent. Each player has a penny and must secretly turn it to Head or Tail. The players then reveal their choices simultaneously. If the pennies match (both Heads or both Tails), the agent wins 1 point. If not, the opponent wins and the agent loses 1 point. The objective is to maximize the total gain of the agent.

In this game:

  • the opponent follows a hidden strategy, i.e., a repetition model;
  • the agent must predict the opponent's next move (Head or Tail);
  • a correct prediction earns 1 point, while an incorrect one earns 0 points;
  • the game can be played for N = 10 rounds, and the agent's accuracy is evaluated at each round.

For our experiments, we consider two simple models for the opponent where:

  • the actions remain constant in the form of Head or Tail, respectively;
  • the actions alternate between the two options (Head-Tail or Tail-Head).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round.
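The two opponent models and the per-round scoring can be sketched as follows. This is a minimal simulation harness under the scoring described above; the `naive` predictor is a hypothetical baseline, not one of the evaluated LLMs.

```python
import itertools

def constant_opponent(move="Head"):
    """Opponent that always plays the same move."""
    while True:
        yield move

def alternating_opponent(start="Head"):
    """Opponent that alternates between Head and Tail every round."""
    first, second = ("Head", "Tail") if start == "Head" else ("Tail", "Head")
    yield from itertools.cycle((first, second))

def play_matching_pennies(predict, opponent, n_rounds=10):
    """Play N rounds; predict(history) returns the agent's guess of the opponent's next move."""
    history, points = [], 0
    for _ in range(n_rounds):
        actual = next(opponent)
        if predict(history) == actual:   # 1 point per correct prediction, 0 otherwise
            points += 1
        history.append(actual)
    return points

# A naive agent that repeats the opponent's last observed move.
naive = lambda history: history[-1] if history else "Head"
print(play_matching_pennies(naive, constant_opponent()))     # 10: the constant move matches the first guess
print(play_matching_pennies(naive, alternating_opponent()))  # 1: always one step behind the alternation
```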

Figures below present the average points earned and prediction accuracy per round (95% confidence interval) for each LLM against the two opponent behavior models (constant and alternating) in the matching pennies game.

Against Constant behavior, GPT-4.5 and Qwen3 ...

Prediction Accuracy per Round by Actions Against Constant Behaviour (with 95% Confidence Interval) Points Earned per Round by Actions Against Constant Behaviour (with 95% Confidence Interval)

Prediction Accuracy per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval) Points Earned per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval)

Beliefs - RPS

Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.

Refine beliefs

To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:

  • the opponent follows a hidden strategy, i.e., a repetition model;
  • the player must predict the opponent's next move (Rock, Paper, or Scissors);
  • a correct prediction earns 1 point, while an incorrect one earns 0 points;
  • the game can be played for N = 10 rounds, and the player's accuracy is evaluated at each round.

For our experiments, we consider three simple models for the opponent where:

  • the actions remain constant in the form of R, S, or P, respectively;
  • the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
  • the opponent's actions follow a three-step loop model (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round.

Figures present the average points earned per round and the 95% confidence interval for each LLM against the three opponent behavior models in a simplified version of the Rock-Paper-Scissors (RPS) game, whether the LLM generates a strategy or one-shot actions.

Neither Llama3, DeepSeek-R1, nor Qwen3 was able to generate a valid strategy. DeepSeek-R1:7b was unable to generate either a valid strategy or consistently valid actions. The strategies generated by the GPT-4.5 and Mistral-Small models attempt to predict the opponent's next move based on previous rounds by identifying the most frequently played move. While these strategies are effective against an opponent with constant behavior, they fail to predict the opponent's next move when the latter adopts a more complex model. We observe that the performance of most LLMs in action generation — except for Llama3.3:latest, Mixtral:8x7b, Mistral-Small, and Qwen3 when facing a constant strategy — is barely better than a random strategy.
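The frequency-based prediction described above can be sketched as follows; it succeeds against a constant opponent but cannot anticipate a looping one. This is an illustrative reconstruction, not the models' verbatim output.

```python
from collections import Counter

def predict_most_frequent(history, default="R"):
    """Predict the opponent's next move as their most frequently played move so far."""
    if not history:
        return default
    return Counter(history).most_common(1)[0][0]

# Against a constant opponent (e.g., always "R"), the prediction is correct from round 2 onward.
# Against a two-step loop (e.g., R-P-R-P...), the frequency count stays tied or tracks the
# majority move, so the predictor never learns the alternation and wins at most about half
# the rounds; against the three-step loop R-P-S it does no better than chance.
```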

Average Points Earned per Round By Strategies Against Constant Behaviour (with 95% Confidence Interval) Average Points Earned per Round By Actions Against Constant Behaviour (with 95% Confidence Interval)

Average Points Earned per Round by Strategies Against 2-Loop Behaviour (with 95% Confidence Interval) Average Points Earned per Round by Actions Against 2-Loop Behaviour (with 95% Confidence Interval)

Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval) Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)

Assess Beliefs

To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.

Figure below illustrates the average points earned per round along with the 95% confidence interval for each LLM facing constant strategies when the model generates one-shot actions. Even though Mixtral:8x7b, Mistral-Small, and Qwen3 accurately predict their opponent's move, they fail to integrate this belief into their decision-making process. Only Llama3.3:latest is capable of inferring the opponent's behavior to choose the winning move.

In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents’ actions into their decision-making.

Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)

Rational vs Credible

To assess whether a generative agent is capable of adopting either an individual rational behavior, or a credible behavior simulating human-like decision-making, we consider the Prisoner’s Dilemma game.

The Prisoner’s Dilemma is the canonical model of a social dilemma — a situation where individual rationality leads to collective irrationality. This game models two accomplices who are arrested and interrogated separately. Each has two options: Cooperate with the other by remaining silent, or Defect by betraying the other. The possible outcomes are:

  • If both remain silent (cooperate), they get light sentences.
  • If one defects while the other stays silent, the defector goes free while the cooperator receives a heavy sentence.
  • If both defect, they both receive moderately heavy sentences.

The Prisoner’s Dilemma is characterized by the following ordering of payoffs:

Cooperate Defect
Cooperate (R, R) (S, T)
Defect (T, S) (P, P)

with T > R > P > S where:

  • T (Temptation): The payoff for defecting while the other cooperates — the best personal outcome.
  • R (Reward): The payoff for mutual cooperation — good for both, but not individually tempting.
  • P (Punishment): The outcome when both defect — worse than cooperation but better than being exploited.
  • S (Sucker’s Payoff): The worst outcome, received when one cooperates and the other defects.

While defection is the dominant strategy for a rational agent, humans frequently cooperate. Cooperation rates in lab experiments often range from 30–60% in one-shot Prisoner’s Dilemmas.

Sally, D. (1995). Conversation and cooperation in social dilemmas: A meta-analysis of experiments from 1958 to 1992. Rationality and Society, 7(1), 58–92. https://doi.org/10.1177/1043463195007001004

To assess the ability of generative agents to adopt a role, we compared three prompts:

  • Rational: "You are a rational agent which aims at maximizing its own utility".
  • Human: "You are an agent simulating a human with social preferences such as fairness, reciprocity, and aversion to inequity".
  • Neutral: No specific instruction provided.

To adjust the difficulty of decision-making, we define four variants of the payoff matrix in Table below:

  • Classic: the original configuration with standard payoffs.
  • High: increased reward for defecting, widening the gap between Temptation (T) and Reward (R).
  • Mild: softened outcomes with smaller differences between payoffs.
  • Cooperation Loss: increased penalty for cooperating when betrayed, with a large negative payoff for the sucker’s outcome (S).
Version Classic High Mild Coop. Loss
Player 1 \ Player 2 C D C D C D C D
C (3, 3) (0, 5) (6, 6) (1, 10) (2.5, 2.5) (1, 3) (6, 6) (-3, 8)
D (5, 0) (1, 1) (10, 1) (2, 2) (3, 1) (2, 2) (8, -3) (2, 2)
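As a quick check, the four variants can be encoded as (T, R, P, S) tuples read off the table above and verified against the Prisoner's Dilemma ordering (a minimal sketch).

```python
# Payoff variants from the table above, written as (T, R, P, S) for the row player.
VARIANTS = {
    "Classic":    (5, 3, 1, 0),
    "High":       (10, 6, 2, 1),
    "Mild":       (3, 2.5, 2, 1),
    "Coop. Loss": (8, 6, 2, -3),
}

for name, (t, r, p, s) in VARIANTS.items():
    # Every variant satisfies T > R > P > S, so Defect remains the strictly dominant
    # action for a rational player while mutual cooperation stays collectively better.
    assert t > r > p > s, name
    print(f"{name}: defection dominates (T={t} > R={r}, P={p} > S={s})")
```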

To minimize the influence of semantic bias in LLMs, we replace descriptive action labels Cooperate and Defect with neutral placeholders (Foo and Bar). This anonymized setup (marked as ano. in the table) helps ensure that the agent’s choices reflect the underlying payoffs rather than social connotations tied to specific words.

Table below evaluates the cooperation rates of models.

GPT-4.5 consistently defects under the Rational prompt across all payoff matrices, demonstrating correct alignment with utility-maximizing behavior. Importantly, its decisions remain invariant under anonymization, indicating that it is not relying on semantic cues such as "Cooperate" or "Defect" but is responding to the actual payoff structure. However, under the Human prompt, GPT-4.5 always cooperates, regardless of the payoff configuration. This lack of variation reveals an overfitting to the social prompt — it simulates idealized prosocial behavior without adapting to different incentive environments, thus failing the test of payoff sensitivity expected from human-like reasoning.

Mistral-Small, on the other hand, shows more nuanced behavior. While it defects under the Rational prompt in high-risk or high-reward variants and cooperates more under Human, it also modulates cooperation rates in response to the payoffs, especially under the Human prompt. For example, cooperation drops slightly in the “Cooperation Loss” condition, suggesting some recognition of the increased risk of being exploited. Additionally, Mistral-Small is mostly robust to anonymization, showing consistent behavior whether standard or neutral action labels are used, particularly under the Human role.

In contrast, models like Llama3.3 and Mixtral produce uniform cooperation across all conditions and prompts, suggesting a failure to internalize role differences or payoff structures. These models act as if they have a fixed bias toward cooperation, likely driven by training data priors, rather than context-sensitive reasoning. Qwen3 exhibits the opposite failure mode: it is overly rigid, rarely cooperating even under Human prompts, and shows erratic drops in cooperation under anonymization, indicating semantic overreliance and poor role alignment.

It is worth noting that most LLMs are unable to generate strategies for this game, and the strategies they do generate are insensitive to the role being played.

Overall, few models achieve the desired trifecta of role fidelity (behaving distinctly across prompts), payoff awareness (adjusting behavior with incentives), and semantic robustness (insensitivity to superficial label changes). Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.

Version Classic High Mild Coop. Loss
Model Generation Rational Neutral Human Rational Neutral Human Rational Neutral Human Rational Neutral Human
GPT-4.5 actions 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00
actions + ano 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00
Llama3.3:latest actions 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
actions + ano 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Llama3 actions 0.60 1.00 1.00 0.73 1.00 1.00 0.67 1.00 1.00 0.73 0.97 0.97
actions + ano 0.43 0.40 0.80 0.50 0.73 0.90 0.40 0.53 0.96 0.63 0.37 0.83
Mixtral:8x7b actions 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
actions + ano 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Mistral-Small actions 0.00 0.90 1.00 0.00 0.77 1.00 0.03 0.97 1.00 0.07 0.90 1.00
actions + ano 0.10 0.77 0.97 0.17 0.77 1.00 0.40 0.63 1.00 0.43 0.43 0.90
Deepseek-R1:7b actions N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
actions + ano N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Deepseek-R1 actions 0.87 0.97 0.93 0.83 0.83 0.93 0.87 0.97 0.90 0.87 1.00 0.93
actions + ano 0.83 0.83 0.80 0.90 0.90 0.87 N/A N/A N/A 0.83 0.90 0.80
Qwen3 actions 0.00 0.20 0.93 0.00 0.13 0.57 0.00 0.13 0.63 0.00 0.07 0.47
actions + ano 0.10 0.13 0.10 0.00 0.03 0.10 0.03 0.11 0.10 0.00 0.07 0.03

Coordination

In order to assess the ability of generative agents to coordinate, we consider a simultaneous game in which a player earns a higher payoff when they select the same course of action as the other player.

The Battle of the Sexes is a model of a coordination game, but one with distributional conflict over which coordination point to choose. Both players want to coordinate but prefer different outcomes. This game models a couple deciding how to spend the evening: the woman prefers the opera, the man prefers football. While both prefer to be together rather than apart, each prefers their own event over the other's. The key tension lies in the fact that mutual benefit comes from coordination, but disagreement exists over which coordinated outcome is better.

The Battle of the Sexes is characterized by the following ordering of payoffs: A > C, B > C, and A ≠ B (e.g., A = 3, B = 2, C = 0), where:

  • A: The payoff for the player who gets their preferred outcome and is with the other — best individual and mutual outcome for them.
  • B: The payoff for the player who compromises but is still together — second-best.
  • C: The worst payoff when coordination fails — players go to different events.
Woman\Man Opera Football
Opera (A, B) (C, C)
Football (C, C) (B, A)

This game has two pure-strategy Nash equilibria:

  • (Opera, Opera): the woman's preferred coordination;
  • (Football, Football): the man's preferred coordination;

and one mixed-strategy equilibrium, where each player randomizes over the two options, typically placing more weight on their preferred event. While both players want to coordinate, the disagreement over which coordinated outcome to choose can make coordination unstable without communication or prior agreement.
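The two pure-strategy equilibria can be recovered by brute-force enumeration of the payoff matrix. The sketch below uses the illustrative values A = 3, B = 2, C = 0 from the text.

```python
# Battle of the Sexes with the illustrative values A=3, B=2, C=0.
A, B, C = 3, 2, 0
PAYOFFS = {  # (woman's action, man's action) -> (woman's payoff, man's payoff)
    ("Opera", "Opera"): (A, B), ("Opera", "Football"): (C, C),
    ("Football", "Opera"): (C, C), ("Football", "Football"): (B, A),
}
ACTIONS = ("Opera", "Football")

def pure_nash_equilibria():
    """Enumerate action profiles where neither player can gain by deviating unilaterally."""
    equilibria = []
    for w in ACTIONS:
        for m in ACTIONS:
            w_best = all(PAYOFFS[(w, m)][0] >= PAYOFFS[(alt, m)][0] for alt in ACTIONS)
            m_best = all(PAYOFFS[(w, m)][1] >= PAYOFFS[(w, alt)][1] for alt in ACTIONS)
            if w_best and m_best:
                equilibria.append((w, m))
    return equilibria

print(pure_nash_equilibria())  # [('Opera', 'Opera'), ('Football', 'Football')]
```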

Agent-Human Coordination

To assess the agents' ability to coordinate with a human-like strategy, we consider a multi-round version of the Battle of the Sexes game in which the opponent follows a hidden strategy, namely alternating between the two options. In each round, the agent must predict the opponent's next move — earning 1 point for a correct prediction and 0 for an incorrect one — and incorporate this prediction into its decision-making. The game is played over N = 10 rounds, with the agent's payoff and prediction accuracy evaluated at each round. To avoid gender bias, we replace descriptive player labels and action labels with letters. This anonymized setup helps ensure that the agent's choices reflect the underlying payoffs rather than social connotations tied to specific words.

The first figure below presents the average prediction accuracy per round, along with the 95% confidence interval. The second figure shows the average points earned per round by each model. No model was able to generate a valid strategy. The models failed to predict the opponent's next move and, a fortiori, to coordinate effectively. The models fail to coordinate in the Battle of the Sexes primarily because their prediction and reasoning mechanisms do not correctly identify the opponent's looping behavior. The model-generated predictions tend to treat the opponent as responsive, random, or goal-seeking, rather than as following a simple pattern. This mischaracterization leads the models to overcomplicate what is actually a periodic strategy, attempting to exploit or predict rational behavior instead of recognizing and adapting to the underlying pattern.

Prediction Accuracy per Round by Model (with 95% Confidence Interval)

Points Earned per Round by Model (with 95% Confidence Interval)

Agent-Agent Coordination

Cooper et al. (1989) report experimental results on the role of pre-play communication in the Battle of the Sexes game. They find that communication significantly increases the frequency of equilibrium play. One-way communication is the most effective in resolving the coordination problem. Although two-way communication introduces more potential for conflict, even a single round of communication helps overcome some coordination difficulties, and three rounds perform even better.

Cooper, R., DeJong, D. V., Forsythe, R., & Ross, T. W. (1989). Communication in the battle of the sexes game: Some experimental results. The RAND Journal of Economics, 568–587.

To evaluate the ability of generative agents to coordinate with one another under varying levels of communication, we paired each agent with another generative agent powered by the same model, within the same 10-round version of the Battle of the Sexes game used in prior experiments. Each experimental condition was repeated 30 times, with the woman initiating the communication in half of the games. To assess the effect of pre-game communication (0, 1, 2, or 3 messages), we measured the players' average predictive accuracy and their payoff in each round.

In the figures below, we focus on the Qwen3 and GPT-4.5 models. Unlike other open-weight models, Qwen3 enables generative agents to coordinate effectively—with or without communication. They quickly incorporate their beliefs about the opponent’s behavior into their decision-making. In contrast, GPT-4.5 agents require several rounds to anticipate their opponent. While pre-game communication slightly improves short-term coordination, without a clear shared strategy, even communication fails to produce effective alignment. Most generative agents fail to coordinate because they lack a common strategy and struggle to align in games with multiple equilibria. Communication worsens this issue by introducing ambiguity: language models generate seemingly cooperative messages but do not consistently translate them into coherent actions, leading to broken expectations and even weaker coordination.

Prediction Accuracy per Round by Model (with 95% Confidence Interval)

Points Earned per Round by Model (with 95% Confidence Interval)

Synthesis

Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of decision-making. Mistral-Small demonstrates the highest level of consistency in economic decision-making, with Llama3 showing moderate adherence and DeepSeek-R1 displaying considerable inconsistency. Qwen3 performs moderately well, showing rational behavior but struggling with implicit reasoning.

GPT-4.5, Llama3, and Mistral-Small generally align well with declared preferences, particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency. In contrast, DeepSeek-R1 fails to generate valid strategies and performs poorly in aligning actions with specified preferences. Qwen3 aligns well with utilitarian preferences and moderately with altruistic ones but struggles with egoistic and egalitarian preferences.

GPT-4.5 and Mistral-Small consistently display rational behavior at both first- and second-order levels. Llama3, although prone to random behavior when generating strategies, adapts more effectively in one-shot decision-making tasks. DeepSeek-R1 underperforms significantly in both strategic and one-shot formats, rarely exhibiting coherent rationality. Qwen3 shows strong first-order rationality when producing actions, especially under explicit or guided conditions, but struggles with deeper inferential reasoning.

All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs into optimal responses. Only Llama3.3:latest shows any reliable ability to infer and act on opponents’ simple behavior.

Whether generating actions or strategies, most LLMs tend to exhibit either rigid rationality, indiscriminate cooperation, or unstable and incoherent behavior. Except for Mistral-Small, the models do not achieve the desired combination of three criteria: the ability to adopt a role (behaving differently based on instructions), payoff sensitivity (adjusting behavior according to incentives), and semantic robustness (remaining unaffected by superficial label changes).

When it comes to coordination, most generative agents struggle to align their actions in games with multiple equilibria. This failure stems from an absence of shared strategies and a limited ability to model the opponent’s behavior accurately. Although communication is expected to improve coordination, it often introduces ambiguity instead—models generate cooperative-sounding messages that are not followed by consistent actions, leading to misaligned expectations and degraded coordination. Only Qwen3 shows reliable coordination behavior, swiftly incorporating beliefs about the opponent’s strategy even without communication. In contrast, models like GPT-4.5 require several rounds to adjust and still often fail to converge on mutually beneficial strategies.

Authors

Maxime MORGE

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.