PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.

This prototype explores the potential of homo silicus for social simulation. We examine the behaviour exhibited by intelligent machines, particularly how generative agents deviate from the principles of rationality. To assess their responses to simple human-like strategies, we employ a series of tightly controlled and theoretically well-understood games. Through behavioral game theory, we evaluate the ability of GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 to make coherent one-shot decisions, generate algorithmic strategies based on explicit preferences, adhere to first- and second-order rationality principles, and refine their beliefs in response to other agents' behaviours.

Economic Rationality

To evaluate the economic rationality of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game’s parameters.

In this game, an investor allocates a basket $x_t = (x^A_t, x^B_t)$ of 100 points between two assets: Asset A and Asset B. The value of these points depends on random prices $p_t = (p^A_t, p^B_t)$, which determine the monetary return per allocated point. For example, if $p^A_t = 0.8$ and $p^B_t = 0.5$, each point assigned to Asset A is worth $0.8, while each point allocated to Asset B yields $0.5. The game is played 25 times to assess the consistency of the investor's decisions.

To evaluate the rationality of the decisions, we use Afriat's critical cost efficiency index (CCEI), a widely used measure in experimental economics. The CCEI assesses whether choices adhere to the generalized axiom of revealed preference (GARP), a fundamental principle of rational decision-making. If an individual violates rational choice consistency, the CCEI determines the minimal budget adjustment required to make their decisions align with rationality. Mathematically, the budget for each basket is calculated as $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is derived from the observed decisions by solving an optimization problem that finds the largest $\lambda$, with $0 \leq \lambda \leq 1$, such that for every observation the adjusted decisions satisfy the rationality constraint $p_t \cdot x_t \leq \lambda I_t$. In other words, if we slightly reduce the budget by multiplying it by $\lambda$, the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory; a low CCEI suggests irrational or inconsistent decision-making. In their 2007 study on portfolio choices, Choi et al. found that participants exhibited a high degree of rationality, with average CCEI values around 0.95: Choi, S., Fisman, R., Gale, D., & Kariv, S. (2007). Consistency and heterogeneity of individual behavior under uncertainty. American Economic Review, 97(5), 1921–1938.
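In practice, the CCEI can be computed from the game logs by a binary search over $\lambda$ combined with a GARP check on the revealed-preference relation. The sketch below is an illustrative reimplementation under that approach (it assumes NumPy arrays of prices and allocations and is not the repository's exact code).

```python
import numpy as np

def satisfies_garp(prices, bundles, e):
    """Check whether choices satisfy GARP when every budget is scaled by e."""
    expenditure = np.einsum("ti,ti->t", prices, bundles)   # I_t = p_t . x_t
    cross = np.einsum("ti,si->ts", prices, bundles)        # cross[t, s] = p_t . x_s
    # x_t is directly revealed preferred to x_s if x_s was affordable: e * I_t >= p_t . x_s
    relation = e * expenditure[:, None] >= cross
    closure = relation.copy()
    for k in range(len(prices)):                           # boolean Floyd-Warshall transitive closure
        closure |= closure[:, [k]] & closure[[k], :]
    strictly = e * expenditure[:, None] > cross            # strictly[s, t]: e * I_s > p_s . x_t
    # GARP violation: x_t (indirectly) preferred to x_s while x_s strictly preferred to x_t
    return not np.any(closure & strictly.T)

def ccei(prices, bundles, tol=1e-4):
    """Largest e in [0, 1] such that the e-adjusted budgets satisfy GARP (binary search)."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy run: three consistent allocations of 100 points under different prices.
prices = np.array([[0.8, 0.5], [0.5, 0.8], [1.0, 1.0]])
bundles = np.array([[30.0, 70.0], [70.0, 30.0], [50.0, 50.0]])
print(ccei(prices, bundles))   # 1.0
```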

To ensure response consistency, each model undergoes 30 iterations of the game with a fixed temperature of 0.0. The results shown in the figure below highlight significant differences in decision-making consistency among the evaluated models. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b stand out with a perfect CCEI score of 1.0, indicating flawless rationality in decision-making. Qwen3, Mistral-Small, and Mixtral:8x7b demonstrate the next highest level of rationality. Llama3 performs moderately well, with CCEI values ranging between 0.2 and 0.74. DeepSeek-R1 exhibits inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83.

CCEI Distribution per model

Preferences

To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role.

First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we subject them to specific instructions incorporating preferences to assess their ability to consider them in their decisions.

Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM is asked to directly produce a one-shot action in the dictator game. Additionally, we also ask the models to generate a strategy in the form of an algorithm implemented in the Python language. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to 0.7.
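Concretely, a single elicitation run amounts to repeatedly querying the model at temperature 0.7 and parsing the split it proposes. The sketch below assumes the open-weight models are served locally through the Ollama Python client (`ollama.chat`); the prompt wording and the `parse_share` helper are illustrative, not the repository's exact code, and GPT-4.5 would instead be queried through the OpenAI API.

```python
import re
import ollama  # assumes a local Ollama server with the model already pulled

PROMPT = (
    "You are the dictator in a dictator game. You must split $100 between "
    "yourself and a passive recipient. Reply only with the amount you keep."
)

def parse_share(text):
    """Extract the first number from the model's reply (illustrative helper)."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

def elicit_one_shot(model="llama3", n_runs=30, temperature=0.7):
    """Repeat the one-shot dictator decision n_runs times and collect the kept shares."""
    shares = []
    for _ in range(n_runs):
        reply = ollama.chat(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            options={"temperature": temperature},
        )
        shares.append(parse_share(reply["message"]["content"]))
    return shares

print(elicit_one_shot())
```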

Figure below presents a violin plot illustrating the share of the total amount ($100) that the dictator allocates to themselves for each model. Notably, human participants under similar conditions typically keep around $80 on average: Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6(3), 347–369. https://doi.org/10.1006/game.1994.1021

The median share taken by GPT-4.5, Llama3, Mistral-Small, DeepSeek-R1, and Qwen3 through one-shot decisions is $50, likely due to corpus-based biases such as term frequency. The median share taken by Mixtral:8x7b and Llama3.3:latest is $60. When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except GPT-4.5, which retains about 70% of the total amount. Interestingly, these shares fall short of the $80 that humans typically keep under these conditions. When the role assigned to the model is that of a human rather than an assistant agent, only Llama3 deviates, with a median share of $60. Unlike the deterministic strategies generated by LLMs, the intra-model variability in generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts.

Violin Plot of My Share for Each Model

Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of available options.

My Share vs Temperature with Confidence Interval

Preference alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

  1. Egoism maximizes the dictator’s income.
  2. Altruism maximizes the recipient’s income.
  3. Utilitarianism maximizes total income.
  4. Egalitarianism maximizes the minimum income between the players.

We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences:

  • The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (egoistic).
  • The dictator keeps $100, the recipient receives $500, and $400 is lost (altruistic).
  • The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (utilitarian).
  • The dictator keeps $325, the other player receives $325, and $350 is lost (egalitarian).
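A strategy that is perfectly aligned with each preference amounts to a short maximization over these four options. The sketch below is a minimal illustration of what such a generated strategy would do; it is not the models' verbatim output.

```python
# The four allocation options above, as (dictator, recipient) amounts.
OPTIONS = {
    "egoistic":    (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# One social-welfare function per preference.
WELFARE = {
    "egoistic":    lambda d, r: d,          # dictator's income
    "altruistic":  lambda d, r: r,          # recipient's income
    "utilitarian": lambda d, r: d + r,      # total income
    "egalitarian": lambda d, r: min(d, r),  # minimum income
}

def dictator_choice(preference):
    """Pick the allocation that maximizes the welfare function of the given preference."""
    return max(OPTIONS.values(), key=lambda alloc: WELFARE[preference](*alloc))

for pref in WELFARE:
    print(pref, dictator_choice(pref))
# egoistic (500, 100), altruistic (100, 500), utilitarian (400, 300), egalitarian (325, 325)
```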

Table below evaluates the ability of the models to align with different preferences.

  • When generating strategies, the models align perfectly with preferences, except for
    • DeepSeek-R1 and Mixtral:8x7b, which do not generate valid code;
    • Qwen3, which fails to adopt egoistic or altruistic strategies but adheres to utilitarian and egalitarian preferences.
  • When generating actions,
    • GPT-4.5 aligns well with preferences but struggles with utilitarianism.
    • Llama3 aligns well with egoistic and altruistic preferences but shows lower adherence to utilitarian and egalitarian choices.
    • Mistral-Small aligns better with altruistic preferences and performs moderately on utilitarianism but struggles with egoistic and egalitarian preferences.
    • DeepSeek-R1 primarily aligns with utilitarianism but has low accuracy in other preferences.
    • Qwen3 strongly aligns with utilitarian preferences and moderately with altruistic ones (0.80), but fails to exhibit egoistic behavior and shows weak alignment with egalitarianism.

While a larger LLM typically aligns better with preferences, a model like Mixtral-8x7B may occasionally underperform compared to its smaller counterpart, Mistral-Small, due to its architectural complexity. Mixture-of-Experts (MoE) models, like Mixtral, dynamically activate only a subset of their parameters. If the routing mechanism is not well tuned, it may select suboptimal experts, leading to degraded performance.

Model Generation Egoistic Altruistic Utilitarian Egalitarian
GPT-4.5 Strategy 1.00 1.00 1.00 1.00
Llama3.3:latest Strategy 1.00 1.00 1.00 1.00
Llama3 Strategy 1.00 1.00 1.00 1.00
Mixtral:8x7b Strategy - - - -
Mistral-Small Strategy 1.00 1.00 1.00 1.00
DeepSeek-R1:7b Strategy 1.00 1.00 1.00 1.00
DeepSeek-R1 Strategy - - - -
Qwen3 Strategy 0.00 0.00 1.00 1.00
GPT-4.5 Actions 1.00 1.00 0.50 1.00
Llama3.3:latest Actions 1.00 1.00 0.43 0.96
Llama3 Actions 1.00 0.90 0.40 0.73
Mixtral:8x7b Actions 0.00 0.00 0.30 1.00
Mistral-Small Actions 0.40 0.94 0.76 0.16
DeepSeek-R1:7b Actions 0.46 0.56 0.66 0.90
DeepSeek-R1 Actions 0.06 0.20 0.76 0.03
Qwen3 Actions 0.00 0.80 0.93 0.36

Errors in action selection may stem from either arithmetic miscalculations (e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or misinterpretations of preferences. For example, DeepSeek-R1, asked to adopt utilitarian preferences, justifies its choice by stating, "I think fairness is key here".

In summary, our results indicate that the models GPT-4.5, Llama3, and Mistral-Small generally align well with preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, DeepSeek-R1 does not generate valid strategies and performs poorly when generating specific actions.

Social preference

To analyze the behavior of generative agents based on their preferences under strategic interaction, we rely on the ultimatum game. In this game, the proposer (analogous to the dictator) is tasked with deciding how to divide an endowment (e.g., a sum of money) between themselves and a second player, the responder. However, unlike in the dictator game, the responder plays an active role: they can either accept or reject the proposed allocation. If the offer is rejected, both players receive nothing.

Firstly, we evaluate the choices made by LLMs when playing the role of the proposer, interpreting these decisions as a reflection of their implicit social norms or strategic preferences, especially when anticipating potential rejection by the responder. Oosterbeek et al. find that on average the proposer offers 40% of the pie to the responder. Oosterbeek, H., Sloof, R., & Van De Kuilen, G. (2004). Cultural differences in ultimatum game experiments: Evidence from a meta-analysis. Experimental Economics, 7, 171–188. https://doi.org/10.1023/B:EXEC.0000026978.14316.74

The figure below presents a violin plot illustrating the share of the total amount ($100) that the proposer allocates to themselves for each model. The share selected by strategies generated by Llama3, Mistral-Small, and Qwen3 aligns with the median share chosen by actions generated by the models Mistral-Small, Mixtral:8x7B, and DeepSeek-R1:7B, around $50 — likely reflecting corpus-based biases, such as term frequency. The share selected by strategies generated by Llama3.3 and DeepSeek-R1:7B resembles the median share in the actions generated by GPT-4.5 and Llama3, around $60, which is consistent with what human participants typically choose under similar conditions. While the shares selected by strategies from GPT-4.5 and Mixtral:8x7B are respectively overestimated and underestimated, the actions generated by DeepSeek-R1:7B and Qwen3 can be considered irrational.

Violin Plot of My Share for Each Model

Secondly, we analyze the behavior of LLMs when assuming the role of the responder, focusing on whether their acceptance or rejection of offers reveals a human-like sensitivity to unfairness. The meta-analysis by Oosterbeek et al. (2004) reports that human participants reject 16% of offers that amount to 40% of the total stake. This finding suggests that factors beyond purely economic self-interest—such as fairness concerns or the desire to punish perceived injustice—significantly influence decision-making.

The figure below presents a violin plot illustrating the acceptance rate of the responder for each model when offered $40 out of $100. While GPT-4.5, Llama3, Llama3.3, Mixtral:8x7B, Deepseek-R1:7B, and Qwen3 exhibit a rational median acceptance rate of 1.0, Mistral-Small and Deepseek-R1 display an irrational median acceptance rate of 0.0.

It is worth noting that these results are not necessarily compliant with the strategies generated by the models. For instance, GPT-4.5 accepts offers as low as 20%, interpreting them as minimally fair, while Mistral-Small employs a tiered strategy that only consistently accepts offers of 50% or more, and randomly accepts those between 25% and 49%. Models like Llama3, Deepseek-R1, and Qwen3 exhibit rigid fairness thresholds, rejecting any offer below 50%. Llama3.3 uses a slightly more permissive threshold of 30%, leading to greater acceptance at lower offers. These results suggest that most LLMs do not capture the influence of perceived injustice that shapes human decision-making in the ultimatum game.

Violin Plot of Acceptance Rate for Each Model

Strategic Rationality

An autonomous agent acts strategically, considering not only its own preferences but also the potential actions and preferences of others. It is strategically rational if it chooses the optimal action based on its beliefs. This agent satisfies second-order rationality if it is rational and believes that other agents are rational. In other words, a second-order rational agent does not only consider the best choice for itself but also anticipates how others make their decisions. Experimental game theory studies show that 93% of human subjects are rational, while 71% exhibit second-order rationality.

Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6(3), 347–369. https://doi.org/10.1006/game.1994.1021

To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below:

Player 1 \ Player 2 Strategy A Strategy B
Strategy X (15,10) (5,5)
Strategy Y (0,5) (10,0)

If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X. To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.
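The dominance and best-response reasoning above can be checked mechanically. The sketch below encodes the payoff matrix with hypothetical Python helpers (`dominated_action_for_player2`, `best_response_player1`) and verifies both rationality conditions; it is an illustration of the game's logic, not the evaluation harness itself.

```python
# Payoff matrix of the simplified ring-network game:
# (Player 1 action, Player 2 action) -> (payoff of Player 1, payoff of Player 2)
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

def dominated_action_for_player2():
    """Return Player 2's strictly dominated action, if any."""
    for a in ("A", "B"):
        for b in ("A", "B"):
            if a != b and all(PAYOFFS[(x, a)][1] < PAYOFFS[(x, b)][1] for x in ("X", "Y")):
                return a
    return None

def best_response_player1(belief_about_player2):
    """Player 1's best response given a belief about Player 2's action."""
    return max(("X", "Y"), key=lambda x: PAYOFFS[(x, belief_about_player2)][0])

# First-order rationality: Player 2 never plays the strictly dominated action B.
assert dominated_action_for_player2() == "B"
# Second-order rationality: believing Player 2 is rational (plays A), Player 1 plays X.
assert best_response_player1("A") == "X"
```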

We consider three types of beliefs:

  • an implicit belief, where the optimal action must be deduced from
    the natural language description of the payoff matrix;
  • an explicit belief, based on the analysis of player 2's actions, meaning that the fact that B is strictly dominated by A is provided in the prompt;
  • a given belief, where the optimal action for player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.

First-Order Rationality

Table below evaluates the models’ ability to generate rational behaviour for Player 2.

Model Generation Given Explicit Implicit
GPT-4.5 strategy 1.00 1.00 1.00
Mixtral:8x7b strategy 1.00 1.00 1.00
Mistral-Small strategy 1.00 1.00 1.00
Llama3.3:latest strategy 1.00 1.00 0.50
Llama3 strategy 0.50 0.50 0.50
Deepseek-R1:7b strategy - - -
Deepseek-R1 strategy - - -
Qwen3 strategy 0.00 0.00 0.00
GPT-4.5 actions 1.00 1.00 1.00
Mixtral:8x7b actions 1.00 1.00 1.00
Mistral-Small actions 1.00 1.00 0.87
Llama3.3:latest actions 1.00 1.00 1.00
Llama3 actions 1.00 0.90 0.17
Deepseek-R1:7b actions 1.00 1.00 1.00
Deepseek-R1 actions 0.83 0.57 0.60
Qwen3 actions 1.00 0.93 0.50

When generating strategies, GPT-4.5, Mixtral-8x7B, and Mistral-Small exhibit rational behavior, whereas Llama3 chooses at random and Qwen3 is irrational. Llama3.3:latest behaves randomly only with implicit beliefs. DeepSeek-R1:7b and DeepSeek-R1 fail to generate valid strategies. When generating actions, GPT-4.5, Mixtral-8x7B, DeepSeek-R1:7b, and Llama3.3:latest demonstrate strong rational decision-making, even with implicit beliefs. Mistral-Small and Qwen3 perform well but lag in handling implicit reasoning. Llama3 struggles with implicit reasoning, while DeepSeek-R1 shows inconsistent performance. Overall, GPT-4.5 and Mixtral-8x7B are the most reliable models for generating rational behavior.

Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for player 1 in Table below: (a) the original configuration, (b) the reduction of the gap between the gains, (c) the increase in the gain for the bad choice Y, and (d) the decrease in the gain for the good choice X.

Version a b c d
Player 1 \ Player 2 A B A B A B A B
X 15 5 8 7 6 5 15 5
Y 0 10 7 8 0 10 0 40

We introduce a prompt engineering method that incorporates Conditional Reasoning (CR), prompting the model to evaluate an opponent’s optimal response to each of its own possible actions to encourage strategic foresight and informed decision-making.
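For illustration, a CR instruction can be appended to the game description as in the hypothetical template below; the exact wording used in the experiments is not reproduced here.

```python
# Hypothetical conditional-reasoning (CR) prompt template; the experiments' exact wording may differ.
CR_TEMPLATE = (
    "You are Player 1 in the one-shot game described below.\n"
    "{payoff_description}\n"
    "Before answering, reason conditionally: for each action you could take, "
    "work out Player 2's best response, then choose the action that maximizes "
    "your own payoff given that response.\n"
    "Reply with a single action name."
)

prompt = CR_TEMPLATE.format(
    payoff_description="If you play X and Player 2 plays A, you earn 15 and they earn 10; ..."
)
print(prompt)
```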

Table below evaluates the models' ability to generate second-order rational behaviour for player 1. The configurations where CR improves second-order rationality are in bold, and those where CR degrades this rationality are in italics.

When generating strategies, GPT-4.5 consistently exhibits second-order rational behavior in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, showing no strong pattern of rational behavior. In contrast, Mistral-Small and Mixtral-8x7B demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior. Llama3.3:latest performs well with given and explicit beliefs but struggles with implicit beliefs. Qwen3 generates irrational strategies. DeepSeek-R1 does not produce valid responses in strategy generation.

When generating actions, Llama3.3:latest adapts well to different types of beliefs and adjustments in the payoff matrix but struggles with implicit beliefs, particularly in configuration (d). GPT-4.5 performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. Mixtral-8x7B generally performs well but shows reduced accuracy for implicit beliefs in configurations (b) and (d). Mistral-Small performs well with given or explicit beliefs but struggles with implicit beliefs, particularly in configuration (d). DeepSeek-R1:7b, in contrast to its smallest version, performs well across most belief structures but exhibits a slight decline in implicit beliefs, especially in (d). Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for implicit beliefs. Qwen3 performs robustly across most belief types, especially in configurations (a) and (b), maintaining strong scores on both explicit and implicit conditions. However, like other models, it experiences a noticeable drop in accuracy under implicit beliefs in configuration (d), suggesting sensitivity to deeper inferential reasoning.

It is worth noting that CR is not universally beneficial: while it notably improves reasoning in smaller models (like Mistral-Small, Deepseek-R1, and Qwen3), especially under implicit and explicit conditions, it often harms performance in larger models (e.g., GPT-4.5, Llama3.3, or Mixtral:8x7b), where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively — particularly in less confident or under-specified contexts.

Version a b c d
Model Generation Given Explicit Implicit Given Explicit Implicit Given Explicit Implicit Given Explicit Implicit
GPT-4.5 strategy 1.00 1.00 1.00 0.00 0.00 0.00 1.00 1.00 1.00 1.00 1.00 1.00
Llama3.3:latest strategy 1.00 1.00 0.50 1.00 1.00 0.50 1.00 1.00 0.50 1.00 1.00 0.50
Llama3 strategy 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50 0.50
Mixtral:8x7b strategy 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Mistral-Small strategy 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Deepseek-R1:7b strategy - - - - - - - - - - - -
Deepseek-R1 strategy - - - - - - - - - - - -
Qwen3 strategy 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
GPT-4.5 actions 1.00 1.00 1.00 1.00 0.67 0.00 0.86 0.83 0.00 0.50 0.90 0.00
actions + CR 1.00 1.00 1.00 0.10 0.20 0.66 0.23 0.96 0.86 0.03 0.00 0.16
Llama3.3:latest actions 1.00 1.00 1.00 1.00 1.00 0.50 1.00 1.00 0.20 1.00 1.00 0.00
actions + CR 1.00 1.00 0.96 0.96 1.00 0.96 1.00 1.00 0.80 1.00 1.00 0.90
Llama3 actions 0.97 1.00 1.00 0.77 0.80 0.60 0.97 0.90 0.93 0.83 0.90 0.60
actions + CR 0.90 0.90 0.86 0.50 0.50 0.50 0.76 0.96 0.70 0.67 0.83 0.67
Mixtral:8x7b actions 1.00 1.00 1.00 1.00 1.00 0.50 1.00 1.00 1.00 1.00 1.00 0.73
actions + CR 1.00 0.96 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.28
Mistral-Small actions 0.93 0.97 1.00 0.87 0.77 0.60 0.77 0.60 0.70 0.73 0.57 0.37
actions + CR 1.00 0.93 1.00 0.95 0.96 0.90 0.90 0.76 0.43 0.67 0.40 0.37
Deepseek-R1:7b actions 1.00 0.96 1.00 1.00 1.00 0.93 0.96 1.00 0.92 0.96 1.00 0.79
actions + CR 1.00 1.00 1.00 1.00 1.00 1.00 0.90 1.00 1.00 1.00 1.00 1.00
Deepseek-R1 actions 0.80 0.53 0.56 0.67 0.60 0.53 0.67 0.63 0.47 0.70 0.50 0.57
actions + CR 0.80 0.63 0.60 0.67 0.63 0.70 0.67 0.70 0.50 0.63 0.76 0.70
Qwen3 actions 1.00 1.00 1.00 0.90 0.96 1.00 1.00 0.96 0.70 1.00 0.96 0.46
actions + CR 1.00 1.00 1.00 1.00 1.00 1.00 0.96 1.00 1.00 0.96 0.96 0.83

Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the Mistral-Small model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."

In summary, Mixtral-8x7B and GPT-4.5 demonstrate the strongest performance in both first- and second-order rationality, though GPT-4.5 struggles with near-optimal decisions and Mixtral-8x7B has reduced accuracy with implicit beliefs. Mistral-Small also performs well but faces difficulties with implicit beliefs, particularly in second-order reasoning. Llama3.3:latest succeeds with explicit or given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex decision-making. DeepSeek-R1:7b shows strong first-order rationality but its performance declines with implicit beliefs, especially in second-order rationality tasks. In contrast, DeepSeek-R1 and Llama3 exhibit inconsistent and often irrational decision-making, failing to generate valid strategies in many cases. Qwen3 struggles to generate valid strategies, reflecting limited high-level planning. However, it shows strong first-order rationality when producing actions, especially under explicit or guided conditions, and benefits from conditional reasoning. Its performance declines with implicit beliefs, highlighting limitations in deeper inference.

Beliefs - MP

Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.

Refine beliefs

To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider the matching pennies game, played between two players: an agent and an opponent. Each player has a penny and must secretly turn it to Head or Tail. The players then reveal their choices simultaneously. If the pennies match (both Heads or both Tails), the agent wins 1 point. If not, the opponent wins and the agent loses 1 point. The objective is to maximize the total gain of the agent.

In this game:

  • the opponent follows a hidden strategy, i.e., a repetition model;
  • the agent must predict the opponent's next move (Head or Tail);
  • a correct prediction earns 1 point, while an incorrect one earns 0 points;
  • the game can be played for N = 10 rounds, and the agent's accuracy is evaluated at each round.

For our experiments, we consider two simple models for the opponent where:

  • the actions remain constant in the form of Head or Tail, respectively;
  • the actions alternate between the two options (Head-Tail or Tail-Head).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round.
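The two opponent models and the per-round scoring can be sketched as follows. This is a minimal simulation harness under the scoring described above; the `naive` predictor is a hypothetical baseline, not one of the evaluated LLMs.

```python
import itertools

def constant_opponent(move="Head"):
    """Opponent that always plays the same move."""
    while True:
        yield move

def alternating_opponent(start="Head"):
    """Opponent that alternates between Head and Tail every round."""
    first, second = ("Head", "Tail") if start == "Head" else ("Tail", "Head")
    yield from itertools.cycle((first, second))

def play_matching_pennies(predict, opponent, n_rounds=10):
    """Play N rounds; predict(history) returns the agent's guess of the opponent's next move."""
    history, points = [], 0
    for _ in range(n_rounds):
        actual = next(opponent)
        if predict(history) == actual:   # 1 point per correct prediction, 0 otherwise
            points += 1
        history.append(actual)
    return points

# A naive agent that repeats the opponent's last observed move.
naive = lambda history: history[-1] if history else "Head"
print(play_matching_pennies(naive, constant_opponent()))     # 10: the constant move matches the first guess
print(play_matching_pennies(naive, alternating_opponent()))  # 1: always one step behind the alternation
```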

Figures below present the average points earned and prediction accuracy per round (95% confidence interval) for each LLM against the two opponent behavior models (constant and alternating) in the matching pennies game.

Against Constant behavior, GPT-4.5 and Qwen3 ...

Prediction Accuracy per Round by Actions Against Constant Behaviour (with 95% Confidence Interval) Points Earned per Round by Actions Against Constant Behaviour (with 95% Confidence Interval)

Prediction Accuracy per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval) Points Earned per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval)

Beliefs - RPS

Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.

Refine beliefs

To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:

  • the opponent follows a hidden strategy, i.e., a repetition model;
  • the player must predict the opponent's next move (Rock, Paper, or Scissors);
  • a correct prediction earns 1 point, while an incorrect one earns 0 points;
  • the game can be played for N = 10 rounds, and the player's accuracy is evaluated at each round.

For our experiments, we consider three simple models for the opponent where:

  • the actions remain constant in the form of R, S, or P, respectively;
  • the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
  • the opponent's actions follow a three-step loop model (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round.

Figures present the average points earned per round and the 95% confidence interval for each LLM against the three opponent behavior models in a simplified version of the Rock-Paper-Scissors (RPS) game, whether the LLM generates a strategy or one-shot actions.

Neither Llama3, DeepSeek-R1, nor Qwen3 was able to generate a valid strategy. DeepSeek-R1:7b was unable to generate either a valid strategy or consistently valid actions. The strategies generated by the GPT-4.5 and Mistral-Small models attempt to predict the opponent's next move based on previous rounds by identifying the most frequently played move. While these strategies are effective against an opponent with constant behavior, they fail to predict the opponent's next move when the latter adopts a more complex model. We observe that the performance of most LLMs in action generation — except for Llama3.3:latest, Mixtral:8x7b, Mistral-Small, and Qwen3 when facing a constant strategy — is barely better than a random strategy.
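The frequency-based prediction described above can be sketched as follows; it succeeds against a constant opponent but cannot anticipate a looping one. This is an illustrative reconstruction, not the models' verbatim output.

```python
from collections import Counter

def predict_most_frequent(history, default="R"):
    """Predict the opponent's next move as their most frequently played move so far."""
    if not history:
        return default
    return Counter(history).most_common(1)[0][0]

# Against a constant opponent (e.g., always "R"), the prediction is correct from round 2 onward.
# Against a two-step loop (e.g., R-P-R-P...), the frequency count stays tied or tracks the
# majority move, so the predictor never learns the alternation and wins at most about half
# the rounds; against the three-step loop R-P-S it does no better than chance.
```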

Average Points Earned per Round By Strategies Against Constant Behaviour (with 95% Confidence Interval) Average Points Earned per Round By Actions Against Constant Behaviour (with 95% Confidence Interval)

Average Points Earned per Round by Strategies Against 2-Loop Behaviour (with 95% Confidence Interval) Average Points Earned per Round by Actions Against 2-Loop Behaviour (with 95% Confidence Interval)

Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval) Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)

Assess Beliefs

To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.

Figure below illustrates the average points earned per round along with the 95% confidence interval for each LLM facing constant strategies when the model generates one-shot actions. Even though Mixtral:8x7b, Mistral-Small, and Qwen3 accurately predict their opponent's move, they fail to integrate this belief into their decision-making process. Only Llama3.3:latest is capable of inferring the opponent's behavior to choose the winning move.

In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents’ actions into their decision-making.

Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)

Rational vs Credible

To assess whether a generative agent is capable of adopting either an individual rational behavior, or a credible behavior simulating human-like decision-making, we consider the Prisoner’s Dilemma game.

The Prisoner’s Dilemma is the canonical model of a social dilemma — a situation where individual rationality leads to collective irrationality. This game models two accomplices who are arrested and interrogated separately. Each has two options: Cooperate with the other by remaining silent, or Defect by betraying the other. The possible outcomes are:

  • If both remain silent (cooperate), they get light sentences.
  • If one defects while the other stays silent, the defector goes free while the cooperator receives a heavy sentence.
  • If both defect, they both receive moderately heavy sentences.

The Prisoner’s Dilemma is characterized by the following ordering of payoffs:

Cooperate Defect
Cooperate (R, R) (S, T)
Defect (T, S) (P, P)

with T > R > P > S where:

  • T (Temptation): The payoff for defecting while the other cooperates — the best personal outcome.
  • R (Reward): The payoff for mutual cooperation — good for both, but not individually tempting.
  • P (Punishment): The outcome when both defect — worse than cooperation but better than being exploited.
  • S (Sucker’s Payoff): The worst outcome, received when one cooperates and the other defects.

While defection is the dominant strategy for a rational agent, humans frequently cooperate. Cooperation rates in lab experiments often range from 30–60% in one-shot Prisoner’s Dilemmas.

Sally, D. (1995). Conversation and cooperation in social dilemmas: A meta-analysis of experiments from 1958 to 1992. Rationality and Society, 7(1), 58–92. https://doi.org/10.1177/1043463195007001004

To assess the ability of generative agents to adopt a role, we compared three prompts:

  • Rational: "You are a rational agent which aims at maximizing its own utility".
  • Human: "You are an agent simulating a human with social preferences such as fairness, reciprocity, and aversion to inequity".
  • Neutral: No specific instruction provided.

To adjust the difficulty of decision-making, we define four variants of the payoff matrix in Table below:

  • Classic: the original configuration with standard payoffs.
  • High: increased reward for defecting, widening the gap between Temptation (T) and Reward (R).
  • Mild: softened outcomes with smaller differences between payoffs.
  • Cooperation Loss: increased penalty for cooperating when betrayed, with a large negative payoff for the sucker’s outcome (S).
Version Classic High Mild Coop. Loss
Player 1 \ Player 2 C D C D C D C D
C (3, 3) (0, 5) (6, 6) (1, 10) (2.5, 2.5) (1, 3) (6, 6) (-3, 8)
D (5, 0) (1, 1) (10, 1) (2, 2) (3, 1) (2, 2) (8, -3) (2, 2)
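As a quick check, the four variants can be encoded as (T, R, P, S) tuples read off the table above and verified against the Prisoner's Dilemma ordering (a minimal sketch).

```python
# Payoff variants from the table above, written as (T, R, P, S) for the row player.
VARIANTS = {
    "Classic":    (5, 3, 1, 0),
    "High":       (10, 6, 2, 1),
    "Mild":       (3, 2.5, 2, 1),
    "Coop. Loss": (8, 6, 2, -3),
}

for name, (t, r, p, s) in VARIANTS.items():
    # Every variant satisfies T > R > P > S, so Defect remains the strictly dominant
    # action for a rational player while mutual cooperation stays collectively better.
    assert t > r > p > s, name
    print(f"{name}: defection dominates (T={t} > R={r}, P={p} > S={s})")
```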

To minimize the influence of semantic bias in LLMs, we replace descriptive action labels Cooperate and Defect with neutral placeholders (Foo and Bar). This anonymized setup (marked as ano. in the table) helps ensure that the agent’s choices reflect the underlying payoffs rather than social connotations tied to specific words.

Table below evaluates the cooperation rates of models.

GPT-4.5 consistently defects under the Rational prompt across all payoff matrices, demonstrating correct alignment with utility-maximizing behavior. Importantly, its decisions remain invariant under anonymization, indicating that it is not relying on semantic cues such as "Cooperate" or "Defect" but is responding to the actual payoff structure. However, under the Human prompt, GPT-4.5 always cooperates, regardless of the payoff configuration. This lack of variation reveals an overfitting to the social prompt — it simulates idealized prosocial behavior without adapting to different incentive environments, thus failing the test of payoff sensitivity expected from human-like reasoning.

Mistral-Small, on the other hand, shows more nuanced behavior. While it defects under the Rational prompt in high-risk or high-reward variants and cooperates more under Human, it also modulates cooperation rates in response to the payoffs, especially under the Human prompt. For example, cooperation drops slightly in the “Cooperation Loss” condition, suggesting some recognition of the increased risk of being exploited. Additionally, Mistral-Small is mostly robust to anonymization, showing consistent behavior whether standard or neutral action labels are used, particularly under the Human role.

In contrast, models like Llama3.3 and Mixtral produce uniform cooperation across all conditions and prompts, suggesting a failure to internalize role differences or payoff structures. These models act as if they have a fixed bias toward cooperation, likely driven by training data priors, rather than context-sensitive reasoning. Qwen3 exhibits the opposite failure mode: it is overly rigid, rarely cooperating even under Human prompts, and shows erratic drops in cooperation under anonymization, indicating semantic overreliance and poor role alignment.

It is worth noting that most LLMs are unable to generate strategies for this game, and the strategies they do generate are insensitive to the role being played.

Overall, few models achieve the desired trifecta of role fidelity (behaving distinctly across prompts), payoff awareness (adjusting behavior with incentives), and semantic robustness (insensitivity to superficial label changes). Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.

Version Classic High Mild Coop. Loss
Model Generation Rational Neutral Human Rational Neutral Human Rational Neutral Human Rational Neutral Human
GPT-4.5 actions 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00
actions + ano 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00 0.00 0.00 1.00
Llama3.3:latest actions 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
actions + ano 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Llama3 actions 0.60 1.00 1.00 0.73 1.00 1.00 0.67 1.00 1.00 0.73 0.97 0.97
actions + ano 0.43 0.40 0.80 0.50 0.73 0.90 0.40 0.53 0.96 0.63 0.37 0.83
Mixtral:8x7b actions 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
actions + ano 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
Mistral-Small actions 0.00 0.90 1.00 0.00 0.77 1.00 0.03 0.97 1.00 0.07 0.90 1.00
actions + ano 0.10 0.77 0.97 0.17 0.77 1.00 0.40 0.63 1.00 0.43 0.43 0.90
Deepseek-R1:7b actions N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
actions + ano N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A
Deepseek-R1 actions 0.87 0.97 0.93 0.83 0.83 0.93 0.87 0.97 0.90 0.87 1.00 0.93
actions + ano 0.83 0.83 0.80 0.90 0.90 0.87 N/A N/A N/A 0.83 0.90 0.80
Qwen3 actions 0.00 0.20 0.93 0.00 0.13 0.57 0.00 0.13 0.63 0.00 0.07 0.47
actions + ano 0.10 0.13 0.10 0.00 0.03 0.10 0.03 0.11 0.10 0.00 0.07 0.03

Coordination

In order to assess the ability of generative agents to coordinate, we consider a simultaneous game in which a player earns a higher payoff when they select the same course of action as the other player.

The Battle of the Sexes is a model of a coordination game, but one with distributional conflict over which coordination point to choose. Both players want to coordinate but prefer different outcomes. This game models a couple deciding how to spend the evening: the woman prefers the opera, the man prefers football. While both prefer to be together rather than apart, each prefers their own event over the other's. The key tension lies in the fact that mutual benefit comes from coordination, but disagreement exists over which coordinated outcome is better.

The Battle of the Sexes is characterized by the following ordering of payoffs: A > C, B > C, and A ≠ B (e.g., A = 3, B = 2, C = 0), where:

  • A: The payoff for the player who gets their preferred outcome and is with the other — best individual and mutual outcome for them.
  • B: The payoff for the player who compromises but is still together — second-best.
  • C: The worst payoff when coordination fails — players go to different events.
Woman\Man Opera Football
Opera (A, B) (C, C)
Football (C, C) (B, A)

This game has two pure-strategy Nash equilibria:

  • (Opera, Opera): the woman's preferred coordination;
  • (Football, Football): the man's preferred coordination;

and one mixed-strategy equilibrium, where each player randomizes over the two options, typically placing more weight on their preferred event. While both players want to coordinate, the disagreement over which coordinated outcome to choose can make coordination unstable without communication or prior agreement.
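The two pure-strategy equilibria can be recovered by brute-force enumeration of the payoff matrix. The sketch below uses the illustrative values A = 3, B = 2, C = 0 from the text.

```python
# Battle of the Sexes with the illustrative values A=3, B=2, C=0.
A, B, C = 3, 2, 0
PAYOFFS = {  # (woman's action, man's action) -> (woman's payoff, man's payoff)
    ("Opera", "Opera"): (A, B), ("Opera", "Football"): (C, C),
    ("Football", "Opera"): (C, C), ("Football", "Football"): (B, A),
}
ACTIONS = ("Opera", "Football")

def pure_nash_equilibria():
    """Enumerate action profiles where neither player can gain by deviating unilaterally."""
    equilibria = []
    for w in ACTIONS:
        for m in ACTIONS:
            w_best = all(PAYOFFS[(w, m)][0] >= PAYOFFS[(alt, m)][0] for alt in ACTIONS)
            m_best = all(PAYOFFS[(w, m)][1] >= PAYOFFS[(w, alt)][1] for alt in ACTIONS)
            if w_best and m_best:
                equilibria.append((w, m))
    return equilibria

print(pure_nash_equilibria())  # [('Opera', 'Opera'), ('Football', 'Football')]
```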

Agent-Human Coordination

To assess the agents' ability to coordinate with a human-like strategy, we consider a multi-round version of the Battle of the Sexes game in which the opponent follows a hidden strategy, namely alternating between the two options. In each round, the agent must predict the opponent's next move — earning 1 point for a correct prediction and 0 for an incorrect one — and incorporate this prediction into its decision-making. The game is played over N = 10 rounds, with the agent's payoff and prediction accuracy evaluated at each round. To avoid gender bias, we replace descriptive player labels and action labels with letters. This anonymized setup helps ensure that the agent's choices reflect the underlying payoffs rather than social connotations tied to specific words.

The first figure below presents the average prediction accuracy per round, along with the 95% confidence interval. The second figure shows the average points earned per round by each model. No model was able to generate a valid strategy. The models failed to predict the opponent's next move and, a fortiori, to coordinate effectively. The models fail to coordinate in the Battle of the Sexes primarily because their prediction and reasoning mechanisms do not correctly identify the opponent's looping behavior. The model-generated predictions tend to treat the opponent as responsive, random, or goal-seeking, rather than as following a simple pattern. This mischaracterization leads the models to overcomplicate what is actually a periodic strategy, attempting to exploit or predict rational behavior instead of recognizing and adapting to the underlying pattern.

Prediction Accuracy per Round by Model (with 95% Confidence Interval)

Points Earned per Round by Model (with 95% Confidence Interval)

Agent-Agent Coordination

Cooper et al. (1989) report experimental results on the role of pre-play communication in the Battle of the Sexes game. They find that communication significantly increases the frequency of equilibrium play. One-way communication is the most effective in resolving the coordination problem. Although two-way communication introduces more potential for conflict, even a single round of communication helps overcome some coordination difficulties, and three rounds perform even better.

Cooper, R., DeJong, D. V., Forsythe, R., & Ross, T. W. (1989). Communication in the battle of the sexes game: Some experimental results. The RAND Journal of Economics, 568–587.

To evaluate the ability of generative agents to coordinate with one another under varying levels of communication, we paired each agent with another generative agent powered by the same model, within the same 10-round version of the Battle of the Sexes game used in prior experiments. Each experimental condition was repeated 30 times, with the woman initiating the communication in half of the games. To assess the effect of pre-game communication (0, 1, 2, or 3 messages), we measured the players' average predictive accuracy and their payoff in each round.

In the figures below, we focus on the Qwen3 and GPT-4.5 models. Unlike other open-weight models, Qwen3 enables generative agents to coordinate effectively—with or without communication. They quickly incorporate their beliefs about the opponent’s behavior into their decision-making. In contrast, GPT-4.5 agents require several rounds to anticipate their opponent. While pre-game communication slightly improves short-term coordination, without a clear shared strategy, even communication fails to produce effective alignment. Most generative agents fail to coordinate because they lack a common strategy and struggle to align in games with multiple equilibria. Communication worsens this issue by introducing ambiguity: language models generate seemingly cooperative messages but do not consistently translate them into coherent actions, leading to broken expectations and even weaker coordination.

Prediction Accuracy per Round by Model (with 95% Confidence Interval)

Points Earned per Round by Model (with 95% Confidence Interval)

Synthesis

Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of decision-making. Mistral-Small demonstrates the highest level of consistency in economic decision-making, with Llama3 showing moderate adherence and DeepSeek-R1 displaying considerable inconsistency. Qwen3 performs moderately well, showing rational behavior but struggling with implicit reasoning.

GPT-4.5, Llama3, and Mistral-Small generally align well with declared preferences, particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency. In contrast, DeepSeek-R1 fails to generate valid strategies and performs poorly in aligning actions with specified preferences. Qwen3 aligns well with utilitarian preferences and moderately with altruistic ones but struggles with egoistic and egalitarian preferences.

GPT-4.5 and Mistral-Small consistently display rational behavior at both first- and second-order levels. Llama3, although prone to random behavior when generating strategies, adapts more effectively in one-shot decision-making tasks. DeepSeek-R1 underperforms significantly in both strategic and one-shot formats, rarely exhibiting coherent rationality. Qwen3 shows strong first-order rationality when producing actions, especially under explicit or guided conditions, but struggles with deeper inferential reasoning.

All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs into optimal responses. Only Llama3.3:latest shows any reliable ability to infer and act on opponents’ simple behavior.

Whether generating actions or strategies, most LLMs tend to exhibit either rigid rationality, indiscriminate cooperation, or unstable and incoherent behavior. Except for Mistral-Small, the models do not achieve the desired combination of three criteria: the ability to adopt a role (behaving differently based on instructions), payoff sensitivity (adjusting behavior according to incentives), and semantic robustness (remaining unaffected by superficial label changes).

When it comes to coordination, most generative agents struggle to align their actions in games with multiple equilibria. This failure stems from an absence of shared strategies and a limited ability to model the opponent’s behavior accurately. Although communication is expected to improve coordination, it often introduces ambiguity instead—models generate cooperative-sounding messages that are not followed by consistent actions, leading to misaligned expectations and degraded coordination. Only Qwen3 shows reliable coordination behavior, swiftly incorporating beliefs about the opponent’s strategy even without communication. In contrast, models like GPT-4.5 require several rounds to adjust and still often fail to converge on mutually beneficial strategies.

Authors

Maxime MORGE

License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.