PyGAAMAS
Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.
This prototype explores the potential of homo silicus for social simulation. We examine the behaviour exhibited by intelligent machines, particularly how generative agents deviate from the principles of rationality. To assess their responses to simple human-like strategies, we employ a series of tightly controlled and theoretically well-understood games. Through behavioral game theory, we evaluate the ability of GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 to make coherent one-shot decisions, generate algorithmic strategies based on explicit preferences, adhere to first- and second-order rationality principles, and refine their beliefs in response to other agents' behaviours.
Economic Rationality
To evaluate the economic rationality of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game’s parameters.
In this game, an investor allocates a basket x_t=(x^A_t, x^B_t) of 100 points between two assets: Asset A and Asset B. The value of these points depends on random prices p_t=(p_t^A, p_t^B), which determine the monetary return per allocated point. For example, if p_t^A = 0.8 and p_t^B = 0.5, each point assigned to Asset A is worth $0.80, while each point allocated to Asset B yields $0.50. The game is played 25 times to assess the consistency of the investor's decisions.
To evaluate the rationality of these decisions, we use Afriat's critical cost efficiency index (CCEI), a measure widely used in experimental economics. The CCEI assesses whether choices adhere to the generalized axiom of revealed preference (GARP), a fundamental principle of rational decision-making. If an individual violates rational choice consistency, the CCEI measures the minimal budget adjustment required to make their decisions consistent with rationality. The budget for each basket is $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is derived from the observed decisions by solving a linear optimization problem that finds the largest $\lambda$, with $0 \leq \lambda \leq 1$, such that for every observation the adjusted decisions satisfy the rationality constraint $p_t \cdot x_t \leq \lambda I_t$. In other words, if we slightly reduce the budget by multiplying it by $\lambda$, the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory; a low CCEI suggests irrational or inconsistent decision-making.
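As an illustration of how the index can be computed, the sketch below approximates the CCEI by binary search over $\lambda$, checking GARP on the scaled budgets at each step. The data layout and the function names (`satisfies_garp`, `ccei`) are assumptions for exposition, not the project's actual implementation, which may solve the linear program directly.

```python
# Minimal CCEI sketch: prices and bundles are lists of (A, B) tuples, one per round.
import numpy as np

def satisfies_garp(prices, bundles, efficiency):
    """Check GARP when every budget is scaled down by `efficiency` (lambda)."""
    n = len(bundles)
    expenditure = np.array([np.dot(p, x) for p, x in zip(prices, bundles)])
    # Direct revealed preference: x_s R0 x_t if lambda * p_s.x_s >= p_s.x_t
    direct = np.zeros((n, n), dtype=bool)
    for s in range(n):
        for t in range(n):
            direct[s, t] = efficiency * expenditure[s] >= np.dot(prices[s], bundles[t]) - 1e-9
    # Transitive closure (Warshall) gives the revealed preference relation R
    closure = direct.copy()
    for k in range(n):
        for i in range(n):
            closure[i] = closure[i] | (closure[i, k] & closure[k])
    # Violation: x_s R x_t while x_s is strictly cheaper than lambda * I_t at prices p_t
    for s in range(n):
        for t in range(n):
            if closure[s, t] and np.dot(prices[t], bundles[s]) < efficiency * expenditure[t] - 1e-9:
                return False
    return True

def ccei(prices, bundles, tol=1e-4):
    """Largest lambda in [0, 1] such that the adjusted choices satisfy GARP (binary search)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo
```

With 25 observations per game, this check runs in negligible time.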
To ensure response consistency, each model undergoes 30 iterations of the game with a fixed temperature of 0.0. The results, shown in the figure below, highlight significant differences in decision-making consistency among the evaluated models. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b stand out with a perfect CCEI score of 1.0, indicating flawless rationality in decision-making. Mistral-Small and Mixtral:8x7b demonstrate the next highest level of rationality. Llama3 performs moderately well, with CCEI values ranging between 0.2 and 0.74. DeepSeek-R1 exhibits inconsistent behaviour, with CCEI scores varying widely between 0.15 and 0.83.
Preferences
To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role.
First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we subject them to specific instructions incorporating preferences to assess their ability to consider them in their decisions.
Preference Elicitation
Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM is asked to directly produce a one-shot action in the dictator game. We also ask the models to generate a strategy in the form of an algorithm implemented in Python. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to 0.7.
The next figure presents a violin plot illustrating the share of the total amount ($100) that the dictator allocates to themselves for each model. The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 through one-shot decisions is $50, likely due to corpus-based biases such as term frequency. The median share taken by Mixtral:8x7b and Llama3.3:latest is $60. When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except GPT-4.5, which retains about 70% of the total amount. Interestingly, under these standard conditions, humans typically keep $80 on average. When the role assigned to the model is that of a human rather than an assistant agent, only Llama3 deviates, with a median share of $60. Unlike the deterministic strategies generated by LLMs, the intra-model variability in generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts.
Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of available options.
Preference Alignment
We define four preferences for the dictator, each corresponding to a distinct form of social welfare:
- Egoism maximizes the dictator’s income.
- Altruism maximizes the recipient’s income.
- Utilitarianism maximizes total income.
- Egalitarianism maximizes the minimum income between the players.
We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences (a strategy sketch follows the list):
- The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (egoistic).
- The dictator keeps $100, the recipient receives $500, and $400 is lost (altruistic).
- The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (utilitarian).
- The dictator keeps $325, the other player receives $325, and $350 is lost (egalitarian).
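A dictator strategy that aligns with these preferences amounts to maximising the corresponding welfare criterion over the four options. The sketch below is an assumed illustration, not the code generated by any of the evaluated models.

```python
# Illustrative preference-aligned dictator strategy over the four options above.
OPTIONS = {
    "egoistic":    (500, 100),  # (dictator, recipient), $400 lost
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

def choose_allocation(preference):
    """Pick the allocation maximising the social welfare named by `preference`."""
    if preference == "egoistic":
        return max(OPTIONS.values(), key=lambda o: o[0])        # dictator's income
    if preference == "altruistic":
        return max(OPTIONS.values(), key=lambda o: o[1])        # recipient's income
    if preference == "utilitarian":
        return max(OPTIONS.values(), key=lambda o: o[0] + o[1])  # total income
    if preference == "egalitarian":
        return max(OPTIONS.values(), key=lambda o: min(o))       # minimum income
    raise ValueError(f"unknown preference: {preference}")
```

For instance, `choose_allocation("utilitarian")` returns the $400/$300 split, the option with the largest total income despite the $300 loss.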
Table below evaluates the ability of the models to align with different preferences.
- When generating strategies, the models align perfectly with the preferences, except for DeepSeek-R1 and Mixtral:8x7b, which do not generate valid code.
- When generating actions,
- GPT-4.5 aligns well with preferences but struggles with utilitarianism.
- Llama3 aligns well with egoistic and altruistic preferences but shows lower adherence to utilitarian and egalitarian choices.
- Mistral-Small aligns better with altruistic preferences and performs moderately on utilitarianism but struggles with egoistic and egalitarian preferences.
- DeepSeek-R1 primarily aligns with utilitarianism but has low accuracy for the other preferences.

While a larger LLM typically aligns better with preferences, a model like Mixtral-8x7B may occasionally underperform compared to its smaller counterpart, Mistral-Small, because of its architectural complexity. Mixture-of-Experts (MoE) models such as Mixtral dynamically activate only a subset of their parameters; if the routing mechanism is not well tuned, it may select suboptimal experts, leading to degraded performance.
Model | Generation | Egoistic | Altruistic | Utilitarian | Egalitarian |
---|---|---|---|---|---|
GPT-4.5 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Llama3.3:latest | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Llama3 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Mixtral:8x7b | Strategy | - | - | - | - |
Mistral-Small | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
DeepSeek-R1:7b | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
DeepSeek-R1 | Strategy | - | - | - | - |
GPT-4.5 | Actions | 1.00 | 1.00 | 0.50 | 1.00 |
Llama3.3:latest | Actions | 1.00 | 1.00 | 0.43 | 0.96 |
Llama3 | Actions | 1.00 | 0.90 | 0.40 | 0.73 |
Mixtral:8x7b | Actions | 0.00 | 0.00 | 0.30 | 1.00 |
Mistral-Small | Actions | 0.40 | 0.94 | 0.76 | 0.16 |
DeepSeek-R1:7b | Actions | 0.46 | 0.56 | 0.66 | 0.90 |
DeepSeek-R1 | Actions | 0.06 | 0.20 | 0.76 | 0.03 |
Errors in action selection may stem either from arithmetic miscalculations (e.g., the model incorrectly assumes that 500 + 100 > 400 + 300) or from misinterpretations of preferences. For example, DeepSeek-R1, adopting utilitarian preferences, justifies its choice by stating, "I think fairness is key here".
In summary, our results indicate that GPT-4.5, Llama3, and Mistral-Small generally align well with the preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, DeepSeek-R1 does not generate valid strategies and performs poorly when generating specific actions.
Rationality
An autonomous agent is rational if it chooses the optimal action based on its beliefs. It satisfies second-order rationality if it is rational and believes that the other agents are rational. In other words, a second-order rational agent not only considers the best choice for itself but also anticipates how others make their decisions. Experimental game theory studies show that 93% of human subjects are rational, while 71% exhibit second-order rationality.
Forsythe, R., Horowitz, J.L., Savin, N.E., Sefton, M.: Fairness in Simple Bargaining Experiments. Games and Economic Behavior 6(3), 347–369 (1994). https://doi.org/10.1006/game.1994.1021
To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below.
Player 1 \ Player 2 | Strategy A | Strategy B |
---|---|---|
Strategy X | (15,10) | (5,5) |
Strategy Y | (0,5) | (10,0) |
If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X. To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.
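The dominance argument above can be made concrete with a short Python sketch; the payoffs are those of the matrix above, and the helper names are illustrative rather than the project's code.

```python
# Minimal sketch of first- and second-order rationality in the simplified ring-network game.
# Payoffs are (Player 1, Player 2), taken from the matrix above.
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}
P1_ACTIONS, P2_ACTIONS = ["X", "Y"], ["A", "B"]

def undominated_p2_actions():
    """First-order rationality: discard Player 2 actions that are strictly dominated."""
    return [a for a in P2_ACTIONS
            if not any(all(PAYOFFS[(x, b)][1] > PAYOFFS[(x, a)][1] for x in P1_ACTIONS)
                       for b in P2_ACTIONS if b != a)]

def second_order_rational_p1():
    """Second-order rationality: Player 1's best response, assuming Player 2 is rational."""
    expected = undominated_p2_actions()[0]          # here: "A", since B is strictly dominated
    return max(P1_ACTIONS, key=lambda x: PAYOFFS[(x, expected)][0])  # here: "X"
```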
We consider three types of beliefs:
- an implicit belief, where the optimal action must be deduced from the natural language description of the payoff matrix;
- an explicit belief, based on the analysis of Player 2's actions, i.e. the fact that B is strictly dominated by A is provided in the prompt;
- a given belief, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.
First-Order Rationality
Table below evaluates the models’ ability to generate rational behaviour for Player 2.
Model | Generation | Given | Explicit | Implicit |
---|---|---|---|---|
gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 |
mixtral:8x7b | strategy | 1.00 | 1.00 | 1.00 |
mistral-small | strategy | 1.00 | 1.00 | 1.00 |
llama3.3:latest | strategy | 1.00 | 1.00 | 0.50 |
llama3 | strategy | 0.50 | 0.50 | 0.50 |
deepseek-r1:7b | strategy | - | - | - |
deepseek-r1 | strategy | - | - | - |
gpt-4.5 | actions | 1.00 | 1.00 | 1.00 |
mixtral:8x7b | actions | 1.00 | 1.00 | 1.00 |
mistral-small | actions | 1.00 | 1.00 | 0.87 |
llama3.3:latest | actions | 1.00 | 1.00 | 1.00 |
llama3 | actions | 1.00 | 0.90 | 0.17 |
deepseek-r1:7b | actions | 1.00 | 1.00 | 1.00 |
deepseek-r1 | actions | 0.83 | 0.57 | 0.60 |
When generating strategies, GPT-4.5, Mixtral-8x7B, and Mistral-Small exhibit rational behaviour, whereas Llama3 chooses at random. Llama3.3:latest shows the same random behaviour, but only with implicit beliefs. DeepSeek-R1:7b and DeepSeek-R1 fail to generate valid strategies. When generating actions, GPT-4.5, Mixtral-8x7B, DeepSeek-R1:7b, and Llama3.3:latest demonstrate strong rational decision-making, even with implicit beliefs. Mistral-Small performs well but lags slightly in handling implicit reasoning. Llama3 struggles with implicit reasoning, while DeepSeek-R1 shows inconsistent performance. Overall, GPT-4.5 and Mixtral-8x7B are the most reliable models for generating rational behaviour.
Second-Order Rationality
To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for Player 1 in the table below: (a) the original configuration, (b) the reduction of the gap between the gains, (c) the decrease in the gain for the good choice X, and (d) the increase in the gain for the bad choice Y.
Version | a | | b | | c | | d | |
---|---|---|---|---|---|---|---|---|
Player 1 \ Player 2 | A | B | A | B | A | B | A | B |
X | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
Y | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
Table below evaluates the models' ability to generate second-order rational behaviour for player 1.
When generating strategies, GPT-4.5 consistently exhibits second-order rational behavior in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, showing no strong pattern of rational behavior. In contrast, Mistral-Small and Mixtral-8x7B demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior. Llama3.3:latest performs well with given and explicit beliefs but struggles with implicit beliefs. DeepSeek-R1 does not produce valid responses in strategy generation.
When generating actions, Llama3.3:latest adapts well to different types of beliefs and adjustments in the payoff matrix but struggles with implicit beliefs, particularly in configuration (d). GPT-4.5 performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. Mixtral-8x7B generally performs well but shows reduced accuracy for implicit beliefs in configurations (b) and (d). Mistral-Small performs well with given or explicit beliefs but struggles with implicit beliefs, particularly in configuration (d). DeepSeek-R1:7b, in contrast to its smaller counterpart, performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d). Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for implicit beliefs.
Version | | a | | | b | | | c | | | d | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model | Generation | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit |
gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
llama3.3:latest | strategy | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 |
llama3 | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
mixtral:8x7b | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
mistral-small | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
deepseek-r1:7b | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
deepseek-r1 | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
gpt-4.5 | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
llama3.3:latest | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.20 | 1.00 | 1.00 | 0.00 |
llama3 | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
mixtral:8x7b | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.73 |
mistral-small | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
deepseek-r1:7b | actions | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.93 | 0.96 | 1.00 | 0.92 | 0.96 | 1.00 | 0.79 |
deepseek-r1 | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the Mistral-Small model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."
In summary, Mixtral-8x7B and GPT-4.5 demonstrate the strongest performance in both first- and second-order rationality, though GPT-4.5 struggles with near-optimal decisions and Mixtral-8x7B has reduced accuracy with implicit beliefs. Mistral-Small also performs well but faces difficulties with implicit beliefs, particularly in second-order reasoning. Llama3.3:latest succeeds when given explicit or given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex decision-making. DeepSeek-R1:7b shows strong first-order rationality but its performance declines with implicit beliefs, especially in second-order rationality tasks. In contrast, DeepSeek-R1 and Llama3 exhibit inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
Beliefs
Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.
Refine Beliefs
To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repetition model;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for N rounds, and the player's accuracy is evaluated at each round.
For our experiments, we consider three simple models for the opponent where:
- the actions remain constant in the form of R, S, or P, respectively;
- the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
- the opponent's actions follow a three-step loop model (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round. A sketch of these opponent models and of this scoring is given below.
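For illustration, the three opponent models and the per-round scoring can be written as a short Python sketch; the names and the `predict(history)` interface are assumptions for exposition, not the prompts or evaluation code used in the experiments.

```python
import itertools

# Three hidden opponent models: constant, two-step loop, and three-step loop.
def constant_opponent(move="R"):
    return itertools.repeat(move)

def two_step_opponent(loop=("R", "P")):
    return itertools.cycle(loop)

def three_step_opponent(loop=("R", "P", "S")):
    return itertools.cycle(loop)

def evaluate(predict, opponent_moves, rounds=10):
    """Average points per round: 1 point for correctly predicting the opponent's move."""
    history, score = [], 0
    for _, move in zip(range(rounds), opponent_moves):
        if predict(history) == move:   # the predictor only sees past opponent moves
            score += 1
        history.append(move)
    return score / rounds
```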
The figures below present the average points earned per round and the 95% confidence interval for each LLM against the three opponent behaviour models in this simplified version of the Rock-Paper-Scissors (RPS) game, whether the LLM generates a strategy or one-shot actions. Neither Llama3 nor DeepSeek-R1 was able to generate a valid strategy. DeepSeek-R1:7b was unable to generate either a valid strategy or consistently valid actions. The strategies generated by the GPT-4.5 and Mistral-Small models attempt to predict the opponent's next move based on previous rounds by identifying the most frequently played move. While these strategies are effective against an opponent with constant behaviour, they fail to predict the opponent's next move when the latter adopts a more complex model.
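Such a frequency-based predictor can be sketched in a few lines (an illustrative reconstruction, not any model's verbatim output). Plugged into the evaluation loop above, it scores well against a constant opponent but does not anticipate the cyclic patterns.

```python
from collections import Counter

def most_frequent_predictor(history):
    """Predict that the opponent will repeat their most frequently played move so far."""
    if not history:
        return "R"  # arbitrary opening guess
    return Counter(history).most_common(1)[0][0]
```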
We observe that the performance of most LLMs in action generation, except for Llama3.3:latest, Mixtral:8x7b, and Mistral-Small when facing a constant strategy, is barely better than a random strategy.
Assess Beliefs
To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.
The figure below illustrates the average points earned per round along with the 95% confidence interval for each LLM facing constant strategies when the model generates one-shot actions. Even though Mixtral:8x7b and Mistral-Small accurately predict their opponent's move, they fail to integrate this belief into their decision-making process. Only Llama3.3:latest is capable of inferring the opponent's behaviour to choose the winning move.
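Turning an accurate prediction into a win only requires mapping the predicted move to the move that beats it, which is precisely the step most models fail to perform. A minimal sketch with illustrative names:

```python
BEATS = {"R": "P", "P": "S", "S": "R"}  # the move that beats each key

def best_response(predicted_move):
    """Play the move that beats the predicted opponent move."""
    return BEATS[predicted_move]

def payoff(my_move, opponent_move):
    """Score a round: 2 for a win, 1 for a draw, 0 for a loss."""
    if my_move == opponent_move:
        return 1
    return 2 if BEATS[opponent_move] == my_move else 0
```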
In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents’ actions into their decision-making.
Synthesis
Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of decision-making. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b demonstrate the highest level of consistency in economic decision-making, followed by Mistral-Small and Mixtral:8x7b, while Llama3 shows moderate adherence and DeepSeek-R1 displays considerable inconsistency.
GPT-4.5, Llama3, and Mistral-Small generally align well with declared preferences, particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency. In contrast, DeepSeek-R1 fails to generate valid strategies and performs poorly in aligning actions with specified preferences. GPT-4.5 and Mistral-Small consistently display rational behavior at both first- and second-order levels. Llama3, although prone to random behavior when generating strategies, adapts more effectively in one-shot decision-making tasks. DeepSeek-R1 underperforms significantly in both strategic and one-shot formats, rarely exhibiting coherent rationality.
All models, regardless of size or architecture, struggle to anticipate or incorporate the behaviours of other agents into their own decisions. Although some are able to identify patterns, most fail to translate these beliefs into optimal responses. Only Llama3.3:latest shows a reliable ability to infer and act on an opponent's simple behaviour.
Authors
Maxime MORGE
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.