# PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems (PyGAAMAS) aims to evaluate
the social behaviours of LLM-based agents.

This prototype explores the potential of *homo silicus* for social
simulation. We examine the behaviour exhibited by intelligent
machines, particularly how generative agents deviate from
the principles of rationality. To assess their responses to simple human-like
strategies, we employ a series of tightly controlled and theoretically
well-understood games. Through behavioral game theory, we evaluate the ability
of <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and
<tt>DeepSeek-R1</tt> to make coherent one-shot
decisions, generate algorithmic strategies based on explicit preferences, adhere
to first- and second-order rationality principles, and refine their beliefs in
response to other agents’ behaviours.


## Economic Rationality


To evaluate the economic rationality of various LLMs, we introduce an investment game 
designed to test whether these models follow stable decision-making patterns or react 
erratically to changes in the game’s parameters.

In this game, an investor allocates a basket $x_t=(x^A_t, x^B_t)$ of $100$ points between 
two assets: Asset A and Asset B. The value of these points depends on random prices $p_t=(p_{t}^A, p_t^B)$, 
which determine the monetary return per allocated point. For example, if $p_t^A = 0.8$ and $p_t^B = 0.5$,
each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$.
The game is played $25$ times to assess the consistency of the investor's decisions.
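For instance, with this hypothetical price vector, an illustrative allocation of $x_t = (60, 40)$ would return $0.8 \times 60 + 0.5 \times 40 = \$68$.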

To evaluate the rationality of the decisions, we use Afriat's
critical cost efficiency index (CCEI), a widely used measure in
experimental economics. The CCEI assesses whether choices adhere to the
generalized axiom of revealed preference (GARP), a fundamental principle of
rational decision-making. If an individual violates rational choice consistency,
the CCEI determines the minimal budget adjustment required to make their
decisions align with rationality. Mathematically, the budget for each basket is
calculated as: $ I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is
derived from observed decisions by solving a linear optimization problem that
finds the largest $\lambda$, where $0 \leq \lambda \leq 1$, such that for every
observation, the adjusted decisions satisfy the rationality constraint: $p_t
\cdot x_t \leq \lambda I_t$. This means that if we slightly reduce the budget,
multiplying it by $\lambda$, the choices will become consistent with rational
decision-making. A CCEI close to 1 indicates high rationality and consistency
with economic theory, while a low CCEI suggests irrational or inconsistent
decision-making.
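For illustration, the following minimal sketch (an assumption about the implementation, not the exact procedure used in the experiments) approximates the CCEI by a binary search over $\lambda$ combined with a standard GARP check on the scaled budgets:

```python
import numpy as np

def satisfies_garp(prices, bundles, lam=1.0):
    """Check GARP on budgets scaled by lam: basket t is directly revealed
    preferred to basket s if lam * p_t.x_t >= p_t.x_s."""
    T = len(prices)
    direct = np.zeros((T, T), dtype=bool)
    strict = np.zeros((T, T), dtype=bool)
    for t in range(T):
        for s in range(T):
            direct[t, s] = lam * prices[t] @ bundles[t] >= prices[t] @ bundles[s]
            strict[t, s] = lam * prices[t] @ bundles[t] > prices[t] @ bundles[s]
    revealed = direct.copy()
    for k in range(T):            # transitive closure (Floyd-Warshall style)
        revealed |= revealed[:, [k]] & revealed[[k], :]
    # GARP: if t is revealed preferred to s, then s must not be strictly
    # directly revealed preferred to t.
    return not np.any(revealed & strict.T)

def ccei(prices, bundles, tol=1e-4):
    """Largest lambda in [0, 1] such that the scaled data satisfy GARP."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo

# prices and bundles would be arrays of shape (25, 2): one (p_A, p_B)
# and one (x_A, x_B) per round of the investment game.
```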

To ensure response consistency, each model undergoes $30$ iterations of the game
with a fixed temperature of $0.0$. The results shown in
Figure below highlight significant differences in decision-making
consistency among the evaluated models. <tt>GPT-4.5</tt>, <tt>Llama3.3:latest</tt>
and <tt>DeepSeek-R1:7b</tt> stand out with a
perfect CCEI score of 1.0, indicating flawless rationality in decision-making.
<tt>Mistral-Small</tt> and <tt>Mixtral:8x7b</tt> demonstrate the next highest level of rationality. 
<tt>Llama3</tt> performs moderately well, with CCEI values ranging between 0.2 and 0.74. 
<tt>DeepSeek-R1</tt> exhibits
inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83.

![CCEI Distribution per model](figures/investment/investment_violin.svg)

## Preferences
To analyse the behaviour of generative agents based on their preferences, we
rely on the dictator game. This variant of the ultimatum game features a single
player, the dictator, who decides how to distribute an endowment (e.g., a sum of
money) between themselves and a second player, the recipient. The dictator has
complete freedom in this allocation, while the recipient, having no influence
over the outcome, takes on a passive role.

First, we evaluate the choices made by LLMs when playing the role of the
dictator, considering these decisions as a reflection of their intrinsic
preferences. Then, we subject them to specific instructions incorporating
preferences to assess their ability to consider them in their decisions.

### Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic
preferences. Each LLM is asked to directly produce a one-shot action in the
dictator game. We also ask the models to generate a strategy in
the form of an algorithm implemented in the <tt>Python</tt> language. In all our
experiments, one-shot actions are repeated 30 times, and the models' temperature
is set to $0.7$.
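As an illustration of the kind of algorithmic strategy requested (a hypothetical example, not an actual model output), an equal-split strategy might look as follows:

```python
def dictator_strategy(endowment: int = 100) -> tuple[int, int]:
    """Return (dictator_share, recipient_share) for the given endowment."""
    dictator_share = endowment // 2   # an equal split, the most common strategy
    return dictator_share, endowment - dictator_share
```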

The figure below presents a violin plot illustrating the share of the
total amount (\$100) that the dictator allocates to themselves for each model.
The median share taken by <tt>GPT-4.5</tt>, <tt>Llama3</tt>,
<tt>Mistral-Small</tt>, and <tt>DeepSeek-R1</tt> through one-shot decisions is
\$50, likely due to corpus-based biases such as term frequency.
The median share taken by <tt>Mixtral:8x7b</tt> and <tt>Llama3.3:latest</tt>
is \$60. When we ask the
models to generate a strategy rather than a one-shot action, all models
distribute the amount equally, except <tt>GPT-4.5</tt>, which retains about
$70\%$ of the total amount. Interestingly, under these standard conditions,
humans typically keep \$80 on average. When the role
assigned to the model is that of a human rather than an assistant agent, only
Llama3 deviates with a median share of \$60. Unlike the deterministic strategies
generated by LLMs, the intra-model variability in generated actions can be used
to simulate the diversity of human behaviours based on their experiences,
preferences, or contexts.

![Violin Plot of My Share for Each Model](figures/dictator/dictator_violin.svg)

Our sensitivity analysis of the temperature parameter reveals that the portion
retained by the dictator remains stable. However, the decisions become more
deterministic at low temperatures, whereas allocation diversity increases at
high temperatures, reflecting a more random exploration of available options.

![My Share vs Temperature with Confidence Interval](figures/dictator/dictator_temperature.svg)

### Preference Alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

1. **Egoism** maximizes the dictator’s income.
2. **Altruism** maximizes the recipient’s income.
3. **Utilitarianism** maximizes total income.
4. **Egalitarianism** maximizes the minimum income between the players.

We consider four allocation options where part of the money is lost in the division process,
each corresponding to one of the four preferences (a small sketch of the corresponding welfare functions follows the list):

- The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (**egoistic**).
- The dictator keeps $100, the recipient receives $500, and $400 is lost (**altruistic**).
- The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (**utilitarian**).
- The dictator keeps $325, the other player receives $325, and $350 is lost (**egalitarian**).
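The sketch below (a hypothetical helper, not part of the experimental protocol) makes the four welfare functions explicit and checks that each allocation option is indeed the maximiser of its corresponding preference:

```python
# Hypothetical helper (not part of the benchmark): each option gives
# (dictator, recipient) payoffs; each welfare function scores an option.
OPTIONS = {
    "egoistic":    (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

WELFARE = {
    "egoistic":    lambda d, r: d,            # dictator's income
    "altruistic":  lambda d, r: r,            # recipient's income
    "utilitarian": lambda d, r: d + r,        # total income
    "egalitarian": lambda d, r: min(d, r),    # minimum income
}

def best_option(preference):
    """Label of the option that maximises the requested social welfare."""
    score = WELFARE[preference]
    return max(OPTIONS, key=lambda label: score(*OPTIONS[label]))

# Each preference selects its matching option, e.g. the utilitarian
# welfare of (400, 300) is 700, the largest total among the four options.
assert all(best_option(p) == p for p in WELFARE)
```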

Table below evaluates the ability of the models to align with different preferences.
- When generating **strategies**, the models align perfectly with preferences, except for
  <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**, 
  - <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
  - <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
  - <tt>Mistral-Small</tt> aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
  - <tt>DeepSeek-R1</tt> primarily aligns with **utilitarianism** but has low accuracy in other preferences.
While a larger LLM typically aligns better with preferences, a model like <tt>Mixtral-8x7B</tt> may occasionally
underperform compared to its smaller counterpart, <tt>Mistral-Small</tt>, because of its architectural complexity.
Mixture-of-Experts (MoE) models such as Mixtral dynamically activate only a subset of their parameters;
if the routing mechanism is not well tuned, it may select sub-optimal experts, leading to degraded performance.


| **Model**                    | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
|------------------------------|----------------|--------------|----------------|-----------------|-----------------|
| **<tt>GPT-4.5</tt>**         | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3</tt>**          | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Mixtral:8x7b</tt>**    | **Strategy**   | -            | -              | -               | -               |
| **<tt>Mistral-Small</tt>**   | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1</tt>**     | **Strategy**   | -            | -              | -               | -               |
| **<tt>GPT-4.5</tt>**         | **Actions**    | 1.00         | 1.00           | 0.50            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Actions**    | 1.00         | 1.00           | 0.43            | 0.96            |
| **<tt>Llama3</tt>**          | **Actions**    | 1.00         | 0.90           | 0.40            | 0.73            |
| **<tt>Mixtral:8x7b</tt>**    | **Actions**    | 0.00         | 0.00           | 0.30            | 1.00            |
| **<tt>Mistral-Small</tt>**   | **Actions**    | 0.40         | 0.94           | 0.76            | 0.16            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Actions**    | 0.46         | 0.56           | 0.66            | 0.90            |
| **<tt>DeepSeek-R1</tt>**     | **Actions**    | 0.06         | 0.20           | 0.76            | 0.03            |

Errors in action selection may stem from either arithmetic miscalculations  
(e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or  
misinterpretations of preferences. For example, the model `DeepSeek-R1`,  
adopting utilitarian preferences, justifies its choice by stating, "I think  
fairness is key here".

In summary, our results indicate that the models `GPT-4.5`,  
`Llama3`, and `Mistral-Small` generally align well with  
preferences but have more difficulty generating individual actions than  
algorithmic strategies. In contrast, `DeepSeek-R1` does not generate  
valid strategies and performs poorly when generating specific actions.

## Rationality

An autonomous agent is rational if it chooses the optimal action based on its
beliefs. This agent satisfies second-order rationality if it is rational and
believes that other agents are rational. In other words, a second-order rational
agent does not only consider the best choice for itself but also anticipates how
others make their decisions. Experimental game theory studies show that 93 % of
human subjects are rational, while 71 % exhibit second-order
rationality.

Forsythe, R., Horowitz, J.L., Savin, N.E., Sefton, M.: *Fairness in Simple Bargaining
Experiments.* Games and Economic Behavior 6(3), 347–369 (1994),
https://doi.org/10.1006/game.1994.1021

To evaluate the first- and second-order rationality of generative autonomous
agents, we consider a simplified version of the ring-network game,
which involves two players seeking to maximize their own payoff. Each player has
two available actions, and the payoff matrix is presented below.

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X**     | (15,10)    | (5,5)     |
| **Strategy Y**     | (0,5)      | (10,0)    |

If Player 2 is rational, they must choose A because B is strictly dominated. If
Player 1 is rational, they may choose either X or Y: X is the best response if
Player 1 believes that Player 2 will choose A, while Y is the best response if
Player 1 believes that Player 2 will choose B. If Player 1 satisfies
second-order rationality, they must play X. To neutralize biases in large
language models (LLMs) related to the naming of actions, we reverse the action
names in half of the experiments.
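The following minimal sketch (illustrative only) verifies this reasoning directly on the payoff matrix:

```python
# Minimal illustrative check of the reasoning above.
# payoffs[(row, col)] = (Player 1's payoff, Player 2's payoff)
payoffs = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

# Player 2: A strictly dominates B if it pays more against both X and Y.
a_dominates_b = all(
    payoffs[(row, "A")][1] > payoffs[(row, "B")][1] for row in ("X", "Y")
)

def best_response_p1(belief):
    """Player 1's best response given a belief about Player 2's action."""
    return max(("X", "Y"), key=lambda row: payoffs[(row, belief)][0])

assert a_dominates_b                   # a rational Player 2 plays A
assert best_response_p1("A") == "X"    # second-order rationality: play X
assert best_response_p1("B") == "Y"    # but Y is best if Player 2 played B
```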

We consider three types of beliefs:
- an *implicit belief*, where the optimal action must be deduced from
  the natural language description of the payoff matrix;
- an *explicit belief*, based on the analysis of Player 2's actions, meaning that
  the fact that B is strictly dominated by A is provided in the prompt;
- a *given belief*, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.


### First-Order Rationality

Table below evaluates the models’ ability to generate rational
behaviour for Player 2.

| **Model**         | **Generation** | **Given** | **Explicit** | **Implicit** |
|-------------------|--------------|-----------|--------------|--------------|
| <tt>gpt-4.5</tt>  | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>mixtral:8x7b</tt> | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>mistral-small</tt> | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>llama3.3:latest</tt> | strategy     | 1.00      | 1.00         | 0.50         |
| <tt>llama3</tt>   | strategy     | 0.50      | 0.50         | 0.50         |
| <tt>deepseek-r1:7b</tt> | strategy     | -         | -            | -            |
| <tt>deepseek-r1</tt> | strategy     | -         | -            | -            |
| **—**             | **—**        | **—**     | **—**        | **—**        |
| <tt>gpt-4.5</tt>  | actions      | 1.00      | 1.00         | 1.00         |
| <tt>mixtral:8x7b</tt> | actions      | 1.00      | 1.00         | 1.00         |
| <tt>mistral-small</tt> | actions      | 1.00      | 1.00         | 0.87         |
| <tt>llama3.3:latest</tt> | actions      | 1.00      | 1.00         | 1.00         |
| <tt>llama3</tt>   | actions      | 1.00      | 0.90         | 0.17         |
| <tt>deepseek-r1:7b</tt> | actions      | 1.00      | 1.00         | 1.00         |
| <tt>deepseek-r1</tt> | actions      | 0.83      | 0.57         | 0.60         |


When generating strategies, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, and <tt>Mistral-Small</tt>
exhibit rational behaviour, whereas <tt>Llama3</tt> chooses at random;
<tt>Llama3.3:latest</tt> shows the same random behaviour with implicit beliefs.
<tt>DeepSeek-R1:7b</tt> and <tt>DeepSeek-R1</tt> fail to generate valid strategies.
When generating actions, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, <tt>DeepSeek-R1:7b</tt>,
and <tt>Llama3.3:latest</tt> demonstrate strong rational decision-making, even with implicit beliefs.
<tt>Mistral-Small</tt> performs well but slightly lags in handling implicit reasoning. 
<tt>Llama3</tt> struggles with implicit reasoning, while <tt>DeepSeek-R1</tt> 
shows inconsistent performance. 
Overall, <tt>GPT-4.5</tt> and <tt>Mixtral-8x7B</tt> are the most reliable models for generating rational behavior.


### Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of
the payoff matrix for Player 1 in the table below: (a) the
original configuration, (b) the reduction of the gap between the gains, (c) the
decrease in the gain for the good choice X, and (d) the increase in the gain for
the bad choice Y.

| **Player 1 \ Player 2** | **a: A** | **a: B** | **b: A** | **b: B** | **c: A** | **c: B** | **d: A** | **d: B** |
|-------------------------|----------|----------|----------|----------|----------|----------|----------|----------|
| **X**                   | 15       | 5        | 8        | 7        | 6        | 5        | 15       | 5        |
| **Y**                   | 0        | 10       | 7        | 8        | 0        | 10       | 0        | 40       |
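As a sanity check, the sketch below (illustrative, using Player 1's payoffs from the table) confirms that X remains the best response to A in every variant, although the margin over Y narrows sharply in (b) and (c), while (d) makes Y more tempting against B:

```python
# Player 1's payoffs against Player 2's actions A and B in variants (a)-(d).
variants = {
    "a": {("X", "A"): 15, ("X", "B"): 5, ("Y", "A"): 0, ("Y", "B"): 10},
    "b": {("X", "A"): 8,  ("X", "B"): 7, ("Y", "A"): 7, ("Y", "B"): 8},
    "c": {("X", "A"): 6,  ("X", "B"): 5, ("Y", "A"): 0, ("Y", "B"): 10},
    "d": {("X", "A"): 15, ("X", "B"): 5, ("Y", "A"): 0, ("Y", "B"): 40},
}

for name, m in variants.items():
    # A second-order rational Player 1 believes Player 2 plays A.
    best = max(("X", "Y"), key=lambda row: m[(row, "A")])
    margin = m[("X", "A")] - m[("Y", "A")]
    print(f"variant {name}: best response to A is {best} (margin {margin})")
# variant a: X (margin 15), b: X (margin 1), c: X (margin 6), d: X (margin 15)
```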


Table below evaluates the models' ability to generate second-order
rational behaviour for player 1. 


When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations 
except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, 
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt> 
demonstrate strong  capabilities across all conditions, consistently generating second-order rational behavior. 
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
Neither <tt>DeepSeek-R1</tt> nor <tt>DeepSeek-R1:7b</tt> produces valid responses in strategy generation.

When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>Llama3</tt> also adapts to the different beliefs and
payoff adjustments, although its accuracy drops with implicit beliefs in variants (b) and (d). <tt>GPT-4.5</tt> performs well in the initial
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d),
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to the smaller <tt>DeepSeek-R1</tt>,
performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> struggles with lower accuracy overall, particularly for implicit beliefs.


| **Version**         |                | **a**     |              |              | **b**     |              |              | **c**     |              |              | **d**     |              |              |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model**           | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **gpt-4.5**         | strategy       | 1.00      | 1.00         | 1.00         | 0.00      | 0.00         | 0.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **llama3.3:latest** | strategy       | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         |
| **llama3**          | strategy       | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         |
| **mixtral:8x7b**    | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **mistral-small**   | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **deepseek-r1:7b**  | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **deepseek-r1**     | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **gpt-4.5**         | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 0.67         | 0.00         | 0.86      | 0.83         | 0.00         | 0.50      | 0.90         | 0.00         |
| **llama3.3:latest** | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.20         | 1.00      | 1.00         | 0.00         |
| **llama3**          | actions        | 0.97      | 1.00         | 1.00         | 0.77      | 0.80         | 0.60         | 0.97      | 0.90         | 0.93         | 0.83      | 0.90         | 0.60         |
| **mixtral:8x7b**    | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.73         |
| **mistral-small**   | actions        | 0.93      | 0.97         | 1.00         | 0.87      | 0.77         | 0.60         | 0.77      | 0.60         | 0.70         | 0.73      | 0.57         | 0.37         |
| **deepseek-r1:7b**  | actions        | 1.00      | 0.96         | 1.00         | 1.00      | 1.00         | 0.93         | 0.96      | 1.00         | 0.92         | 0.96      | 1.00         | 0.79         |
| **deepseek-r1**     | actions        | 0.80      | 0.53         | 0.57         | 0.67      | 0.60         | 0.53         | 0.67      | 0.63         | 0.47         | 0.70      | 0.50         | 0.57         |

Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
Mistral-Small model with given beliefs justifies its poor decision as
follows: "Since player 2 is rational and A strictly dominates B, player 2 will
choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
(40). Therefore, choosing Y maximizes my gain."


In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and 
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has 
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with 
implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds when given explicit or 
given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality but its performance declines with 
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> fails to generate
valid strategies and often makes irrational choices, while <tt>Llama3</tt> generates strategies randomly but adapts
better when producing specific actions.


## Beliefs

Beliefs — whether implicit, explicit, or
given — are crucial for an autonomous agent's decision-making process. They
allow for anticipating the actions of other agents.

### Refine Beliefs

To assess the agents' ability to refine their beliefs in predicting their
interlocutor's next action, we consider a simplified version of the
Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repetition model;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for $N$ rounds, and the player's accuracy is  evaluated at each round.

For our experiments, we consider three simple models for the opponent, where:
- the opponent's actions remain constant (always R, S, or P);
- the opponent's actions follow a two-step loop (R-P, P-S, or S-R);
- the opponent's actions follow a three-step loop (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by
calculating the average number of points earned per round, as sketched below.
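The sketch below (illustrative; the round count is an assumption) shows the three scripted opponents and the per-round scoring used for this evaluation:

```python
from itertools import cycle

OPPONENTS = {
    "constant":   lambda: cycle(["R"]),            # always the same move (R, S, or P)
    "two-step":   lambda: cycle(["R", "P"]),       # R-P loop (likewise P-S or S-R)
    "three-step": lambda: cycle(["R", "P", "S"]),  # R-P-S loop
}

def average_points(predictor, opponent, rounds=30):
    """Average points per round: 1 for a correct prediction, 0 otherwise."""
    moves, history, points = opponent(), [], 0
    for _ in range(rounds):
        guess = predictor(history)          # predict the opponent's next move
        actual = next(moves)
        points += int(guess == actual)
        history.append(actual)
    return points / rounds
```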

The figures below present the average points earned per round and the
95% confidence interval for each LLM against the three opponent behaviour
models in this simplified version of the Rock-Paper-Scissors (RPS) game,
whether the LLM generates a strategy or one-shot actions.

Neither <tt>Llama3</tt> nor <tt>DeepSeek-R1</tt> was able to generate a valid strategy.
<tt>DeepSeek-R1:7b</tt> was unable to generate either a valid strategy
or consistently valid actions. The strategies generated by the <tt>GPT-4.5</tt>
and <tt>Mistral-Small</tt> models attempt to predict the opponent's next move based
on previous rounds by identifying the most frequently played move (a sketch of this heuristic follows below).
While these strategies are effective against an opponent with constant behaviour,
they fail to predict the opponent's next move when the latter adopts a more complex model.
We observe that the performance of most LLMs in action generation,
except for <tt>Llama3.3:latest</tt>, <tt>Mixtral:8x7b</tt>, and <tt>Mistral-Small</tt>
when facing a constant strategy, is barely better than a <tt>random</tt> strategy.
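A minimal sketch of this frequency heuristic (a paraphrase, not a verbatim model output) is shown below; plugged into the scoring loop sketched earlier, it scores close to 1 against a constant opponent but fails against the loops:

```python
from collections import Counter

def most_frequent_move(history):
    """Predict the move the opponent has played most often so far."""
    if not history:
        return "R"                      # arbitrary opening guess
    return Counter(history).most_common(1)[0][0]

# average_points(most_frequent_move, OPPONENTS["constant"]) is close to 1,
# whereas the two-step and three-step opponents defeat this heuristic.
```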


![Average Points Earned per Round By Strategies Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_strategies.svg)
![Average Points Earned per Round By Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_models.svg)

![Average Points Earned per Round by Strategies Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_strategies.svg)
![Average Points Earned per Round by Actions Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_models.svg)

![Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_strategies.svg)
![Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_models.svg)

### Assess Beliefs

To assess the agents’ ability to factor the prediction of their opponent’s next
move into their decision-making, we analyse the performance of each generative
agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
and a loss 0 points.
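Integrating such a belief into the decision is then a single lookup: once the opponent's next move is predicted, the winning reply follows, as in the hypothetical sketch below:

```python
BEATS = {"R": "P", "P": "S", "S": "R"}   # the move that beats each move

def best_reply(predicted_move):
    """Turn a belief about the opponent's next move into the winning move
    (2 points for a win, 1 for a draw, 0 for a loss)."""
    return BEATS[predicted_move]
```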

The figure below illustrates the average points earned per round along with
the 95% confidence interval for each LLM when facing constant strategies,
when the model generates one-shot actions.
Even though <tt>Mixtral:8x7b</tt> and <tt>Mistral-Small</tt> accurately predict their
opponent’s move, they fail to integrate this belief into
their decision-making process. Only <tt>Llama3.3:latest</tt> is capable of inferring
the opponent’s behaviour and choosing the winning move.

In summary, generative autonomous agents struggle to anticipate or effectively
incorporate other agents’ actions into their decision-making.

![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)


## Synthesis

Our findings reveal notable differences in the cognitive capabilities of LLMs 
across multiple dimensions of decision-making.
<tt>GPT-4.5</tt>, <tt>Llama3.3:latest</tt>, and <tt>DeepSeek-R1:7b</tt> demonstrate the highest level of consistency in economic decision-making,
with <tt>Mistral-Small</tt> and <tt>Mixtral:8x7b</tt> close behind, <tt>Llama3</tt> showing moderate adherence, and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.

<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences, 
particularly when generating algorithmic strategies rather than isolated one-shot actions. 
These models tend to struggle more with one-shot decision-making, where responses are less structured and 
more prone to inconsistency. In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and 
performs poorly in aligning actions with specified preferences.
<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behavior at both first- and second-order levels.
<tt>Llama3</tt>, although prone to random behavior when generating strategies, adapts more effectively in one-shot 
decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly in both strategic and one-shot formats, rarely
exhibiting  coherent rationality.

All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents 
into their own decisions. Despite some being able to identify patterns, 
most fail to translate these beliefs into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to
infer and act on opponents’ simple behaviour.

## Authors

Maxime MORGE

## License

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.