Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate
the social behaviors of LLM-based agents.
## Dictator Game
The dictator game is a classic game used to analyze players' personal preferences.
In this game, there are two players: the dictator and the recipient. Given two allocation options,
the dictator chooses one allocation, while the recipient must accept the option chosen by the dictator.
Here, the dictator's choice is considered to reflect their personal preferences.
### Default preferences
The dictator's choice reflects the LLM's preferences.
The figure below presents a violin plot depicting the share of the total amount (\$100)
that the dictator allocates to themselves for each model.
The temperature is fixed at 0.7, and each experiment was conducted 30 times.
The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is 50.
When we prompt the models to generate a strategy in the form of an algorithm implemented
in the Python programming language, rather than generating an action, all models divide
the amount fairly except for GPT-4.5, which takes approximately 70% of the total amount for itself.
It is worth noting that, under these standard conditions, humans typically keep an average of around \$80
(Forsythe et al. 1994). It is interesting to note that the variability observed between different executions
of the same LLM is comparable to the diversity of behaviors observed in humans. In other words,
this intra-model variability can be used to simulate the diversity of human behaviors based on
their experiences, preferences, or contexts.
This prototype allows us to analyse the potential of Large Language Models (LLMs) for
social simulation by assessing their ability to: (a) make decisions aligned
with explicit preferences; (b) adhere to principles of rationality; and (c)
refine their beliefs to anticipate the actions of other agents. Through
game-theoretic experiments, we show that certain models, such as
`GPT-4.5` and `Mistral-Small`, exhibit consistent behaviours in
simple contexts but struggle with more complex scenarios requiring
anticipation of other agents' behaviour. Our study outlines research
directions to overcome the current limitations of LLMs.
## Preferences
To analyse the behaviour of generative agents based on their preferences, we
rely on the dictator game. This variant of the ultimatum game features a single
player, the dictator, who decides how to distribute an endowment (e.g., a sum of
money) between themselves and a second player, the recipient. The dictator has
complete freedom in this allocation, while the recipient, having no influence
over the outcome, takes on a passive role.
First, we evaluate the choices made by LLMs when playing the role of the
dictator, considering these decisions as a reflection of their intrinsic
preferences. Then, we subject them to specific instructions incorporating
preferences to assess their ability to consider them in their decisions.
### Preference Elicitation
Here, we consider that the choice of an LLM as a dictator reflects its intrinsic
preferences. Each LLM was asked to directly produce a one-shot action in the
dictator game. Additionally, we also asked the models to generate a strategy in
the form of an algorithm implemented in the Python language. In all our
experiments, one-shot actions are repeated 30 times, and the models' temperature
is set to 0.7.
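For illustration, a model-generated strategy typically takes the form of a short Python function returning the split. The sketch below is a hypothetical example of the fair-split strategy most models produce (the function name and signature are ours, not the models' verbatim output):

```python
def dictator_strategy(total_amount: int = 100) -> tuple[int, int]:
    """Hypothetical model-generated strategy for the one-shot dictator game."""
    my_share = total_amount // 2               # equal split, as most models choose
    return my_share, total_amount - my_share   # (dictator's share, recipient's share)

print(dictator_strategy())  # (50, 50)
```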
The figure below presents a violin plot illustrating the share of the
total amount ($100) that the dictator allocates to themselves for each model.
The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1
through one-shot decisions is 50.
![Violin Plot of My Share for Each Model](figures/dictator/dictator_violin.svg)
When we ask the models to generate a strategy rather than a one-shot action, all
models distribute the amount equally, except GPT-4.5, which retains
about 70% of the total amount. Interestingly, under these standard
conditions, humans typically keep $80 on average.
*[Fairness in Simple Bargaining Experiments](https://doi.org/10.1006/game.1994.1021)*
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M.
Games and Economic Behavior, 6(3), 347-369. 1994.
Unlike the deterministic strategies generated by LLMs, the intra-model variability in
generated actions can be used to simulate the diversity of human behaviours based
on their experiences, preferences, or contexts.
The figure below shows the evolution of the share of the total amount ($100) that the dictator allocates
to themselves as a function of temperature for each model, along with the 95% confidence interval,
when the models are asked to generate one-shot actions. Each experiment was conducted 30 times.
![My Share vs Temperature with Confidence Interval](figures/dictator/dictator_temperature.svg)
Our sensitivity analysis of the temperature parameter reveals that the portion
retained by the dictator remains stable. However, the decisions become more
deterministic at low temperatures, whereas allocation diversity increases at
high temperatures, reflecting a more random exploration of available options.
### Preference alignment
We define four preferences for the dictator:
1. She prioritizes her own interests, aiming to maximize her own income (selfish).
2. She prioritizes the other player’s interests, aiming to maximize their income (altruism).
3. She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
4. She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).
We consider 4 allocation options where money can be lost in the division, each corresponding to one of the four preferences:
1. The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
2. The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruism).
3. The dictator keeps 400, the other player receives 300, resulting in a 300 loss (utilitarian).
4. The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian).
The following table presents the accuracy of the dictator's decision for each model and preference,
regardless of whether the models were prompted to generate a strategy or specific actions.
The temperature is set to 0.7, and each experiment involving action generation was repeated 30 times.
| *Model* | *Generation* | *Selfish* | *Altruistic* | *Utilitarian* | *Egalitarian* |
|-----------------|--------------|-----------|--------------|---------------|---------------|
| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *llama3* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *deepseek-r1* | *strategy* | - | - | - | - |
| *gpt-4.5* | *actions* | 1.00 | 1.00 | 0.50 | 1.00 |
| *llama3* | *actions* | 1.00 | 0.90 | 0.40 | 0.73 |
| *mistral-small* | *actions* | 0.40 | 0.93 | 0.76 | 0.16 |
| *deepseek-r1* | *actions* | 0.06 | 0.20 | 0.76 | 0.03 |
This table helps assess the models’ ability to align with different preferences.
When models are explicitly prompted to generate strategies,
they exhibit perfect alignment with the predefined preferences except for DeepSeek-R1,
which does not generate valid code.
When models are prompted to generate actions, GPT-4.5 aligns well across all preferences but struggles with utilitarianism.
Llama3 performs well for selfish and altruistic preferences but shows weaker alignment for
utilitarian and egalitarian choices.
Mistral-Small aligns best with altruistic preferences and maintains moderate performance on utilitarianism,
but struggles with selfish and egalitarian preferences.
DeepSeek-R1 performs best for utilitarianism but has poor accuracy in the other categories.
Bad action selections can be explained either by arithmetic errors (e.g., wrongly assuming that 500 + 100 > 400 + 300)
or by misinterpretations of preferences (e.g., "I'm choosing to prioritize the common interest by keeping a
relatively equal split with the other player").
We define four preferences for the dictator, each corresponding to a distinct form of social welfare:
1. **Egoism** maximizes the dictator’s income.
2. **Altruism** maximizes the recipient’s income.
3. **Utilitarianism** maximizes total income.
4. **Egalitarianism** maximizes the minimum income between the players.
We consider four allocation options where part of the money is lost in the division process,
each corresponding to one of the four preferences (a sketch of how each preference selects among these options follows the list):
- The dictator keeps **$500**, the recipient receives **$100**, and a total of **$400** is lost (**egoistic**).
- The dictator keeps **$100**, the recipient receives **$500**, and **$400** is lost (**altruistic**).
- The dictator keeps **$400**, the recipient receives **$300**, resulting in a loss of **$300** (**utilitarian**).
- The dictator keeps **$325**, the recipient also receives **$325**, and **$350** is lost (**egalitarian**).
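As an illustration of how each preference singles out exactly one of these options, the following sketch (our own, not model output) scores the four allocations under each social-welfare criterion:

```python
# Allocation options: (dictator_share, recipient_share)
options = {
    "egoistic":    (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# Social-welfare criteria used to score an allocation (d, r)
criteria = {
    "egoistic":    lambda d, r: d,          # maximize the dictator's income
    "altruistic":  lambda d, r: r,          # maximize the recipient's income
    "utilitarian": lambda d, r: d + r,      # maximize the total income
    "egalitarian": lambda d, r: min(d, r),  # maximize the minimum income
}

# Each criterion selects the option designed for it; e.g. the utilitarian
# criterion picks (400, 300) because 400 + 300 = 700 exceeds 500 + 100 = 600.
for name, score in criteria.items():
    best = max(options, key=lambda option: score(*options[option]))
    print(f"{name:12s} -> {best}")
```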
The table below evaluates the ability of the models to align with different preferences.
- When generating **strategies**, the models align perfectly with preferences, except for **`DeepSeek-R1`**, which does not generate valid code.
- When generating **actions**, **`GPT-4.5`** aligns well with preferences but struggles with **utilitarianism**.
- **`Llama3`** aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
- **`Mistral-Small`** aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
- **`DeepSeek-R1`** primarily aligns with **utilitarianism** but has low accuracy in other preferences.
| **Model** | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
|---------------------|---------------|-------------|---------------|---------------|---------------|
| **`GPT-4.5`** | **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`Llama3`** | **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`Mistral-Small`**| **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`DeepSeek-R1`** | **Strategy** | - | - | - | - |
| **`GPT-4.5`** | **Actions** | 1.00 | 1.00 | 0.50 | 1.00 |
| **`Llama3`** | **Actions** | 1.00 | 0.90 | 0.40 | 0.73 |
| **`Mistral-Small`**| **Actions** | 0.40 | 0.93 | 0.76 | 0.16 |
| **`DeepSeek-R1`** | **Actions** | 0.06 | 0.20 | 0.76 | 0.03 |
Errors in action selection may stem from either arithmetic miscalculations
(e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or
misinterpretations of preferences. For example, the model `DeepSeek-R1`,
adopting utilitarian preferences, justifies its choice by stating, "I think
fairness is key here".
In summary, our results indicate that the models `GPT-4.5`,
`Llama3`, and `Mistral-Small` generally align well with
preferences but have more difficulty generating individual actions than
algorithmic strategies. In contrast, `DeepSeek-R1` does not generate
valid strategies and performs poorly when generating specific actions.
## Rationality
An autonomous agent is rational if she plays a best response to her beliefs.
She satisfies second-order rationality if she is rational and also believes that others are rational.
In other words, a second-order rational agent not only considers the best course of action for herself
but also anticipates how others make their decisions.
An autonomous agent is rational if it chooses the optimal action based on its
beliefs. This agent satisfies second-order rationality if it is rational and
believes that other agents are rational. In other words, a second-order rational
agent does not only consider the best choice for itself but also anticipates how
others make their decisions. Experimental game theory studies show that 93 % of
human subjects are rational, while 71 % exhibit second-order
rationality.
To assess players' first- and second-order rationality, we consider a simplified version of the
ring-network game introduced by Kneeland (2015), whose experiments demonstrate that 93% of the
subjects are rational, while 71% exhibit second-order rationality.
*[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)*
Kneeland, T.
Econometrica, 83(5), 2065-2079. 2015.
This game features two players, each with two available strategies, where
both players aim to maximize their own payoff.
The corresponding payoff matrix is shown below:
To evaluate the first- and second-order rationality of generative autonomous
agents, we consider a simplified version of the ring-network game,
which involves two players seeking to maximize their own payoff. Each player has
two available actions, and the payoff matrix is presented below.
| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X** | (15,10) | (5,5) |
| **Strategy Y** | (0,5) | (10,0) |
If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
If Player 1 is rational, she can choose either X or Y since X is the best response if she believes Player 2 will play A and
Y is the best response if she believes Player 2 will play B.
If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A and
since X is the best response to A, Player 1 will choose X.
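The reasoning above can be checked mechanically. The sketch below (our own illustration) verifies that B is strictly dominated for Player 2 and derives Player 1's second-order rational action from the payoff matrix:

```python
# Payoffs (Player 1, Player 2) in the simplified ring-network game
payoffs = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

# First-order rationality of Player 2: B is strictly dominated by A
b_dominated = all(payoffs[(x, "A")][1] > payoffs[(x, "B")][1] for x in ("X", "Y"))
print("B strictly dominated by A:", b_dominated)  # True (10 > 5 and 5 > 0)

# Second-order rationality of Player 1: believing Player 2 plays A,
# choose the best response to A.
best_response = max(("X", "Y"), key=lambda x: payoffs[(x, "A")][0])
print("Second-order rational action for Player 1:", best_response)  # X (15 > 0)
```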
We consider three types of belief:
- *implicit* belief, where the optimal action must be inferred from the natural language description of the payoff matrix;
- *explicit* belief, which focuses on the analysis of Player 2's actions: the fact that Strategy B is strictly dominated by Strategy A is provided in the prompt;
- *given* belief, where the optimal action for Player 1 is explicitly stated in the prompt.
### First order rationality
The models evaluated include GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1.
The results indicate how well each model performs under each belief type.
| *Model* | *Generation* | *Given* | *Explicit* | *Implicit* |
|-----------------|--------------|---------|------------|------------|
| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *strategy* | 1.00 | 1.00 | 1.00 |
| *llama3*        | *strategy*   | 0.50    | 0.50       | 0.50       |
| *deepseek-r1* | *strategy* | - | - | - |
| *gpt-4.5* | *actions* | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *actions* | 1.00 | 1.00 | 0.87 |
| *llama3* | *actions* | 1.00 | 0.90 | 0.17 |
| *deepseek-r1* | *actions* | 0.83 | 0.57 | 0.60 |
When the models generate strategies instead of selecting individual actions, GPT-4.5 and
Mistral-Small exhibit rational behaviour, while Llama3 uses a random strategy.
DeepSeek-R1 does not generate valid code.
When the models generate individual actions instead of a strategy,
GPT-4.5 achieves a perfect score across all belief types,
demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition.
Mistral-Small consistently outperforms the other open-weight models across all belief types.
Its strong performance with implicit belief indicates that it can effectively
deduce the optimal action from the payoff matrix description.
Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
suggesting it may struggle to infer optimal actions solely from natural language descriptions.
DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
indicating it may not be as good a candidate for simulating rationality as the other models.
### Second-order rationality
To adjust the difficulty of taking the optimal
action, we consider four versions of the player's payoff matrix:
- (a) the original setup;
- (b) we reduce the difference in payoffs;
- (c) we increase the expected payoff for the incorrect choice Y;
- (d) we decrease the expected payoff for the correct choice X.
| **Action \ Opponent Action (version)** | **A(a)** | **B(a)** | | **A(b)** | **B(b)** | | **A(c)** | **B(c)** | | **A(d)** | **B(d)** |
|----------------------------------------|----------|----------|-|----------|----------|-|----------|----------|-|----------|----------|
| **X** | 15 | 5 | | 8 | 7 | | 6 | 5 | | 15 | 5 |
| **Y** | 0 | 10 | | 7 | 8 | | 0 | 10 | | 0 | 40 |
| Model | Generation | Given (a) | Explicit (a) | Implicit (a) | Given (b) | Explicit (b) | Implicit (b) | Given (c) | Explicit (c) | Implicit (c) | Given (d) | Explicit (d) | Implicit (d) |
|---------------|------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| llama3 | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| mistral-small | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| deepseek-r1 | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| gpt-4.5 | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| llama3 | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| mistral-small | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| deepseek-r1 | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
When the models generate strategies, GPT-4.5 performs perfectly in setups (a), (c), and (d) but
fails in setup (b) to differentiate the optimal strategy from a near-optimal one.
Llama3 adopts a random approach to decision-making rather than a structured understanding of rationality.
Mistral-Small consistently achieves a 100% success rate across all setups, demonstrating robust reasoning abilities.
DeepSeek-R1 does not produce valid responses, further reinforcing that it may not be a viable candidate
for generating rational strategies.
When they generate individual actions, GPT-4.5 achieves perfect performance in the standard setup (a) but struggles significantly with implicit belief
when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward,
it is confused by the altered payoffs.
Llama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types
and adjusted payoff matrices.
Mistral-Small, while performing well with given and explicit beliefs, faces challenges with implicit belief, particularly in version (d).
DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.
## Belief
To evaluate the ability of LLMs to refine their beliefs by predicting the opponent's next move,
we consider a simplified version of the Rock-Paper-Scissors game.
Rules:
1. The opponent follows a hidden strategy (repeating pattern).
2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
3. A correct guess earns 1 point, and an incorrect guess earns 0 points.
4. The game can run for N rounds, and the player's accuracy is evaluated at each round, as sketched below.
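A minimal sketch of this evaluation loop, assuming a three-step repeating opponent (R-P-S) and a naive last-move predictor as a placeholder for the LLM (names and structure are ours):

```python
import itertools

# Hidden opponent strategy: a repeating three-step pattern R-P-S
opponent = itertools.cycle(["Rock", "Paper", "Scissors"])

def naive_predictor(history):
    """Placeholder player: predicts that the opponent repeats its last move."""
    return history[-1] if history else "Rock"

history, points = [], 0
for _ in range(10):                        # a game of 10 rounds
    prediction = naive_predictor(history)
    actual = next(opponent)
    points += int(prediction == actual)    # 1 point per correct guess
    history.append(actual)

print(f"Average points per round: {points / 10:.2f}")
```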
We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns, regardless of whether the models were prompted to generate
a strategy or specific actions. The 95% confidence interval is also shown.
We find that the action generation performance of LLMs, whether proprietary or open-weight, is
only marginally better than a random strategy.
The strategies generated by GPT-4.5 and Mistral-Small predict the opponent's next
move based on past rounds by identifying the opponent's most frequent move. While this strategy
is effective against constant behavior, it fails to predict the opponent's next move when the opponent
adopts a more complex pattern. Neither Llama3 nor DeepSeek-R1 were able to generate a valid strategy.
If Player 2 is rational, they must choose A because B is strictly dominated. If
Player 1 is rational, they may choose either X or Y: X is the best response if
Player 1 believes that Player 2 will choose A, while Y is the best response if
Player 1 believes that Player 2 will choose B. If Player 1 satisfies
second-order rationality, they must play X. To neutralize biases in large
language models (LLMs) related to the naming of actions, we reverse the action
names in half of the experiments.
We consider three types of beliefs:
- an *implicit belief*, where the optimal action must be deduced from
the natural language description of the payoff matrix;
- an *explicit belief*, based on the analysis of player 2's actions, meaning that
the fact that B is strictly dominated by A is provided in the prompt;
- a *given belief*, where the optimal action for player 1 is explicitly given in the prompt.
We first evaluate the rationality of the agents and then their second-order rationality.
### First Order Rationality
The table below evaluates the models' ability to generate rational
behaviour for Player 2.
| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** |
|--------------------|--------------|----------|------------|------------|
| `gpt-4.5` | strategy | 1.00 | 1.00 | 1.00 |
| `mistral-small` | strategy | 1.00 | 1.00 | 1.00 |
| `llama3` | strategy | 0.50 | 0.50 | 0.50 |
| `deepseek-r1` | strategy | - | - | - |
| **—** | **—** | **—** | **—** | **—** |
| `gpt-4.5` | actions | 1.00 | 1.00 | 1.00 |
| `mistral-small` | actions | 1.00 | 1.00 | 0.87 |
| `llama3` | actions | 1.00 | 0.90 | 0.17 |
| `deepseek-r1` | actions | 0.83 | 0.57 | 0.60 |
When generating strategies, GPT-4.5 and Mistral-Small exhibit
rational behaviour, whereas Llama3 adopts a random strategy.
DeepSeek-R1 fails to generate valid output. When generating actions,
GPT-4.5 demonstrates its ability to make rational decisions, even with
implicit beliefs. Mistral-Small outperforms other open-weight models.
Llama3 struggles to infer optimal actions based solely on implicit
beliefs. DeepSeek-R1 is not a good candidate for simulating
rationality.
### Second-Order Rationality
To adjust the difficulty of optimal decision-making, we define four variants of
the payoff matrix for player 1 in Table below: (a) the
original configuration, (b) the reduction of the gap between the gains, (c) the
increase in the gain for the bad choice Y, and (d) the decrease in the gain for
the good choice X.
| **Player 1 \ Player 2 (version)** | **A (a)** | **B (a)** | **A (b)** | **B (b)** | **A (c)** | **B (c)** | **A (d)** | **B (d)** |
|-----------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| **X** | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
| **Y** | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
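As a quick check of what each variant demands (our own sketch, not model output), the second-order rational action remains X throughout: if Player 1 believes Player 2 plays A, X yields the higher payoff in every variant, even though variant (b) narrows the margin to a single point and variant (d) makes the payoff of Y against B very tempting.

```python
# Player 1's payoffs (against A, against B) in each variant of the matrix
variants = {
    "a": {"X": (15, 5), "Y": (0, 10)},
    "b": {"X": (8, 7),  "Y": (7, 8)},
    "c": {"X": (6, 5),  "Y": (0, 10)},
    "d": {"X": (15, 5), "Y": (0, 40)},
}

# A second-order rational Player 1 believes Player 2 plays A (first entry)
for name, rows in variants.items():
    best = max(rows, key=lambda action: rows[action][0])
    print(f"variant ({name}): best response to A is {best}")  # X in every variant
```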
The table below evaluates the models' ability to generate second-order
rational behaviour for player 1.
When the models generate strategies, GPT-4.5 exhibits second-order
rational behaviour in configurations (a), (c), and (d), but fails in
configuration (b) to distinguish the optimal action from a nearly optimal one.
Llama3 makes its decision randomly. Mistral-Small shows strong
capabilities in generating second-order rational behaviour. DeepSeek-R1
does not produce valid responses.
When generating actions, Llama3 adapts to different types of beliefs
and adjustments in the payoff matrix. GPT-4.5 performs well in the
initial configuration (a), but encounters significant difficulties when the
payoff structure changes (b, c, d), particularly with implicit beliefs. Although
Mistral-Small works well with given or explicit beliefs, it faces
difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
| **Model** | **Generation** | **Given (a)** | **Explicit (a)** | **Implicit (a)** | **Given (b)** | **Explicit (b)** | **Implicit (b)** | **Given (c)** | **Explicit (c)** | **Implicit (c)** | **Given (d)** | **Explicit (d)** | **Implicit (d)** |
|-----------|----------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|
| **gpt-4.5** | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **llama3** | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| **mistral-small** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
Mistral-Small model with given beliefs justifies its poor decision as
follows: "Since player 2 is rational and A strictly dominates B, player 2 will
choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
(40). Therefore, choosing Y maximizes my gain."
In summary, the results indicate that GPT-4.5 and
Mistral-Small generally adopt first- and second-order rational
behaviours. However, GPT-4.5 struggles to distinguish an optimal action
from a nearly optimal one, while Mistral-Small encounters difficulties
with implicit beliefs. Llama3 generates strategies randomly but adapts
better when producing specific actions. In contrast, DeepSeek-R1 fails
to provide valid strategies and generates irrational actions.
## Beliefs
Beliefs — whether implicit, explicit, or
given — are crucial for an autonomous agent's decision-making process. They
allow for anticipating the actions of other agents.
To assess the agents' ability to refine their beliefs in predicting their
interlocutor's next action, we consider a simplified version of the
Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repeating pattern;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for $N$ rounds, and the player's accuracy is evaluated at each round.
For our experiments, we consider three simple behaviour models for the opponent:
- the opponent's actions remain constant (R, S, or P, respectively);
- the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
- the opponent's actions follow a three-step loop model (R-P-S).
We evaluate the models' ability to identify these behavioural patterns by
calculating the average number of points earned per round.
The figures below present the average points earned per round and the
95% confidence interval for each LLM against the three opponent behaviour
models in the simplified version of the RPS game, whether the LLM generates a
strategy or one-shot actions. We observe that the performance of LLMs in action
generation, except for Mistral-Small when facing a constant strategy,
is barely better than a random strategy. The strategies generated by the
GPT-4.5 and Mistral-Small models predict the opponent's next
move based on previous rounds by identifying the most frequently played move.
While these strategies are effective against an opponent with a constant
behavior, they fail to predict the opponent's next move when the latter adopts a
more complex model. Neither Llama3 nor DeepSeek-R1 were able
to generate a valid strategy.
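The sketch below is our reconstruction of the kind of frequency-based predictor these generated strategies amount to (not the models' verbatim code): it guesses the opponent's most frequent past move, which works against a constant opponent but not against the two-step or three-step loops.

```python
from collections import Counter

def predict_next_move(history: list[str]) -> str:
    """Predict the opponent's next move as their most frequent past move."""
    if not history:
        return "Rock"                                  # arbitrary default on round 1
    return Counter(history).most_common(1)[0][0]

# Against a constant opponent the prediction is correct from round 2 onward...
print(predict_next_move(["Paper", "Paper", "Paper"]))             # Paper
# ...but against the R-P-S loop it keeps guessing the most frequent move,
# while the loop actually plays Paper next.
print(predict_next_move(["Rock", "Paper", "Scissors", "Rock"]))   # Rock
```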
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)
To assess the agents' ability to factor the prediction of their opponent's next
move into their decision-making, we analyse the performance of each generative
agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
and a loss 0 points.
## From belief to action
To evaluate the ability of LLMs to predict not only the opponent’s next move but also to act rationally
based on their prediction, we consider the Rock-Paper-Scissors (RPS) game.
RPS is a simultaneous, zero-sum game for two players.
The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock;
and if both players take the same action, the game is a tie. Scoring is as follows:
a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the optimal action
based on the opponent's move. Since the rules are simple and deterministic,
an LLM that correctly predicts its opponent's move can always select the winning action.
RPS therefore serves as a tool to assess an LLM's ability to identify and capitalize on
patterns in an opponent's non-random behavior; a minimal sketch of this belief-to-action
step is given after the list of patterns below.
For a fine-grained analysis of the ability of LLMs to identify an
opponent's patterns, we set up three simple opponent patterns:
1. the opponent's actions remaining constant (R, S, or P, respectively);
2. the opponent’s actions looping in a 2-step pattern (R-P, P-S, S-R);
3. the opponent’s actions looping in a 3-step pattern (R-P-S).
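In this full version of the game, the agent must not only predict the opponent's move but also play the counter-move. The sketch below (our own illustration) shows this belief-to-action step together with the 2/1/0 scoring rule:

```python
BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}  # value beats key

def best_response(predicted_opponent_move: str) -> str:
    """Play the move that beats the predicted opponent move."""
    return BEATS[predicted_opponent_move]

def score(my_move: str, opponent_move: str) -> int:
    """A win earns 2 points, a tie 1 point, and a loss 0 points."""
    if my_move == opponent_move:
        return 1
    return 2 if BEATS[opponent_move] == my_move else 0

# An agent that correctly predicts a constant 'Rock' opponent scores 2 each round.
print(score(best_response("Rock"), "Rock"))  # 2 (Paper beats Rock)
```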
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent’s patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs is barely better than that of a random strategy.
The figures below illustrate the average points earned per round along with
the 95% confidence interval for each LLM when facing constant strategies,
whether the model generates a full strategy or one-shot actions. The results
show that LLMs’ performance in action generation against a constant strategy is
only marginally better than a random strategy. While Mistral-Small can
accurately predict its opponent’s move, it fails to integrate this belief into
its decision-making process.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)
In summary, generative autonomous agents struggle to anticipate or effectively
incorporate other agents’ actions into their decision-making.
## Synthesis
Our results show that GPT-4.5, Llama3, and
Mistral-Small generally respect preferences but encounter more
difficulties in generating one-shot actions than in producing strategies in the
form of algorithms. GPT-4.5 and Mistral-Small generally adopt
rational behaviours of both first and second order, whereas Llama3,
despite generating random strategies, adapts better when producing one-shot
actions. In contrast, DeepSeek-R1 fails to develop valid strategies and
performs poorly in generating actions that align with preferences or rationality
principles. More critically, all the LLMs we evaluated struggle both to
anticipate other agents' actions and to integrate them effectively into their
decision-making process.
## Authors