Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate
the social behaviors of LLM-based agents.
## Dictator Game
The dictator game is a classic game used to analyze players' personal preferences.
In this game, there are two players: the dictator and the recipient. Given two allocation options,
the dictator chooses one allocation, while the recipient must accept the option chosen by the dictator.
Here, the dictator's choice is considered to reflect their personal preferences.
### Default preferences
The dictator's choice reflects the LLM's preferences.
The figure below presents a violin plot depicting the share of the total amount (\$100)
that the dictator allocates to themselves for each model.
The temperature is fixed at 0.7, and each experiment was conducted 30 times.
The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is 50.
When we prompt the models to generate a strategy in the form of an algorithm implemented
in the Python programming language, rather than generating an action, all models divide
the amount fairly except for GPT-4.5, which takes approximately 70% of the total amount for itself.
It is worth noting that, under these standard conditions, humans typically keep an average of around \$80
(Forsythe et al. 1994). It is interesting to note that the variability observed between different executions
of the same LLM is comparable to the diversity of behaviors observed in humans. In other words,
this intra-model variability can be used to simulate the diversity of human behaviors based on
their experiences, preferences, or contexts.
This prototype allows us to analyse the potential of Large Language Models (LLMs) for
social simulation by assessing their ability to: (a) make decisions aligned
with explicit preferences; (b) adhere to principles of rationality; and (c)
refine their beliefs to anticipate the actions of other agents. Through
game-theoretic experiments, we show that certain models, such as
`GPT-4.5` and `Mistral-Small`, exhibit consistent behaviours in
simple contexts but struggle with more complex scenarios requiring
anticipation of other agents' behaviour. Our study outlines research
directions to overcome the current limitations of LLMs.
## Preferences
To analyse the behaviour of generative agents based on their preferences, we
rely on the dictator game. This variant of the ultimatum game features a single
player, the dictator, who decides how to distribute an endowment (e.g., a sum of
money) between themselves and a second player, the recipient. The dictator has
complete freedom in this allocation, while the recipient, having no influence
over the outcome, takes on a passive role.
First, we evaluate the choices made by LLMs when playing the role of the
dictator, considering these decisions as a reflection of their intrinsic
preferences. Then, we subject them to specific instructions incorporating
preferences to assess their ability to consider them in their decisions.
### Preference Elicitation
Here, we consider that the choice of an LLM as a dictator reflects its intrinsic
preferences. Each LLM was asked to directly produce a one-shot action in the
dictator game. Additionally, we also asked the models to generate a strategy in
the form of an algorithm implemented in the Python language. In all our
experiments, one-shot actions are repeated 30 times, and the models' temperature
is set to 0.7.
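For illustration, a model-generated strategy typically takes the form of a short Python function returning the split. The sketch below is a hypothetical example of the fair-split strategy most models produce (the function name and signature are ours, not the models' verbatim output):

```python
def dictator_strategy(total_amount: int = 100) -> tuple[int, int]:
    """Hypothetical model-generated strategy for the one-shot dictator game."""
    my_share = total_amount // 2               # equal split, as most models choose
    return my_share, total_amount - my_share   # (dictator's share, recipient's share)

print(dictator_strategy())  # (50, 50)
```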
The figure below presents a violin plot illustrating the share of the
total amount ($100) that the dictator allocates to themselves for each model.
The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1
through one-shot decisions is 50.
![Violin Plot of My Share for Each Model](figures/dictator/dictator_violin.svg)
When we ask the models to generate a strategy rather than a one-shot action, all
models distribute the amount equally, except GPT-4.5, which retains
about 70% of the total amount. Interestingly, under these standard
conditions, humans typically keep $80 on average.
*[Fairness in Simple Bargaining Experiments](https://doi.org/10.1006/game.1994.1021)*
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M.
Games and Economic Behavior, 6(3), 347-369. 1994.
Unlike the deterministic strategies generated by LLMs, the intra-model variability in
generated actions can be used to simulate the diversity of human behaviours based
on their experiences, preferences, or contexts.
The figure below shows the evolution of the share of the total amount ($100) that the dictator allocates
to themselves as a function of temperature for each model, along with the 95% confidence interval,
when the models are asked to generate one-shot actions. Each experiment was conducted 30 times.
![My Share vs Temperature with Confidence Interval](figures/dictator/dictator_temperature.svg)
Our sensitivity analysis of the temperature parameter reveals that the portion
retained by the dictator remains stable. However, the decisions become more
deterministic at low temperatures, whereas allocation diversity increases at
high temperatures, reflecting a more random exploration of available options.
### Preference alignment
We define four preferences for the dictator:
1. She prioritizes her own interests, aiming to maximize her own income (selfish).
2. She prioritizes the other player’s interests, aiming to maximize their income (altruism).
3. She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
4. She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).
We consider 4 allocation options where money can be lost in the division, each corresponding to one of the four preferences:
1. The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
2. The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruism).
3. The dictator keeps 400, the other player receives 300, resulting in a 300 loss (utilitarian).
4. The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian).
The following table presents the accuracy of the dictator's decision for each model and preference,
regardless of whether the models were prompted to generate a strategy or specific actions.
The temperature is set to 0.7, and each experiment involving action generation was repeated 30 times.
| *Model* | *Generation* | *Selfish* | *Altruistic* | *Utilitarian* | *Egalitarian* |
|-----------------|--------------|-----------|--------------|---------------|---------------|
| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *llama3* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
| *deepseek-r1* | *strategy* | - | - | - | - |
| *gpt-4.5* | *actions* | 1.00 | 1.00 | 0.50 | 1.00 |
| *llama3* | *actions* | 1.00 | 0.90 | 0.40 | 0.73 |
| *mistral-small* | *actions* | 0.40 | 0.93 | 0.76 | 0.16 |
| *deepseek-r1* | *actions* | 0.06 | 0.20 | 0.76 | 0.03 |
This table helps assess the models’ ability to align with different preferences.
When models are explicitly prompted to generate strategies,
they exhibit perfect alignment with the predefined preferences except for DeepSeek-R1,
which does not generate valid code.
When models are prompted to generate actions, GPT-4.5 aligns well across all preferences but struggles with utilitarianism.
Llama3 performs well for selfish and altruistic preferences but shows weaker alignment for
utilitarian and egalitarian choices.
Mistral-Small aligns best with altruistic preferences and maintains moderate performance on utilitarianism,
but struggles with selfish and egalitarian preferences.
DeepSeek-R1 performs best for utilitarianism but has poor accuracy in the other categories.
Bad action selections can be explained either by arithmetic errors (e.g., wrongly assuming that 500 + 100 > 400 + 300)
or by misinterpretations of preferences (e.g., "I'm choosing to prioritize the common interest by keeping a
relatively equal split with the other player").
We define four preferences for the dictator, each corresponding to a distinct form of social welfare:
1. **Egoism** maximizes the dictator’s income.
2. **Altruism** maximizes the recipient’s income.
3. **Utilitarianism** maximizes total income.
4. **Egalitarianism** maximizes the minimum income between the players.
We consider four allocation options where part of the money is lost in the division process,
each corresponding to one of the four preferences (a sketch of how each preference selects among these options follows the list):
- The dictator keeps **$500**, the recipient receives **$100**, and a total of **$400** is lost (**egoistic**).
- The dictator keeps **$100**, the recipient receives **$500**, and **$400** is lost (**altruistic**).
- The dictator keeps **$400**, the recipient receives **$300**, resulting in a loss of **$300** (**utilitarian**).
- The dictator keeps **$325**, the recipient also receives **$325**, and **$350** is lost (**egalitarian**).
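As an illustration of how each preference singles out exactly one of these options, the following sketch (our own, not model output) scores the four allocations under each social-welfare criterion:

```python
# Allocation options: (dictator_share, recipient_share)
options = {
    "egoistic":    (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# Social-welfare criteria used to score an allocation (d, r)
criteria = {
    "egoistic":    lambda d, r: d,          # maximize the dictator's income
    "altruistic":  lambda d, r: r,          # maximize the recipient's income
    "utilitarian": lambda d, r: d + r,      # maximize the total income
    "egalitarian": lambda d, r: min(d, r),  # maximize the minimum income
}

# Each criterion selects the option designed for it; e.g. the utilitarian
# criterion picks (400, 300) because 400 + 300 = 700 exceeds 500 + 100 = 600.
for name, score in criteria.items():
    best = max(options, key=lambda option: score(*options[option]))
    print(f"{name:12s} -> {best}")
```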
The table below evaluates the ability of the models to align with different preferences.
- When generating **strategies**, the models align perfectly with preferences, except for **`DeepSeek-R1`**, which does not generate valid code.
- When generating **actions**, **`GPT-4.5`** aligns well with preferences but struggles with **utilitarianism**.
- **`Llama3`** aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
- **`Mistral-Small`** aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
- **`DeepSeek-R1`** primarily aligns with **utilitarianism** but has low accuracy in other preferences.
| **Model** | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
|---------------------|---------------|-------------|---------------|---------------|---------------|
| **`GPT-4.5`** | **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`Llama3`** | **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`Mistral-Small`**| **Strategy** | 1.00 | 1.00 | 1.00 | 1.00 |
| **`DeepSeek-R1`** | **Strategy** | - | - | - | - |
| **`GPT-4.5`** | **Actions** | 1.00 | 1.00 | 0.50 | 1.00 |
| **`Llama3`** | **Actions** | 1.00 | 0.90 | 0.40 | 0.73 |
| **`Mistral-Small`**| **Actions** | 0.40 | 0.93 | 0.76 | 0.16 |
| **`DeepSeek-R1`** | **Actions** | 0.06 | 0.20 | 0.76 | 0.03 |
Errors in action selection may stem from either arithmetic miscalculations
(e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or
misinterpretations of preferences. For example, the model `DeepSeek-R1`,
adopting utilitarian preferences, justifies its choice by stating, "I think
fairness is key here".
In summary, our results indicate that the models `GPT-4.5`,
`Llama3`, and `Mistral-Small` generally align well with
preferences but have more difficulty generating individual actions than
algorithmic strategies. In contrast, `DeepSeek-R1` does not generate
valid strategies and performs poorly when generating specific actions.
## Rationality
An autonomous agent is rational if she plays a best response to her beliefs.
She satisfies second-order rationality if she is rational and also believes that others are rational.
In other words, a second-order rational agent not only considers the best course of action for herself
but also anticipates how others make their decisions.
An autonomous agent is rational if it chooses the optimal action based on its
beliefs. This agent satisfies second-order rationality if it is rational and
believes that other agents are rational. In other words, a second-order rational
agent does not only consider the best choice for itself but also anticipates how
others make their decisions. Experimental game theory studies show that 93 % of
human subjects are rational, while 71 % exhibit second-order
rationality.
To assess players' first- and second-order rationality, we consider a simplified version of the
ring-network game introduced by Kneeland (2015), whose experiments demonstrate that 93% of the
subjects are rational, while 71% exhibit second-order rationality.
*[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)*
Kneeland, T.
Econometrica, 83(5), 2065-2079. 2015.
This game features two players, each with two available strategies, where
both players aim to maximize their own payoff.
The corresponding payoff matrix is shown below:
To evaluate the first- and second-order rationality of generative autonomous
agents, we consider a simplified version of the ring-network game,
which involves two players seeking to maximize their own payoff. Each player has
two available actions, and the payoff matrix is presented below.
| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X** | (15,10) | (5,5) |
| **Strategy Y** | (0,5) | (10,0) |
If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
If Player 1 is rational, she can choose either X or Y since X is the best response if she believes Player 2 will play A and
Y is the best response if she believes Player 2 will play B.
If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A and
since X is the best response to A, Player 1 will choose X.
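The reasoning above can be checked mechanically. The sketch below (our own illustration) verifies that B is strictly dominated for Player 2 and derives Player 1's second-order rational action from the payoff matrix:

```python
# Payoffs (Player 1, Player 2) in the simplified ring-network game
payoffs = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

# First-order rationality of Player 2: B is strictly dominated by A
b_dominated = all(payoffs[(x, "A")][1] > payoffs[(x, "B")][1] for x in ("X", "Y"))
print("B strictly dominated by A:", b_dominated)  # True (10 > 5 and 5 > 0)

# Second-order rationality of Player 1: believing Player 2 plays A,
# choose the best response to A.
best_response = max(("X", "Y"), key=lambda x: payoffs[(x, "A")][0])
print("Second-order rational action for Player 1:", best_response)  # X (15 > 0)
```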
We consider three types of belief:
- *implicit* belief, where the optimal action must be inferred from the natural language description of the payoff matrix;
- *explicit* belief, which focuses on the analysis of Player 2's actions: the fact that Strategy B is strictly dominated by Strategy A is provided in the prompt;
- *given* belief, where the optimal action for Player 1 is explicitly stated in the prompt.
### First order rationality
The models evaluated include GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1.
The results indicate how well each model performs under each belief type.
| *Model* | *Generation* | *Given* | *Explicit* | *Implicit* |
|-----------------|--------------|---------|------------|------------|
| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *strategy* | 1.00 | 1.00 | 1.00 |
| *llama3*        | *strategy*   | 0.50    | 0.50       | 0.50       |
| *deepseek-r1* | *strategy* | - | - | - |
| *gpt-4.5* | *actions* | 1.00 | 1.00 | 1.00 |
| *mistral-small* | *actions* | 1.00 | 1.00 | 0.87 |
| *llama3* | *actions* | 1.00 | 0.90 | 0.17 |
| *deepseek-r1* | *actions* | 0.83 | 0.57 | 0.60 |
When the models generate strategies instead of selecting individual actions, GPT-4.5 and
Mistral-Small exhibit rational behaviour, while Llama3 uses a random strategy.
DeepSeek-R1 does not generate valid code.
When the models generate individual actions instead of a strategy,
GPT-4.5 achieves a perfect score across all belief types,
demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition.
Mistral-Small consistently outperforms the other open-weight models across all belief types.
Its strong performance with implicit belief indicates that it can effectively
deduce the optimal action from the payoff matrix description.
Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
suggesting it may struggle to infer optimal actions solely from natural language descriptions.
DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
indicating it may not be as good a candidate for simulating rationality as the other models.
### Second-order rationality
To adjust the difficulty of taking the optimal
action, we consider four versions of the player's payoff matrix:
- (a) the original setup;
- (b) we reduce the difference in payoffs;
- (c) we increase the expected payoff for the incorrect choice Y;
- (d) we decrease the expected payoff for the correct choice X.
| **Action \ Opponent Action (version)** | **A(a)** | **B(a)** | | **A(b)** | **B(b)** | | **A(c)** | **B(c)** | | **A(d)** | **B(d)** |
|----------------------------------------|----------|----------|-|----------|----------|-|----------|----------|-|----------|----------|
| **X** | 15 | 5 | | 8 | 7 | | 6 | 5 | | 15 | 5 |
| **Y** | 0 | 10 | | 7 | 8 | | 0 | 10 | | 0 | 40 |
| Model | Generation | Given (a) | Explicit (a) | Implicit (a) | Given (b) | Explicit (b) | Implicit (b) | Given (c) | Explicit (c) | Implicit (c) | Given (d) | Explicit (d) | Implicit (d) |
|---------------|------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| llama3 | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| mistral-small | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| deepseek-r1 | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| gpt-4.5 | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| llama3 | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| mistral-small | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| deepseek-r1 | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
When the models generate strategies, GPT-4.5 performs perfectly in setups (a), (c), and (d) but
fails in setup (b) to differentiate the optimal strategy from a near-optimal one.
Llama3 adopts a random approach to decision-making rather than a structured understanding of rationality.
Mistral-Small consistently achieves a 100% success rate across all setups, demonstrating robust reasoning abilities.
DeepSeek-R1 does not produce valid responses, further reinforcing that it may not be a viable candidate
for generating rational strategies.
When they generate individual actions, GPT-4.5 achieves perfect performance in the standard setup (a) but struggles significantly with implicit belief
when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward,
it is confused by the altered payoffs.
Llama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types
and adjusted payoff matrices.
Mistral-Small, while performing well with given and explicit beliefs, faces challenges with implicit belief, particularly in version (d).
DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.
## Belief
To evaluate the ability of LLMs to refine their beliefs by predicting the opponent's next move,
we consider a simplified version of the Rock-Paper-Scissors game.
Rules:
1. The opponent follows a hidden strategy (repeating pattern).
2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
3. A correct guess earns 1 point, and an incorrect guess earns 0 points.
4. The game can run for N rounds, and the player's accuracy is evaluated at each round, as sketched below.
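A minimal sketch of this evaluation loop, assuming a three-step repeating opponent (R-P-S) and a naive last-move predictor as a placeholder for the LLM (names and structure are ours):

```python
import itertools

# Hidden opponent strategy: a repeating three-step pattern R-P-S
opponent = itertools.cycle(["Rock", "Paper", "Scissors"])

def naive_predictor(history):
    """Placeholder player: predicts that the opponent repeats its last move."""
    return history[-1] if history else "Rock"

history, points = [], 0
for _ in range(10):                        # a game of 10 rounds
    prediction = naive_predictor(history)
    actual = next(opponent)
    points += int(prediction == actual)    # 1 point per correct guess
    history.append(actual)

print(f"Average points per round: {points / 10:.2f}")
```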
We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns, regardless of whether the models were prompted to generate
a strategy or specific actions. The 95% confidence interval is also shown.
We find that the action generation performance of LLMs, whether proprietary or open-weight, is
only marginally better than a random strategy.
The strategies generated by GPT-4.5 and Mistral-Small predict the opponent's next
move based on past rounds by identifying the opponent's most frequent move. While this strategy
is effective against constant behavior, it fails to predict the opponent's next move when the opponent
adopts a more complex pattern. Neither Llama3 nor DeepSeek-R1 were able to generate a valid strategy.
If Player 2 is rational, they must choose A because B is strictly dominated. If
Player 1 is rational, they may choose either X or Y: X is the best response if
Player 1 believes that Player 2 will choose A, while Y is the best response if
Player 1 believes that Player 2 will choose B. If Player 1 satisfies
second-order rationality, they must play X. To neutralize biases in large
language models (LLMs) related to the naming of actions, we reverse the action
names in half of the experiments.
We consider three types of beliefs:
- an *implicit belief*, where the optimal action must be deduced from
the natural language description of the payoff matrix;
- an *explicit belief*, based on the analysis of player 2's actions, meaning that
the fact that B is strictly dominated by A is provided in the prompt;
- a *given belief*, where the optimal action for player 1 is explicitly given in the prompt.
We first evaluate the rationality of the agents and then their second-order rationality.
### First Order Rationality
The table below evaluates the models' ability to generate rational
behaviour for Player 2.
| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** |
|--------------------|--------------|----------|------------|------------|
| `gpt-4.5` | strategy | 1.00 | 1.00 | 1.00 |
| `mistral-small` | strategy | 1.00 | 1.00 | 1.00 |
| `llama3` | strategy | 0.50 | 0.50 | 0.50 |
| `deepseek-r1` | strategy | - | - | - |
| **—** | **—** | **—** | **—** | **—** |
| `gpt-4.5` | actions | 1.00 | 1.00 | 1.00 |
| `mistral-small` | actions | 1.00 | 1.00 | 0.87 |
| `llama3` | actions | 1.00 | 0.90 | 0.17 |
| `deepseek-r1` | actions | 0.83 | 0.57 | 0.60 |
When generating strategies, GPT-4.5 and Mistral-Small exhibit
rational behaviour, whereas Llama3 adopts a random strategy.
DeepSeek-R1 fails to generate valid output. When generating actions,
GPT-4.5 demonstrates its ability to make rational decisions, even with
implicit beliefs. Mistral-Small outperforms other open-weight models.
Llama3 struggles to infer optimal actions based solely on implicit
beliefs. DeepSeek-R1 is not a good candidate for simulating
rationality.
### Second-Order Rationality
To adjust the difficulty of optimal decision-making, we define four variants of
the payoff matrix for player 1 in Table below: (a) the
original configuration, (b) the reduction of the gap between the gains, (c) the
increase in the gain for the bad choice Y, and (d) the decrease in the gain for
the good choice X.
| **Player 1 \ Player 2 (version)** | **A (a)** | **B (a)** | **A (b)** | **B (b)** | **A (c)** | **B (c)** | **A (d)** | **B (d)** |
|-----------------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| **X** | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
| **Y** | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
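As a quick check of what each variant demands (our own sketch, not model output), the second-order rational action remains X throughout: if Player 1 believes Player 2 plays A, X yields the higher payoff in every variant, even though variant (b) narrows the margin to a single point and variant (d) makes the payoff of Y against B very tempting.

```python
# Player 1's payoffs (against A, against B) in each variant of the matrix
variants = {
    "a": {"X": (15, 5), "Y": (0, 10)},
    "b": {"X": (8, 7),  "Y": (7, 8)},
    "c": {"X": (6, 5),  "Y": (0, 10)},
    "d": {"X": (15, 5), "Y": (0, 40)},
}

# A second-order rational Player 1 believes Player 2 plays A (first entry)
for name, rows in variants.items():
    best = max(rows, key=lambda action: rows[action][0])
    print(f"variant ({name}): best response to A is {best}")  # X in every variant
```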
The table below evaluates the models' ability to generate second-order
rational behaviour for player 1.
When the models generate strategies, GPT-4.5 exhibits second-order
rational behaviour in configurations (a), (c), and (d), but fails in
configuration (b) to distinguish the optimal action from a nearly optimal one.
Llama3 makes its decision randomly. Mistral-Small shows strong
capabilities in generating second-order rational behaviour. DeepSeek-R1
does not produce valid responses.
When generating actions, Llama3 adapts to different types of beliefs
and adjustments in the payoff matrix. GPT-4.5 performs well in the
initial configuration (a), but encounters significant difficulties when the
payoff structure changes (b, c, d), particularly with implicit beliefs. Although
Mistral-Small works well with given or explicit beliefs, it faces
difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
| **Model** | **Generation** | **Given (a)** | **Explicit (a)** | **Implicit (a)** | **Given (b)** | **Explicit (b)** | **Implicit (b)** | **Given (c)** | **Explicit (c)** | **Implicit (c)** | **Given (d)** | **Explicit (d)** | **Implicit (d)** |
|-----------|----------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|
| **gpt-4.5** | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **llama3** | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| **mistral-small** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
Mistral-Small model with given beliefs justifies its poor decision as
follows: "Since player 2 is rational and A strictly dominates B, player 2 will
choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
(40). Therefore, choosing Y maximizes my gain."
In summary, the results indicate that GPT-4.5 and
Mistral-Small generally adopt first- and second-order rational
behaviours. However, GPT-4.5 struggles to distinguish an optimal action
from a nearly optimal one, while Mistral-Small encounters difficulties
with implicit beliefs. Llama3 generates strategies randomly but adapts
better when producing specific actions. In contrast, DeepSeek-R1 fails
to provide valid strategies and generates irrational actions.
## Beliefs
Beliefs — whether implicit, explicit, or
given — are crucial for an autonomous agent's decision-making process. They
allow for anticipating the actions of other agents.
To assess the agents' ability to refine their beliefs in predicting their
interlocutor's next action, we consider a simplified version of the
Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repeating pattern;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for $N$ rounds, and the player's accuracy is evaluated at each round.
For our experiments, we consider three simple behaviour models for the opponent:
- the opponent's actions remain constant (R, S, or P, respectively);
- the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
- the opponent's actions follow a three-step loop model (R-P-S).
We evaluate the models' ability to identify these behavioural patterns by
calculating the average number of points earned per round.
The figures below present the average points earned per round and the
95% confidence interval for each LLM against the three opponent behaviour
models in the simplified version of the RPS game, whether the LLM generates a
strategy or one-shot actions. We observe that the performance of LLMs in action
generation, except for Mistral-Small when facing a constant strategy,
is barely better than a random strategy. The strategies generated by the
GPT-4.5 and Mistral-Small models predict the opponent's next
move based on previous rounds by identifying the most frequently played move.
While these strategies are effective against an opponent with a constant
behavior, they fail to predict the opponent's next move when the latter adopts a
more complex model. Neither Llama3 nor DeepSeek-R1 were able
to generate a valid strategy.
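The sketch below is our reconstruction of the kind of frequency-based predictor these generated strategies amount to (not the models' verbatim code): it guesses the opponent's most frequent past move, which works against a constant opponent but not against the two-step or three-step loops.

```python
from collections import Counter

def predict_next_move(history: list[str]) -> str:
    """Predict the opponent's next move as their most frequent past move."""
    if not history:
        return "Rock"                                  # arbitrary default on round 1
    return Counter(history).most_common(1)[0][0]

# Against a constant opponent the prediction is correct from round 2 onward...
print(predict_next_move(["Paper", "Paper", "Paper"]))             # Paper
# ...but against the R-P-S loop it keeps guessing the most frequent move,
# while the loop actually plays Paper next.
print(predict_next_move(["Rock", "Paper", "Scissors", "Rock"]))   # Rock
```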
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)
To assess the agents' ability to factor the prediction of their opponent's next
move into their decision-making, we analyse the performance of each generative
agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
and a loss 0 points.
## From belief to action
To evaluate the ability of LLMs to predict not only the opponent’s next move but also to act rationally
based on their prediction, we consider the Rock-Paper-Scissors (RPS) game.
RPS is a simultaneous, zero-sum game for two players.
The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock;
and if both players take the same action, the game is a tie. Scoring is as follows:
a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the optimal action
based on the opponent's move. Since the rules are simple and deterministic,
an LLM that correctly predicts its opponent's move can always select the winning action.
RPS therefore serves as a tool to assess an LLM's ability to identify and capitalize on
patterns in an opponent's non-random behavior; a minimal sketch of this belief-to-action
step is given after the list of patterns below.
For a fine-grained analysis of the ability of LLMs to identify an
opponent's patterns, we set up three simple opponent patterns:
1. the opponent's actions remaining constant (R, S, or P, respectively);
2. the opponent’s actions looping in a 2-step pattern (R-P, P-S, S-R);
3. the opponent’s actions looping in a 3-step pattern (R-P-S).
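In this full version of the game, the agent must not only predict the opponent's move but also play the counter-move. The sketch below (our own illustration) shows this belief-to-action step together with the 2/1/0 scoring rule:

```python
BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}  # value beats key

def best_response(predicted_opponent_move: str) -> str:
    """Play the move that beats the predicted opponent move."""
    return BEATS[predicted_opponent_move]

def score(my_move: str, opponent_move: str) -> int:
    """A win earns 2 points, a tie 1 point, and a loss 0 points."""
    if my_move == opponent_move:
        return 1
    return 2 if BEATS[opponent_move] == my_move else 0

# An agent that correctly predicts a constant 'Rock' opponent scores 2 each round.
print(score(best_response("Rock"), "Rock"))  # 2 (Paper beats Rock)
```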
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent’s patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs is barely better than that of a random strategy.
The figures below illustrate the average points earned per round along with
the 95% confidence interval for each LLM when facing constant strategies,
whether the model generates a full strategy or one-shot actions. The results
show that LLMs’ performance in action generation against a constant strategy is
only marginally better than a random strategy. While Mistral-Small can
accurately predict its opponent’s move, it fails to integrate this belief into
its decision-making process.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)
In summary, generative autonomous agents struggle to anticipate or effectively
incorporate other agents’ actions into their decision-making.
## Synthesis
Our results show that GPT-4.5, Llama3, and
Mistral-Small generally respect preferences but encounter more
difficulties in generating one-shot actions than in producing strategies in the
form of algorithms. GPT-4.5 and Mistral-Small generally adopt
rational behaviours of both first and second order, whereas Llama3,
despite generating random strategies, adapts better when producing one-shot
actions. In contrast, DeepSeek-R1 fails to develop valid strategies and
performs poorly in generating actions that align with preferences or rationality
principles. More critically, all the LLMs we evaluated struggle both to
anticipate other agents' actions and to integrate them effectively into their
decision-making process.
## Authors