diff --git a/README.md b/README.md
index 6b6d7635d4a6f6f7e0784f8e0519b4f9d936bdbf..3e933b4cd0c3c89dfb94cabf674be048448a10da 100644
--- a/README.md
+++ b/README.md
@@ -3,235 +3,282 @@
Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.

-## Dictator Game
-
-The dictator game is a classic game which is used to analyze players' personal preferences.
-In this game, there are two players: the dictator and the recipient. Given two allocation options,
-the dictator needs to take action, choosing one allocation,
-while the recipient must accept the option chosen by the dictator.
-Here, the dictator’s choice is considered to reflect the personal preference.
-
-### Default preferences
-
-The dictator’s choice reflect the LLM's preference.
-
-The figure below presents a violin plot depicting the share of the total amount (\$100)
-that the dictator allocates to themselves for each model.
-The temperature is fixed at 0.7, and each experiment was conducted 30 times.
-The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is 50.
-When we prompt the models to generate a strategy in the form of an algorithm implemented
-in the Python programming language, rather than generating an action, all models divide
-the amount fairly except for GPT-4.5, which takes approximately 70% of the total amount for itself.
-It is worth noticing that, under these standard conditions, humans typically keep an average of around \$80
-(Fortsythe et al. 1994). It is interesting to note that the variability observed between different executions
-in the responses of the same LLM is comparable to the diversity of behaviors observed in humans. In other words,
-this intra-model variability can be used to simulate the diversity of human behaviors based on
-their experiences, preferences, or context.
+This prototype allows us to analyse the potential of Large Language Models (LLMs) for
+social simulation by assessing their ability to: (a) make decisions aligned
+with explicit preferences; (b) adhere to principles of rationality; and (c)
+refine their beliefs to anticipate the actions of other agents. Through
+game-theoretic experiments, we show that certain models, such as
+`GPT-4.5` and `Mistral-Small`, exhibit consistent behaviours in
+simple contexts but struggle with more complex scenarios requiring
+anticipation of other agents' behaviour. Our study outlines research
+directions to overcome the current limitations of LLMs.
+
+
+## Preferences
+
+To analyse the behaviour of generative agents based on their preferences, we
+rely on the dictator game. This variant of the ultimatum game features a single
+player, the dictator, who decides how to distribute an endowment (e.g., a sum of
+money) between themselves and a second player, the recipient. The dictator has
+complete freedom in this allocation, while the recipient, having no influence
+over the outcome, takes on a passive role.
+
+First, we evaluate the choices made by LLMs when playing the role of the
+dictator, considering these decisions as a reflection of their intrinsic
+preferences. Then, we subject them to specific instructions incorporating
+preferences to assess their ability to consider them in their decisions.
+
+### Preference Elicitation
+
+Here, we consider that the choice of an LLM as a dictator reflects its intrinsic
+preferences. Each LLM was asked to directly produce a one-shot action in the
+dictator game. Additionally, we also asked the models to generate a strategy in
+the form of an algorithm implemented in the Python language. In all our
+experiments, one-shot actions are repeated 30 times, and the models' temperature
+is set to 0.7.
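+
+For illustration, here is a minimal sketch (ours, not an actual model output) of the
+kind of strategy the models are asked to produce: a Python function that returns the
+dictator's allocation of the endowment. The function name and the equal split are
+assumptions made for the example.
+
+```python
+def dictator_strategy(endowment: int = 100) -> tuple[int, int]:
+    """Return (dictator_share, recipient_share) for the one-shot dictator game."""
+    dictator_share = endowment // 2           # an equal split of the endowment
+    recipient_share = endowment - dictator_share
+    return dictator_share, recipient_share
+
+print(dictator_strategy())  # -> (50, 50)
+```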
+
+The figure below presents a violin plot illustrating the share of the
+total amount ($100) that the dictator allocates to themselves for each model.
+The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1
+through one-shot decisions is 50.
+
+
+When we ask the models to generate a strategy rather than a one-shot action, all
+models distribute the amount equally, except GPT-4.5, which retains
+about 70% of the total amount. Interestingly, under these standard
+conditions, humans typically keep $80 on average.
*[Fairness in Simple Bargaining Experiments](https://doi.org/10.1006/game.1994.1021)* Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. Games and Economic Behavior, 6(3), 347-369. 1994.
-
+Unlike the deterministic strategies generated by LLMs, the intra-model variability in
+generated actions can be used to simulate the diversity of human behaviours based
+on their experiences, preferences, or contexts.

-The figure below represents the evolution of the share of the total amount ($100) that the dictator allocates
-to themselves as a function of temperature for each model, along with the 95% confidence interval.
-Each experiment was conducted 30 times. It can be observed that temperature influences the variability
-of the models' decisions. At low temperatures, choices are more deterministic and follow a stable trend,
-whereas at high temperatures, the diversity of allocations increases,
-reflecting a more random exploration of the available options.
+The figure below illustrates the evolution of the dictator's share
+as a function of temperature with a 95% confidence interval when we ask each
+model to generate decisions.

![Evolution of temperature](figures/dictator/dictator_temperature.svg)

+Our sensitivity analysis of the temperature parameter reveals that the portion
+retained by the dictator remains stable. However, the decisions become more
+deterministic at low temperatures, whereas allocation diversity increases at
+high temperatures, reflecting a more random exploration of available options.
+
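+
+The sketch below illustrates this elicitation protocol: 30 one-shot decisions are
+sampled per model at temperature 0.7 and aggregated, which is the data summarised by
+the violin plot above. The `query_dictator_share` helper is hypothetical (simulated
+here with random noise); the client code actually used in the experiments may differ.
+
+```python
+import random
+import statistics
+
+def query_dictator_share(model: str, temperature: float = 0.7) -> int:
+    """Hypothetical helper: prompt `model` once and parse the share (0-100)
+    that the dictator keeps for itself. Simulated here for illustration."""
+    return max(0, min(100, round(random.gauss(50, 10))))
+
+def elicit_preferences(model: str, n_runs: int = 30) -> list[int]:
+    # One-shot decisions are repeated 30 times at a fixed temperature of 0.7.
+    return [query_dictator_share(model, temperature=0.7) for _ in range(n_runs)]
+
+shares = elicit_preferences("llama3")
+print(statistics.median(shares))  # median share kept by the dictator
+```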

### Preference alignment

-We define four preferences for the dictator:
-1. She prioritizes her own interests, aiming to maximize her own income (selfish).
-2. She prioritizes the other player’s interests, aiming to maximize their income (altruism).
-3. She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
-4. She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).
-
-We consider 4 allocation options where money can be lost in the division, each corresponding to one of the four preferences:
-1. The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
-2. The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruism).
-3. The dictator keeps 400, the other player receives 300, resulting in a 300 loss (utilitarian)
-4. The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian)
-
-The following table presents the accuracy of the dictator's decision for each model and preference,
-regardless of whether the models were prompted to generate a strategy or specific actions.
-The temperature is set to 0.7, and each experiment involving action generation was repeated 30 times.
-
-| *Model* | *Generation* | *SELFISH* | *ALTRUISTIC* | *UTILITARIAN* | *EGALITARIAN* |
-|-----------------| ------------- | ------------- | -------------- | ---------------- | ---------------- |
-| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 | 1.00 |
-| *llama3* | *actions* | 1.00 | 1.00 | 1.00 | 1.00 |
-| *mistral-small* | *actions* | 1.00 | 1.00 | 1.00 | 1.00 |
-| *deepseek-r1 | *actions* | - | - | - | - |
-|-----------------|---------------|---------------|----------------|------------------|------------------|
-| *gpt-4.5* | *actions* | 1.00 | 1.00 | 0.50 | 1.00 |
-| *llama3* | *actions* | 1.00 | 0.90 | 0.40 | 0.73 |
-| *mistral-small* | *actions* | 0.40 | 0.93 | 0.76 | 0.16 |
-| *deepseek-r1 | *actions* | 0.06 | 0.20 | 0.76 | 0.03 |
-
-
-This table helps assess the models’ ability to align with different preferences.
-When models are explicitly prompted to generate strategies,
-they exhibit perfect alignment with the predefined preferences except for DeepSeek-R1,
-which does not generate valid code.
-When models are prompted to generate actions, GPT-4.5 consistently aligns well across all preferences
-but struggles with utilitarianism when generating actions.
-Llama3 performs well for selfish and altruistic preferences but shows weaker alignment for
-utilitarian and egalitarian choices.
-Mistral-small aligns best with altruistic preferences and maintains moderate performance on utilitarianism,
-but struggles with selfish and egalitarian preferences.
-Deepseek-r1 performs best for utilitarianism but has poor accuracy in other categories.
-
-Bad action selections can be explained either by arithmetic errors (e.g., it is not the case that 500 + 100 > 400 + 300)
-or by misinterpretations of preferences (e.g., ‘I’m choosing to prioritize the common interest by keeping a
-relatively equal split with the other player’).
+We define four preferences for the dictator, each corresponding to a distinct form of social welfare:
+
+1. **Egoism** maximizes the dictator's income.
+2. **Altruism** maximizes the recipient's income.
+3. **Utilitarianism** maximizes total income.
+4. **Egalitarianism** maximizes the minimum income between the players.
+
+We consider four allocation options where part of the money is lost in the division process,
+each corresponding to one of the four preferences (the sketch after the list makes this
+correspondence explicit):
+
+- The dictator keeps **$500**, the recipient receives **$100**, and a total of **$400** is lost (**egoistic**).
+- The dictator keeps **$100**, the recipient receives **$500**, and **$400** is lost (**altruistic**).
+- The dictator keeps **$400**, the recipient receives **$300**, resulting in a loss of **$300** (**utilitarian**).
+- The dictator keeps **$325**, the recipient also receives **$325**, and **$350** is lost (**egalitarian**).
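+
+To make the correspondence concrete, the sketch below (our own illustration, not code
+from the experiments) computes the four social-welfare criteria for each allocation
+option; each option is optimal under exactly one criterion, and the utilitarian case
+also shows the arithmetic the models sometimes get wrong (500 + 100 < 400 + 300).
+
+```python
+# (dictator, recipient) payoffs for the four allocation options
+options = {
+    "egoistic":    (500, 100),
+    "altruistic":  (100, 500),
+    "utilitarian": (400, 300),
+    "egalitarian": (325, 325),
+}
+
+# Social-welfare criterion associated with each preference
+welfare = {
+    "egoism":         lambda d, r: d,          # dictator's own income
+    "altruism":       lambda d, r: r,          # recipient's income
+    "utilitarianism": lambda d, r: d + r,      # total income
+    "egalitarianism": lambda d, r: min(d, r),  # minimum income
+}
+
+for preference, value in welfare.items():
+    best = max(options, key=lambda name: value(*options[name]))
+    print(f"{preference}: best option is {best}")
+# egoism -> egoistic, altruism -> altruistic,
+# utilitarianism -> utilitarian (400 + 300 = 700 > 600), egalitarianism -> egalitarian
+```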
+
+The table below evaluates the ability of the models to align with different preferences.
+
+- When generating **strategies**, the models align perfectly with preferences, except for **`DeepSeek-R1`**, which does not generate valid code.
+- When generating **actions**, **`GPT-4.5`** aligns well with preferences but struggles with **utilitarianism**.
+- **`Llama3`** aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
+- **`Mistral-Small`** aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
+- **`DeepSeek-R1`** primarily aligns with **utilitarianism** but has low accuracy in other preferences.
+
+| **Model**           | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
+|---------------------|----------------|--------------|----------------|-----------------|-----------------|
+| **`GPT-4.5`**       | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
+| **`Llama3`**        | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
+| **`Mistral-Small`** | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
+| **`DeepSeek-R1`**   | **Strategy**   | -            | -              | -               | -               |
+| **`GPT-4.5`**       | **Actions**    | 1.00         | 1.00           | 0.50            | 1.00            |
+| **`Llama3`**        | **Actions**    | 1.00         | 0.90           | 0.40            | 0.73            |
+| **`Mistral-Small`** | **Actions**    | 0.40         | 0.93           | 0.76            | 0.16            |
+| **`DeepSeek-R1`**   | **Actions**    | 0.06         | 0.20           | 0.76            | 0.03            |
+
+Errors in action selection may stem from either arithmetic miscalculations
+(e.g., the model incorrectly assumes that 500 + 100 > 400 + 300) or
+misinterpretations of preferences. For example, the model `DeepSeek-R1`,
+adopting utilitarian preferences, justifies its choice by stating, "I think
+fairness is key here".
+
+In summary, our results indicate that the models `GPT-4.5`,
+`Llama3`, and `Mistral-Small` generally align well with
+preferences but have more difficulty generating individual actions than
+algorithmic strategies. In contrast, `DeepSeek-R1` does not generate
+valid strategies and performs poorly when generating specific actions.

## Rationality

-An autonomous agent is rational if she plays a best response to her beliefs.
-She satisfies second-order rationality if she is rational and also believes that others are rational.
-In other words, a second-order rational agent not only considers the best course of action for herself
-but also anticipates how others make their decisions.
+An autonomous agent is rational if it chooses the optimal action based on its
+beliefs. This agent satisfies second-order rationality if it is rational and
+believes that other agents are rational. In other words, a second-order rational
+agent not only considers the best choice for itself but also anticipates how
+others make their decisions. Experimental game theory studies show that 93% of
+human subjects are rational, while 71% exhibit second-order rationality.

-To assess players’ first- and second-order rationality, we consider a simplified version of the
-ring-network game introduced by Kneeland (2015). His experiments conduct by Kneeland (2015)
-demonstrate that 93% of the subjects are rational, while 71% exhibit second-order rationality.
+*[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)* Kneeland, T. Econometrica, 83(5), 2065-2079. 2015.

-**[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)**
-Terri Kneeland (2015) Published in *Econometrica*, Volume 83, Issue 5, Pages 2065-2079
-DOI: [10.3982/ECTA11983](https://doi.org/10.3982/ECTA11983)
-
-This game features two players, each with two available strategies, where
-both players aim to maximize their own payoff.
-The corresponding payoff matrix is shown below:
+To evaluate the first- and second-order rationality of generative autonomous
+agents, we consider a simplified version of the ring-network game,
+which involves two players seeking to maximize their own payoff.
+Each player has two available actions, and the payoff matrix is presented below:

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X**      | (15,10)    | (5,5)     |
| **Strategy Y**      | (0,5)      | (10,0)    |
-
-If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
-If Player 1 is rational, she can choose either X or Y since X is the best response if she believes Player 2 will play A and
-Y is the best response if she believes Player 2 will play B.
-If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
-This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A and
-since X is the best response to A, Player 1 will choose X.
-
-We establish three types of belief:
-- *implicit* belief: The optimal action must be inferred from the natural language description of the payoff matrix.
-- *explicit* belief: This belief focuses on analyzing Player 2’s actions, where Strategy B is strictly dominated by Strategy A.
-- *given* belief: The optimal action for Player 1 is explicitly stated in the prompt.
-
-We set up three forms of belief:
-- *implicit* belief where the optimal action must be deduced from the description
-  of the payoff matrix in natural language;
-- *explicit* belief which analyze actions of Player 2 (B is strictly dominated by A).
-- *given* belief* where optimal action of Player 1is explicitly provided in the prompt;
-
-### First order rationality
-
-The models evaluated include Gpt-4.5-preview-2025-02-27, Mistral-Small, Llama3, and DeepSeek-R1.
-The results indicate how well each model performs under each belief type.
-
-| *Model* | *Generation* | *Given* | *Explicit* | *Implicit* |
-|-----------------|--------------|---------|------------|------------|
-| *gpt-4.5* | *strategy* | 1.00 | 1.00 | 1.00 |
-| *mistral-small* | *strategy* | 1.00 | 1.00 | 1.00 |
-| *llama3* | *strategy* | 0.5 | 0.5 | 0.5 |
-| *deepseek-r1* | *strategy* | - | - | - |
-| *gpt-4.5* | *actions* | 1.00 | 1.00 | 1.00 |
-| *mistral-small* | *actions* | 1.00 | 1.00 | 0.87 |
-| *llama3* | *actions* | 1.00 | 0.90 | 0.17 |
-| *deepseek-r1* | *actions* | 0.83 | 0.57 | 0.60 |
-
-When the models generate strategies instead of selecting individual actions, GPT-4.5 and
-Mistral-Small exhibit a rational behaviour while Llama3 use a random strategy.
-DeepSeek-R1 does not generate valid code.
-When the models generates individual actions instead of a strategy,
-GPT-4.5 achieves a perfect score across all belief types,
-demonstrating an exceptional ability to take rational decisions, even in the implicit belief condition.
-Mistral-Small consistently outperforms the other open-weight models across all belief types.
-Its strong performance with implicit belief indicates that it can effectively
-deduce the optimal action from the payoff matrix description.
-Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
-suggesting it may struggle to infer optimal actions solely from natural language descriptions.
-DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
-indicating it may not be a good candidate to simulate rationality as the other models.
-
-### Second-order rationality
-
-In order to adjust the difficulty of taking the optimal
-action, we consider 4 versions of the player’s payoff matrix:
-- a.
is the original setup; -- b. we reduce the difference in payoffs; -- c. we increase the expected payoff for the incorrect choice Y -- d. we decrease the expected payoff for the correct choice X. - -| **Action \ Opponent Action (version)** | **A(a)** | **B(a)** | | **A(b)** | **B(b)** | | **A(c)** | **B(c)** | | **A(d)** | **B(d)** | -|----------------------------------------|----------|----------|-|----------|----------|-|----------|----------|-|----------|----------| -| **X** | 15 | 5 | | 8 | 7 | | 6 | 5 | | 15 | 5 | -| **Y** | 0 | 10 | | 7 | 8 | | 0 | 10 | | 0 | 40 | - - - -| Model | Generation | Given (a) | Explicit (a) | Implicit (a) | | Given (b) | Explicit (b) | Implicit (b) | | Given (c) | Explicit (c) | Implicit (c) | | Given (d) | Explicit (d) | Implicit (d) | -|---------------|--------------|-------------|----------------|----------------|-|-------------|----------------|----------------|--|-------------|----------------|----------------|--|-------------|----------------|----------------| -| gpt4-.5 | strategy | 1.00 | 1.00 | 1.00 | | 0.00 | 0.00 | 0.00 | | 1.00 | 1.OO | 1.00 | | 1.00 | 1.00 | 1.00 | -| llama3 | strategy | 0.50 | 0.50 | 0.50 | | 0.50 | 0.50 | 0.50 | | 0.50 | 0.50 | 0.50 | | 0.50 | 0.50 | 0.50 | -| mistral-small | strategy | 1.00 | 1.00 | 1.00 | | 1.00 | 1.00 | 1.00 | | 1.00 | 1.00 | 1.00 | | 1.00 | 1.00 | 1.00 | -| deepseek-r1 | strategy | - | - | - | | - | - | - | | - | - | - | | - | - | - | -|---------------| ------------ | ----------- | -------------- | -------------- |-| ----------- | -------------- | -------------- |--| ----------- | -------------- | -------------- |--| ----------- | -------------- | -------------- | -| gpt4-.5 | actions | 1.00 | 1.00 | 1.00 | | 1.00 | 0.67 | 0.00 | | 0.86 | 0.83 | 0.00 | | 0.50 | 0.90 | 0.00 | -| llama3 | actions | 0.97 | 1.00 | 1.00 | | 0.77 | 0.80 | 0.60 | | 0.97 | 0.90 | 0.93 | | 0.83 | 0.90 | 0.60 | -| mistral-small | actions | 0.93 | 0.97 | 1.00 | | 0.87 | 0.77 | 0.60 | | 0.77 | 0.60 | 0.70 | | 0.73 | 0.57 | 0.37 | -| deepseek-r1 | actions | 0.80 | 0.53 | 0.57 | | 0.67 | 0.60 | 0.53 | | 0.67 | 0.63 | 0.47 | | 0.70 | 0.50 | 0.57 | - - -When the model generate strategies, GPT-4.5 performs perfectly in the setups (a), (c) and (b) but -fails in setup (b) in differentiating the optimal strategy from a near-optimal one. -Llama3 adopt a random approach to decision-making rather than a structured understanding of rationality. -Mistral-Small consistently achieves a 100% success rate across all setups, demonstrating robust reasoning abilities. -DeepSeek-R1 does not produce valid responses, further reinforcing that it may not be a viable candidate -for generating rational strategies. - -When they generates individual actions, GPT-4.5 achieves perfect performance in the standard (a) setup but struggles significantly with implicit belief -when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward, -it is confused by the altered payoffs. -LLama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types -and adjusted payoff matrices. -Mistral-Small, while performing well with given and explicit beliefs, faces challenges in implicit belief, particularly in version (d). -DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality. 
-
-## Belief
-
-In order to evaluate the ability of LLMs to refine belief by predicting the opponent’s next move,
-we consider a simplified version of the Rock-Paper-Scissors game.
-
-Rules:
-1. The opponent follows a hidden strategy (repeating pattern).
-2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
-3. A correct guess earns 1 point, and an incorrect guess earns 0 points.
-4. The game can run for N rounds, and the player’s accuracy is evaluated at the each round.
-
-We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
-in identifying these patterns by calculating the average points earned per round.
-The temperature is fixed at 0.7, and each game of 10 round is playerd 30 times.
-
-The figures below present the average points earned per round for each model against
-the three opponent’s patterns regardless of whether the models were prompted to generate
-a strategy or specific actions. The 95% confidence interval is also shown.
-We find that the action generation performance of LLMs, whether proprietary or open-weight, is
-only marginally better than a random strategy.
-The strategies generated by the model GPT-4.5 and Mistral-Small predicts the opponent’s next
-move based on past rounds by identifying the most frequently move by the opponent. While this strategy
-is effective against the constant behavior, it fails to predict the opponent’s next move when the opponent
-adopts a more complex pattern. Neither Llama3 nor DeepSeek-R1 were able to generate a valid strategy.
+If Player 2 is rational, they must choose A because B is strictly dominated. If
+Player 1 is rational, they may choose either X or Y: X is the best response if
+Player 1 believes that Player 2 will choose A, while Y is the best response if
+Player 1 believes that Player 2 will choose B. If Player 1 satisfies
+second-order rationality, they must play X. To neutralize biases in large
+language models (LLMs) related to the naming of actions, we reverse the action
+names in half of the experiments.
+
+We consider three types of beliefs:
+- an *implicit belief*, where the optimal action must be deduced from
+  the natural language description of the payoff matrix;
+- an *explicit belief*, based on the analysis of Player 2's actions: the fact
+  that B is strictly dominated by A is provided in the prompt;
+- a *given belief*, where the optimal action for Player 1 is explicitly given in the prompt.
+
+We first evaluate the rationality of the agents and then their second-order rationality.
+
+
+### First-Order Rationality
+
+The table below evaluates the models' ability to generate rational
+behaviour for Player 2.
+
+| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** |
+|--------------------|--------------|----------|------------|------------|
+| `gpt-4.5` | strategy | 1.00 | 1.00 | 1.00 |
+| `mistral-small` | strategy | 1.00 | 1.00 | 1.00 |
+| `llama3` | strategy | 0.50 | 0.50 | 0.50 |
+| `deepseek-r1` | strategy | - | - | - |
+| `gpt-4.5` | actions | 1.00 | 1.00 | 1.00 |
+| `mistral-small` | actions | 1.00 | 1.00 | 0.87 |
+| `llama3` | actions | 1.00 | 0.90 | 0.17 |
+| `deepseek-r1` | actions | 0.83 | 0.57 | 0.60 |
+
+When generating strategies, GPT-4.5 and Mistral-Small exhibit
+rational behaviour, whereas Llama3 adopts a random strategy.
+DeepSeek-R1 fails to generate valid output. When generating actions,
+GPT-4.5 demonstrates its ability to make rational decisions, even with
+implicit beliefs. Mistral-Small outperforms the other open-weight models.
+Llama3 struggles to infer optimal actions based solely on implicit
+beliefs. DeepSeek-R1 is not a good candidate for simulating
+rationality.
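+
+As a point of reference, the sketch below (our own illustration, not code produced by
+any of the models) spells out what a rational choice for Player 2 amounts to with the
+payoff matrix above: B is strictly dominated by A, so a rational Player 2 plays A
+regardless of its beliefs about Player 1.
+
+```python
+# Player 2's payoffs from the matrix above, indexed by (Player 1 action, Player 2 action).
+player2_payoff = {
+    ("X", "A"): 10, ("X", "B"): 5,
+    ("Y", "A"): 5,  ("Y", "B"): 0,
+}
+
+def strictly_dominates(a: str, b: str) -> bool:
+    """True if action `a` yields Player 2 a higher payoff than `b`
+    against every possible action of Player 1."""
+    return all(player2_payoff[(p1, a)] > player2_payoff[(p1, b)] for p1 in ("X", "Y"))
+
+def rational_player2_action() -> str:
+    # A strictly dominates B (10 > 5 against X, 5 > 0 against Y), so play A.
+    return "A" if strictly_dominates("A", "B") else "B"
+
+print(rational_player2_action())  # -> "A"
+```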
+
+
+### Second-Order Rationality
+
+To adjust the difficulty of optimal decision-making, we define four variants of
+the payoff matrix for Player 1 in the table below: (a) the original
+configuration, (b) a reduction of the gap between the gains, (c) a decrease in
+the gain for the good choice X, and (d) an increase in the gain for the bad
+choice Y.
+
+| **Player 1 \ Player 2** | **A (a)** | **B (a)** | **A (b)** | **B (b)** | **A (c)** | **B (c)** | **A (d)** | **B (d)** |
+|-------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
+| **X**                   | 15        | 5         | 8         | 7         | 6         | 5         | 15        | 5         |
+| **Y**                   | 0         | 10        | 7         | 8         | 0         | 10        | 0         | 40        |
+
+The table below evaluates the models' ability to generate second-order
+rational behaviour for Player 1.
+
+When the models generate strategies, GPT-4.5 exhibits second-order
+rational behaviour in configurations (a), (c), and (d), but fails in
+configuration (b) to distinguish the optimal action from a nearly optimal one.
+Llama3 makes its decision randomly. Mistral-Small shows strong
+capabilities in generating second-order rational behaviour. DeepSeek-R1
+does not produce valid responses.
+
+When generating actions, Llama3 adapts to different types of beliefs
+and adjustments in the payoff matrix. GPT-4.5 performs well in the
+initial configuration (a), but encounters significant difficulties when the
+payoff structure changes (b, c, d), particularly with implicit beliefs. Although
+Mistral-Small works well with given or explicit beliefs, it faces
+difficulties with implicit beliefs, especially in variant (d).
+DeepSeek-R1 does not appear to be a good candidate for simulating
+second-order rationality.
+
+| **Model** | **Generation** | **Given (a)** | **Explicit (a)** | **Implicit (a)** | **Given (b)** | **Explicit (b)** | **Implicit (b)** | **Given (c)** | **Explicit (c)** | **Implicit (c)** | **Given (d)** | **Explicit (d)** | **Implicit (d)** |
+|-----------|----------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|
+| **gpt-4.5** | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
+| **llama3** | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
+| **mistral-small** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
+| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
+| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
+| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
+| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
+| **deepseek-r1** | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
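+
+As a sanity check on the four variants (our own illustration, not code generated by the
+models), the sketch below computes the second-order rational action for Player 1:
+anticipating that a rational Player 2 plays A, choosing X is optimal in every variant,
+including (d), where the payoff of 40 for Y is only reachable if Player 2 irrationally
+plays B.
+
+```python
+# Player 1's payoffs against (A, B) for each variant of the matrix above.
+variants = {
+    "a": {"X": (15, 5), "Y": (0, 10)},
+    "b": {"X": (8, 7),  "Y": (7, 8)},
+    "c": {"X": (6, 5),  "Y": (0, 10)},
+    "d": {"X": (15, 5), "Y": (0, 40)},
+}
+
+for name, payoffs in variants.items():
+    # A rational Player 2 plays A (B is strictly dominated), so a second-order
+    # rational Player 1 best-responds to A, i.e. compares payoffs in the A column.
+    best = max(payoffs, key=lambda action: payoffs[action][0])
+    print(name, best)  # -> X for every variant
+```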
+
+Irrational decisions are explained by inference errors based on the natural
+language description of the payoff matrix. For example, in variant (d), the
+Mistral-Small model with given beliefs justifies its poor decision as
+follows: "Since player 2 is rational and A strictly dominates B, player 2 will
+choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
+(40). Therefore, choosing Y maximizes my gain."
+
+In summary, the results indicate that GPT-4.5 and
+Mistral-Small generally adopt first- and second-order rational
+behaviours. However, GPT-4.5 struggles to distinguish an optimal action
+from a nearly optimal one, while Mistral-Small encounters difficulties
+with implicit beliefs. Llama3 generates strategies randomly but adapts
+better when producing specific actions. In contrast, DeepSeek-R1 fails
+to provide valid strategies and generates irrational actions.
+
+
+## Beliefs
+
+Beliefs, whether implicit, explicit, or given, are crucial for an autonomous
+agent's decision-making process: they allow the agent to anticipate the actions
+of other agents.
+
+To assess the agents' ability to refine their beliefs in predicting their
+interlocutor's next action, we consider a simplified version of the
+Rock-Paper-Scissors (RPS) game where:
+- the opponent follows a hidden strategy, i.e., a repeating pattern;
+- the player must predict the opponent's next move (Rock, Paper, or Scissors);
+- a correct prediction earns 1 point, while an incorrect one earns 0 points;
+- the game can be played for $N$ rounds, and the player's accuracy is evaluated at each round.
+
+For our experiments, we consider three simple behaviour models for the opponent:
+- the opponent's actions remain constant (R, S, or P, respectively);
+- the opponent's actions follow a two-step loop (R-P, P-S, S-R);
+- the opponent's actions follow a three-step loop (R-P-S).
+
+We evaluate the models' ability to identify these behavioural patterns by
+calculating the average number of points earned per round.
+
+The figures below present the average points earned per round and the
+95% confidence interval for each LLM against the three opponent behaviour
+models in the simplified version of the RPS game, whether the LLM generates a
+strategy or one-shot actions. We observe that the performance of LLMs in action
+generation, except for Mistral-Small when facing a constant strategy,
+is barely better than a random strategy. The strategies generated by the
+GPT-4.5 and Mistral-Small models predict the opponent's next
+move based on previous rounds by identifying the most frequently played move.
+While these strategies are effective against an opponent with a constant
+behaviour, they fail to predict the opponent's next move when the latter adopts
+a more complex pattern. Neither Llama3 nor DeepSeek-R1 were able
+to generate a valid strategy.

![](figures/ring/rps_constant_strategy.svg)

@@ -239,36 +286,18 @@ adopts a more complex pattern. Neither Llama3 nor DeepSeek-R1 were able to gener

![](figures/ring/rps_2loop_actions.svg)

+To assess the agents' ability to factor the prediction of their opponent's next
+move into their decision-making, we analyse the performance of each generative
+agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
+and a loss 0 points.
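+
+The sketch below (our own illustration, in the spirit of the frequency-based
+strategies that GPT-4.5 and Mistral-Small generate) shows how such a belief can be
+turned into a decision: predict the opponent's most frequent past move, then play
+the move that beats it. The helper names are ours.
+
+```python
+from collections import Counter
+
+BEATS = {"R": "P", "P": "S", "S": "R"}  # move that beats the key
+
+def predict_next(history: list[str]) -> str:
+    # Frequency-based belief: assume the opponent repeats its most common move.
+    return Counter(history).most_common(1)[0][0] if history else "R"
+
+def best_response(history: list[str]) -> str:
+    # Integrate the belief into the decision: play the move that beats the prediction.
+    return BEATS[predict_next(history)]
+
+# Against a constant opponent (R, R, R, ...) this wins 2 points per round after
+# the first; against the 2-step and 3-step loops the prediction often fails.
+print(best_response(["R", "R", "R"]))  # -> "P"
+```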

-## From belief to action
-
-To evaluate the ability of LLMs to predict not only the opponent’s next move but also to act rationally
-based on their prediction, we consider the Rock-Paper-Scissors (RPS) game.
-
-RPS is a simultaneous, zero-sum game for two players.
-The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock;
-and if both players take the same action, the game is a tie. Scoring is as follows:
-a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
-
-The objective in R-P-S is straightforward: win by selecting the optimal action
-based on the opponent’s move. Since the rules are simple and deterministic,
-LLMs can always make the correct choice. Therefore, RPS serves as a tool to
-assess an LLM’s ability to identify and capitalize on patterns in an opponent’s
-non-random behavior.
-
-For a fine-grained analysis of the ability of LLMs to identify
-opponent’s patterns, we set up 3 simple opponent’s patterns:
-1. the opponent’s actions remaining constant as R, S, and P, respectively;
-2. the opponent’s actions looping in a 2-step pattern (R-P, P-S, S-R);
-3. the opponent’s actions looping in a 3-step pattern (R-P-S).
-
-We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1)
-in identifying these patterns by calculating the average points earned per round.
-The temperature is fixed at 0.7, and each game of 10 round is playerd 30 times.
-
-The figures below present the average points earned per round for each model against
-the three opponent’s patterns. The 95% confidence interval is also shown.
-We observe that the performance of LLMs is barely better than that of a random strategy.
+The figures below illustrate the average points earned per round along with
+the 95% confidence interval for each LLM when facing constant strategies,
+whether the model generates a full strategy or one-shot actions. The results
+show that LLMs' performance in action generation against a constant strategy is
+only marginally better than a random strategy. While Mistral-Small can
+accurately predict its opponent's move, it fails to integrate this belief into
+its decision-making process.

![](figures/ring/rps_constant_actions.svg)

@@ -276,7 +305,22 @@ We observe that the performance of LLMs is barely better than that of a random s

![](figures/ring/rps_R_actions.svg)

-
+In summary, generative autonomous agents struggle to anticipate or effectively
+incorporate other agents' actions into their decision-making.
+
+## Synthesis
+
+Our results show that GPT-4.5, Llama3, and
+Mistral-Small generally respect preferences but encounter more
+difficulties in generating one-shot actions than in producing strategies in the
+form of algorithms. GPT-4.5 and Mistral-Small generally adopt
+rational behaviours of both first and second order, whereas Llama3,
+despite generating random strategies, adapts better when producing one-shot
+actions. In contrast, DeepSeek-R1 fails to develop valid strategies and
+performs poorly in generating actions that align with preferences or rationality
+principles. More critically, all the LLMs we evaluated struggle both to
+anticipate other agents' actions and to integrate them effectively into their
+decision-making process.

## Authors