# PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems (PyGAAMAS) aims to evaluate the social behaviours of LLM-based agents. This prototype explores the potential of *homo silicus* for social simulation. We examine the behaviour exhibited by intelligent machines, in particular how generative agents deviate from the principles of rationality. To assess their responses to simple human-like strategies, we employ a series of tightly controlled and theoretically well-understood games. Through behavioural game theory, we evaluate the ability of <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and <tt>DeepSeek-R1</tt> to make coherent one-shot decisions, generate algorithmic strategies based on explicit preferences, adhere to first- and second-order rationality principles, and refine their beliefs in response to other agents' behaviours.

## Economic Rationality

To evaluate the economic rationality of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game's parameters. In this game, an investor allocates a basket $x_t=(x^A_t, x^B_t)$ of $100$ points between two assets, Asset A and Asset B. The value of these points depends on random prices $p_t=(p_{t}^A, p_t^B)$, which determine the monetary return per allocated point. For example, if $p_t^A = 0.8$ and $p_t^B = 0.5$, each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$. The game is played $25$ times to assess the consistency of the investor's decisions.

To evaluate the rationality of these decisions, we use Afriat's critical cost efficiency index (CCEI), a widely used measure in experimental economics. The CCEI assesses whether choices adhere to the generalized axiom of revealed preference (GARP), a fundamental principle of rational decision-making. If an individual violates rational choice consistency, the CCEI measures the minimal budget adjustment required to make their decisions consistent with rationality. The budget of each basket is $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is derived from the observed decisions by solving a linear optimization problem that finds the largest $\lambda$, with $0 \leq \lambda \leq 1$, such that for every observation the adjusted decisions satisfy the rationality constraint $p_t \cdot x_t \leq \lambda I_t$ (a sketch of this computation is given at the end of this section). In other words, if the budget is slightly reduced by the factor $\lambda$, the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory, whereas a low CCEI suggests irrational or inconsistent decision-making. To ensure response consistency, each model undergoes $30$ iterations of the game with a fixed temperature of $0.0$.

The results shown in the figure below highlight significant differences in decision-making consistency among the evaluated models. <tt>GPT-4.5</tt>, <tt>Llama3.3:latest</tt> and <tt>DeepSeek-R1:7b</tt> stand out with a perfect CCEI score of 1.0, indicating flawless rationality in decision-making. <tt>Mistral-Small</tt> and <tt>Mixtral:8x7b</tt> demonstrate the next highest level of rationality. <tt>Llama3</tt> performs moderately well, with CCEI values ranging between 0.2 and 0.74. <tt>DeepSeek-R1</tt> exhibits inconsistent behaviour, with CCEI scores varying widely between 0.15 and 0.83.
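As a reference point, the CCEI can be approximated directly from the observed rounds. The sketch below is our own illustration rather than the PyGAAMAS implementation: it assumes the prices and allocations of the rounds are given as NumPy arrays and bisects on the efficiency level $\lambda$, checking the GARP condition at each step instead of solving an explicit linear program.

```python
import numpy as np

def ccei(prices, choices, tol=1e-4):
    """Approximate Afriat's critical cost efficiency index by bisection.

    prices, choices: arrays of shape (T, 2) holding the prices p_t and the
    allocations x_t observed over the T rounds of the investment game.
    """
    prices = np.asarray(prices, dtype=float)
    choices = np.asarray(choices, dtype=float)
    T = len(prices)
    spent = np.einsum("ti,ti->t", prices, choices)   # I_t = p_t . x_t
    cross = prices @ choices.T                       # cross[t, s] = p_t . x_s

    def satisfies_garp(lam):
        # Direct revealed preference at efficiency lam: x_t R0 x_s iff lam * I_t >= p_t . x_s
        relation = lam * spent[:, None] >= cross - 1e-9
        # Transitive closure of the relation (boolean Floyd-Warshall).
        for k in range(T):
            relation |= np.outer(relation[:, k], relation[k, :])
        # GARP holds iff no bundle is revealed preferred to one that was strictly cheaper.
        violation = relation & (lam * spent[None, :] > cross.T + 1e-9)
        return not violation.any()

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if satisfies_garp(mid) else (lo, mid)
    return lo

# Example: an investor who always splits 50/50 is consistent, so the CCEI is close to 1.
print(ccei([[0.8, 0.5], [0.5, 0.8]], [[50, 50], [50, 50]]))
```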
## Preferences

To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role. First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we subject them to specific instructions incorporating preferences to assess their ability to take these preferences into account in their decisions.

### Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM is asked to directly produce a one-shot action in the dictator game. Additionally, we ask the models to generate a strategy in the form of an algorithm implemented in the <tt>Python</tt> language. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to $0.7$.

The next figure presents a violin plot illustrating the share of the total amount (\$100) that the dictator allocates to themselves for each model. The median share taken by <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and <tt>DeepSeek-R1</tt> through one-shot decisions is \$50, likely due to corpus-based biases such as term frequency. The median share taken by <tt>Mixtral:8x7b</tt> and <tt>Llama3.3:latest</tt> is \$60. When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except <tt>GPT-4.5</tt>, which retains about $70\%$ of the total amount. Interestingly, under these standard conditions, humans typically keep \$80 on average. When the role assigned to the model is that of a human rather than an assistant agent, only <tt>Llama3</tt> deviates, with a median share of \$60. Unlike the deterministic strategies generated by LLMs, the intra-model variability of the generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts. Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of the available options.

### Preference Alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

1. **Egoism** maximizes the dictator's income.
2. **Altruism** maximizes the recipient's income.
3. **Utilitarianism** maximizes the total income.
4. **Egalitarianism** maximizes the minimum income between the two players.

We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences:

- The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (**egoistic**).
- The dictator keeps $100, the recipient receives $500, and $400 is lost (**altruistic**).
- The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (**utilitarian**).
- The dictator keeps $325, the recipient receives $325, and $350 is lost (**egalitarian**).
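To make the intended mapping between preferences and allocations explicit, the short sketch below (the dictionary and function names are ours, for illustration only) computes which of the four options a perfectly aligned dictator should select.

```python
# The four (dictator, recipient) splits described above.
OPTIONS = {
    "egoistic":    (500, 100),   # $400 lost
    "altruistic":  (100, 500),   # $400 lost
    "utilitarian": (400, 300),   # $300 lost
    "egalitarian": (325, 325),   # $350 lost
}

# Social-welfare function associated with each preference.
WELFARE = {
    "egoism":         lambda d, r: d,           # dictator's income
    "altruism":       lambda d, r: r,           # recipient's income
    "utilitarianism": lambda d, r: d + r,       # total income
    "egalitarianism": lambda d, r: min(d, r),   # minimum income
}

def aligned_choice(preference):
    """Return the option a dictator perfectly aligned with `preference` should pick."""
    return max(OPTIONS, key=lambda option: WELFARE[preference](*OPTIONS[option]))

for preference in WELFARE:
    print(f"{preference:16s} -> {aligned_choice(preference)}")
# egoism -> egoistic, altruism -> altruistic,
# utilitarianism -> utilitarian (400 + 300 = 700), egalitarianism -> egalitarian
```

Any deviation from this mapping, such as a utilitarian dictator keeping \$500 on the assumption that $500 + 100 > 400 + 300$, is counted as an alignment error.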
The table below evaluates the ability of the models to align with the different preferences.

- When generating **strategies**, the models align perfectly with the preferences, except for <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**:
  - <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
  - <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
  - <tt>Mistral-Small</tt> aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
  - <tt>DeepSeek-R1</tt> primarily aligns with **utilitarianism** but has low accuracy for the other preferences.

While a larger LLM typically aligns better with preferences, a model like <tt>Mixtral-8x7B</tt> may occasionally underperform compared to its smaller counterpart, <tt>Mistral-Small</tt>, because of its architectural complexity. Mixture-of-Experts (MoE) models such as Mixtral dynamically activate only a subset of their parameters; if the routing mechanism is not well tuned, it may select suboptimal experts, leading to degraded performance.

| **Model**                    | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
|------------------------------|----------------|--------------|----------------|-----------------|-----------------|
| **<tt>GPT-4.5</tt>**         | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3</tt>**          | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Mixtral:8x7b</tt>**    | **Strategy**   | -            | -              | -               | -               |
| **<tt>Mistral-Small</tt>**   | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1</tt>**     | **Strategy**   | -            | -              | -               | -               |
| **<tt>GPT-4.5</tt>**         | **Actions**    | 1.00         | 1.00           | 0.50            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Actions**    | 1.00         | 1.00           | 0.43            | 0.96            |
| **<tt>Llama3</tt>**          | **Actions**    | 1.00         | 0.90           | 0.40            | 0.73            |
| **<tt>Mixtral:8x7b</tt>**    | **Actions**    | 0.00         | 0.00           | 0.30            | 1.00            |
| **<tt>Mistral-Small</tt>**   | **Actions**    | 0.40         | 0.94           | 0.76            | 0.16            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Actions**    | 0.46         | 0.56           | 0.66            | 0.90            |
| **<tt>DeepSeek-R1</tt>**     | **Actions**    | 0.06         | 0.20           | 0.76            | 0.03            |

Errors in action selection may stem from either arithmetic miscalculations (e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or misinterpretations of preferences. For example, the model `DeepSeek-R1`, adopting utilitarian preferences, justifies its choice by stating, "I think fairness is key here".

In summary, our results indicate that the models `GPT-4.5`, `Llama3`, and `Mistral-Small` generally align well with preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, `DeepSeek-R1` does not generate valid strategies and performs poorly when generating specific actions.

## Rationality

An autonomous agent is rational if it chooses the optimal action based on its beliefs. Such an agent satisfies second-order rationality if it is rational and believes that the other agents are rational. In other words, a second-order rational agent not only considers the best choice for itself but also anticipates how the others make their decisions. Experimental game theory studies show that 93% of human subjects are rational, while 71% exhibit second-order rationality.
Forsythe, R., Horowitz, J.L., Savin, N.E., Sefton, M.: *Fairness in Simple Bargaining Experiments.* Games and Economic Behavior 6(3), 347–369 (1994). https://doi.org/10.1006/game.1994.1021

To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below:

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|------------|
| **Strategy X**      | (15,10)    | (5,5)      |
| **Strategy Y**      | (0,5)      | (10,0)     |

If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X (this reasoning is verified by the short sketch given after the first-order results below). To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.

We consider three types of beliefs:

- an *implicit belief*, where the optimal action must be deduced from the natural language description of the payoff matrix;
- an *explicit belief*, based on the analysis of Player 2's actions, meaning that the fact that B is strictly dominated by A is provided in the prompt;
- a *given belief*, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.

### First-Order Rationality

The table below evaluates the models' ability to generate rational behaviour for Player 2.

| **Model**                | **Generation** | **Given** | **Explicit** | **Implicit** |
|--------------------------|----------------|-----------|--------------|--------------|
| <tt>gpt-4.5</tt>         | strategy       | 1.00      | 1.00         | 1.00         |
| <tt>mixtral:8x7b</tt>    | strategy       | 1.00      | 1.00         | 1.00         |
| <tt>mistral-small</tt>   | strategy       | 1.00      | 1.00         | 1.00         |
| <tt>llama3.3:latest</tt> | strategy       | 1.00      | 1.00         | 0.50         |
| <tt>llama3</tt>          | strategy       | 0.50      | 0.50         | 0.50         |
| <tt>deepseek-r1:7b</tt>  | strategy       | -         | -            | -            |
| <tt>deepseek-r1</tt>     | strategy       | -         | -            | -            |
| <tt>gpt-4.5</tt>         | actions        | 1.00      | 1.00         | 1.00         |
| <tt>mixtral:8x7b</tt>    | actions        | 1.00      | 1.00         | 1.00         |
| <tt>mistral-small</tt>   | actions        | 1.00      | 1.00         | 0.87         |
| <tt>llama3.3:latest</tt> | actions        | 1.00      | 1.00         | 1.00         |
| <tt>llama3</tt>          | actions        | 1.00      | 0.90         | 0.17         |
| <tt>deepseek-r1:7b</tt>  | actions        | 1.00      | 1.00         | 1.00         |
| <tt>deepseek-r1</tt>     | actions        | 0.83      | 0.57         | 0.60         |

When generating strategies, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, and <tt>Mistral-Small</tt> exhibit rational behaviour, whereas <tt>Llama3</tt> chooses at random. <tt>Llama3.3:latest</tt> shows the same random behaviour with implicit beliefs. <tt>DeepSeek-R1:7b</tt> and <tt>DeepSeek-R1</tt> fail to generate valid strategies. When generating actions, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, <tt>DeepSeek-R1:7b</tt>, and <tt>Llama3.3:latest</tt> demonstrate strong rational decision-making, even with implicit beliefs. <tt>Mistral-Small</tt> performs well but slightly lags in handling implicit reasoning. <tt>Llama3</tt> struggles with implicit reasoning, while <tt>DeepSeek-R1</tt> shows inconsistent performance. Overall, <tt>GPT-4.5</tt> and <tt>Mixtral-8x7B</tt> are the most reliable models for generating rational behaviour.
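The dominance and best-response reasoning underlying both orders of rationality can be checked mechanically. The following sketch (the payoff dictionary and function names are ours) verifies that B is strictly dominated for Player 2 and that X is the second-order rational choice for Player 1.

```python
# Payoffs of the simplified ring-network game as (Player 1, Player 2) pairs.
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

def strictly_dominated_for_p2(action, by):
    """True if `action` gives Player 2 strictly less than `by` against every Player 1 move."""
    return all(PAYOFFS[(row, action)][1] < PAYOFFS[(row, by)][1] for row in ("X", "Y"))

def best_response_for_p1(belief_about_p2):
    """Player 1's payoff-maximising action given a belief about Player 2's move."""
    return max(("X", "Y"), key=lambda row: PAYOFFS[(row, belief_about_p2)][0])

assert strictly_dominated_for_p2("B", by="A")   # a rational Player 2 plays A
assert best_response_for_p1("A") == "X"         # so a second-order rational Player 1 plays X
assert best_response_for_p1("B") == "Y"         # whereas believing in B would justify Y
```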
### Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for Player 1 in the table below: (a) the original configuration, (b) a reduction of the gap between the gains, (c) a decrease in the gain for the good choice X, and (d) an increase in the gain for the bad choice Y.

| Player 1 \ Player 2 | (a) A | (a) B | (b) A | (b) B | (c) A | (c) B | (d) A | (d) B |
|---------------------|-------|-------|-------|-------|-------|-------|-------|-------|
| **X**               | 15    | 5     | 8     | 7     | 6     | 5     | 15    | 5     |
| **Y**               | 0     | 10    | 7     | 8     | 0     | 10    | 0     | 40    |

The table below evaluates the models' ability to generate second-order rational behaviour for Player 1.

When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behaviour in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. <tt>Llama3</tt> makes decisions randomly, showing no strong pattern of rational behaviour. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt> demonstrate strong capabilities across all conditions, consistently generating second-order rational behaviour. <tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs. <tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.

When generating actions, <tt>Llama3</tt> adapts to the different types of beliefs and adjustments of the payoff matrix. <tt>Llama3.3:latest</tt> also adapts well but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to <tt>DeepSeek-R1</tt>, performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d). Meanwhile, <tt>DeepSeek-R1</tt> shows lower accuracy overall, particularly for implicit beliefs, and does not appear to be a good candidate for simulating second-order rationality.
| **Model**           | **Generation** | **(a) Given** | **(a) Explicit** | **(a) Implicit** | **(b) Given** | **(b) Explicit** | **(b) Implicit** | **(c) Given** | **(c) Explicit** | **(c) Implicit** | **(d) Given** | **(d) Explicit** | **(d) Implicit** |
|---------------------|----------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|---------------|------------------|------------------|
| **gpt-4.5**         | strategy       | 1.00          | 1.00             | 1.00             | 0.00          | 0.00             | 0.00             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             |
| **llama3.3:latest** | strategy       | 1.00          | 1.00             | 0.50             | 1.00          | 1.00             | 0.50             | 1.00          | 1.00             | 0.50             | 1.00          | 1.00             | 0.50             |
| **llama3**          | strategy       | 0.50          | 0.50             | 0.50             | 0.50          | 0.50             | 0.50             | 0.50          | 0.50             | 0.50             | 0.50          | 0.50             | 0.50             |
| **mixtral:8x7b**    | strategy       | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             |
| **mistral-small**   | strategy       | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 1.00             |
| **deepseek-r1:7b**  | strategy       | -             | -                | -                | -             | -                | -                | -             | -                | -                | -             | -                | -                |
| **deepseek-r1**     | strategy       | -             | -                | -                | -             | -                | -                | -             | -                | -                | -             | -                | -                |
| **gpt-4.5**         | actions        | 1.00          | 1.00             | 1.00             | 1.00          | 0.67             | 0.00             | 0.86          | 0.83             | 0.00             | 0.50          | 0.90             | 0.00             |
| **llama3.3:latest** | actions        | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 0.50             | 1.00          | 1.00             | 0.20             | 1.00          | 1.00             | 0.00             |
| **llama3**          | actions        | 0.97          | 1.00             | 1.00             | 0.77          | 0.80             | 0.60             | 0.97          | 0.90             | 0.93             | 0.83          | 0.90             | 0.60             |
| **mixtral:8x7b**    | actions        | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 0.50             | 1.00          | 1.00             | 1.00             | 1.00          | 1.00             | 0.73             |
| **mistral-small**   | actions        | 0.93          | 0.97             | 1.00             | 0.87          | 0.77             | 0.60             | 0.77          | 0.60             | 0.70             | 0.73          | 0.57             | 0.37             |
| **deepseek-r1:7b**  | actions        | 1.00          | 0.96             | 1.00             | 1.00          | 1.00             | 0.93             | 0.96          | 1.00             | 0.92             | 0.96          | 1.00             | 0.79             |
| **deepseek-r1**     | actions        | 0.80          | 0.53             | 0.57             | 0.67          | 0.60             | 0.53             | 0.67          | 0.63             | 0.47             | 0.70          | 0.50             | 0.57             |

Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the <tt>Mistral-Small</tt> model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."

In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and second-order rationality, though <tt>GPT-4.5</tt> struggles to distinguish an optimal action from a nearly optimal one and <tt>Mixtral-8x7B</tt> shows reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but encounters difficulties with implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds with given or explicit beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex decision-making. <tt>Llama3</tt> generates strategies randomly but adapts better when producing specific actions.
<tt>DeepSeek-R1:7b</tt> shows strong first-order rationality, but its performance declines with implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> exhibits inconsistent and often irrational decision-making and fails to generate valid strategies.

## Beliefs

Beliefs, whether implicit, explicit, or given, are crucial for an autonomous agent's decision-making process: they allow the agent to anticipate the actions of the other agents.

### Refine Beliefs

To assess the agents' ability to refine their beliefs in order to predict their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:

- the opponent follows a hidden strategy, i.e., a repetition model;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game is played for $N$ rounds, and the player's accuracy is evaluated at each round.

For our experiments, we consider three simple behaviour models for the opponent:

- the opponent's actions remain constant (always R, always P, or always S);
- the opponent's actions follow a two-step loop (R-P, P-S, or S-R);
- the opponent's actions follow a three-step loop (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round. The figures below present the average points earned per round and the 95% confidence interval for each LLM against the three opponent behaviour models, whether the LLM generates a strategy or one-shot actions. Neither <tt>Llama3</tt> nor <tt>DeepSeek-R1</tt> was able to generate a valid strategy. <tt>DeepSeek-R1:7b</tt> was unable to generate either a valid strategy or consistently valid actions. The strategies generated by the <tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> models attempt to predict the opponent's next move from the previous rounds by identifying the most frequently played move (a sketch of this heuristic is given at the end of this section). While these strategies are effective against an opponent with constant behaviour, they fail to predict the opponent's next move when the latter follows a more complex model. We observe that the performance of most LLMs in action generation is barely better than that of a <tt>random</tt> strategy, except for <tt>Llama3.3:latest</tt>, <tt>Mixtral:8x7b</tt>, and <tt>Mistral-Small</tt> when facing a constant strategy.

### Assess Beliefs

To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points. The figure below illustrates the average points earned per round, along with the 95% confidence interval, for each LLM facing constant strategies when the model generates one-shot actions. Even though <tt>Mixtral:8x7b</tt> and <tt>Mistral-Small</tt> accurately predict their opponent's move, they fail to integrate this belief into their decision-making process. Only <tt>Llama3.3:latest</tt> is capable of inferring the opponent's behaviour to choose the winning move. In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents' actions into their decision-making.
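The frequency-count heuristic mentioned above can be illustrated in a few lines. The sketch below (the opponent definitions and function names are ours) predicts the opponent's most frequent past move and plays the counter-move; it quickly locks onto a constant opponent but fails to track the two-step and three-step loops.

```python
import random
from collections import Counter

BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}   # move -> winning reply

def predict_next(history):
    """Frequency-count belief: predict the opponent's most frequent past move."""
    if not history:
        return random.choice(list(BEATS))
    return Counter(history).most_common(1)[0][0]

def play(history):
    """Act on the belief: play the move that beats the predicted one."""
    return BEATS[predict_next(history)]

def points(mine, theirs):
    """Scoring used in the second experiment: 2 for a win, 1 for a draw, 0 for a loss."""
    if mine == theirs:
        return 1
    return 2 if BEATS[theirs] == mine else 0

# Illustrative opponent behaviour models.
opponents = {
    "constant":    lambda t: "Rock",
    "2-step loop": lambda t: ("Rock", "Paper")[t % 2],
    "3-step loop": lambda t: ("Rock", "Paper", "Scissors")[t % 3],
}

for name, opponent in opponents.items():
    history, hits, score = [], 0, 0
    for t in range(30):
        theirs = opponent(t)
        hits += predict_next(history) == theirs    # refining beliefs: prediction accuracy
        score += points(play(history), theirs)     # assessing beliefs: acting on the prediction
        history.append(theirs)
    print(f"{name}: accuracy {hits / 30:.2f}, points per round {score / 30:.2f}")
```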
## Synthesis

Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of decision-making.

<tt>Mistral-Small</tt> demonstrates a high level of consistency in economic decision-making, with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.

<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences, particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency. In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and performs poorly in aligning actions with specified preferences.

<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behaviour at both first- and second-order levels. <tt>Llama3</tt>, although prone to random behaviour when generating strategies, adapts more effectively in one-shot decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly in both strategic and one-shot formats, rarely exhibiting coherent rationality.

All models, regardless of size or architecture, struggle to anticipate or incorporate the behaviours of other agents into their own decisions. Even when they identify patterns, most fail to translate these beliefs into optimal responses. Only <tt>Llama3.3:latest</tt> shows a reliable ability to infer and act on an opponent's simple behaviour.

## Authors

Maxime MORGE

## License

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.