PyGAAMAS
Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.
This prototype explores the potential of homo silicus for social simulation. We examine the behaviour exhibited by intelligent machines, particularly how generative agents deviate from the principles of rationality. To assess their responses to simple human-like strategies, we employ a series of tightly controlled and theoretically well-understood games. Through behavioral game theory, we evaluate the ability of GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 to make coherent one-shot decisions, generate algorithmic strategies based on explicit preferences, adhere to first- and second-order rationality principles, and refine their beliefs in response to other agents' behaviours.
Economic Rationality
To evaluate the economic rationality of various LLMs, we introduce an investment game designed to test whether these models follow stable decision-making patterns or react erratically to changes in the game’s parameters.
In this game, an investor allocates a basket x_t=(x^A_t, x^B_t) of 100 points between two assets: Asset A and Asset B. The value of these points depends on random prices p_t=(p_t^A, p_t^B), which determine the monetary return per allocated point. For example, if p_t^A = 0.8 and p_t^B = 0.5, each point assigned to Asset A is worth $0.80, while each point allocated to Asset B yields $0.50. The game is played 25 times to assess the consistency of the investor's decisions.
To evaluate the rationality of these decisions, we use Afriat's critical cost efficiency index (CCEI), a measure widely used in experimental economics. The CCEI assesses whether choices adhere to the generalized axiom of revealed preference (GARP), a fundamental principle of rational decision-making. If an individual violates rational choice consistency, the CCEI measures the minimal budget adjustment required to make their decisions consistent with rationality. The budget for each basket is $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is derived from the observed decisions by solving a linear optimization problem that finds the largest $\lambda$, with $0 \leq \lambda \leq 1$, such that for every observation the adjusted decisions satisfy the rationality constraint $p_t \cdot x_t \leq \lambda I_t$. In other words, if we slightly reduce the budget by multiplying it by $\lambda$, the choices become consistent with rational decision-making. A CCEI close to 1 indicates high rationality and consistency with economic theory; a low CCEI suggests irrational or inconsistent decision-making.
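As an illustration of how the index can be computed, the sketch below approximates the CCEI by binary search over $\lambda$, checking GARP on the scaled budgets at each step. The data layout and the function names (`satisfies_garp`, `ccei`) are assumptions for exposition, not the project's actual implementation, which may solve the linear program directly.

```python
# Minimal CCEI sketch: prices and bundles are lists of (A, B) tuples, one per round.
import numpy as np

def satisfies_garp(prices, bundles, efficiency):
    """Check GARP when every budget is scaled down by `efficiency` (lambda)."""
    n = len(bundles)
    expenditure = np.array([np.dot(p, x) for p, x in zip(prices, bundles)])
    # Direct revealed preference: x_s R0 x_t if lambda * p_s.x_s >= p_s.x_t
    direct = np.zeros((n, n), dtype=bool)
    for s in range(n):
        for t in range(n):
            direct[s, t] = efficiency * expenditure[s] >= np.dot(prices[s], bundles[t]) - 1e-9
    # Transitive closure (Warshall) gives the revealed preference relation R
    closure = direct.copy()
    for k in range(n):
        for i in range(n):
            closure[i] = closure[i] | (closure[i, k] & closure[k])
    # Violation: x_s R x_t while x_s is strictly cheaper than lambda * I_t at prices p_t
    for s in range(n):
        for t in range(n):
            if closure[s, t] and np.dot(prices[t], bundles[s]) < efficiency * expenditure[t] - 1e-9:
                return False
    return True

def ccei(prices, bundles, tol=1e-4):
    """Largest lambda in [0, 1] such that the adjusted choices satisfy GARP (binary search)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(prices, bundles, mid):
            lo = mid
        else:
            hi = mid
    return lo
```

With 25 observations per game, this check runs in negligible time.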
To ensure response consistency, each model undergoes 30 iterations of the game with a fixed temperature of 0.0. The results, shown in the figure below, highlight significant differences in decision-making consistency among the evaluated models. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b stand out with a perfect CCEI score of 1.0, indicating flawless rationality in decision-making. Mistral-Small and Mixtral:8x7b demonstrate the next highest level of rationality. Llama3 performs moderately well, with CCEI values ranging between 0.2 and 0.74. DeepSeek-R1 exhibits inconsistent behaviour, with CCEI scores varying widely between 0.15 and 0.83.
Preferences
To analyse the behaviour of generative agents based on their preferences, we rely on the dictator game. This variant of the ultimatum game features a single player, the dictator, who decides how to distribute an endowment (e.g., a sum of money) between themselves and a second player, the recipient. The dictator has complete freedom in this allocation, while the recipient, having no influence over the outcome, takes on a passive role.
First, we evaluate the choices made by LLMs when playing the role of the dictator, considering these decisions as a reflection of their intrinsic preferences. Then, we subject them to specific instructions incorporating preferences to assess their ability to consider them in their decisions.
Preference Elicitation
Here, we consider that the choice of an LLM as a dictator reflects its intrinsic preferences. Each LLM is asked to directly produce a one-shot action in the dictator game. We also ask the models to generate a strategy in the form of an algorithm implemented in Python. In all our experiments, one-shot actions are repeated 30 times, and the models' temperature is set to 0.7.
The next figure presents a violin plot illustrating the share of the total amount ($100) that the dictator allocates to themselves for each model. The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 through one-shot decisions is $50, likely due to corpus-based biases such as term frequency. The median share taken by Mixtral:8x7b and Llama3.3:latest is $60. When we ask the models to generate a strategy rather than a one-shot action, all models distribute the amount equally, except GPT-4.5, which retains about 70% of the total amount. Interestingly, under these standard conditions, humans typically keep $80 on average. When the role assigned to the model is that of a human rather than an assistant agent, only Llama3 deviates, with a median share of $60. Unlike the deterministic strategies generated by LLMs, the intra-model variability in generated actions can be used to simulate the diversity of human behaviours based on their experiences, preferences, or contexts.
Our sensitivity analysis of the temperature parameter reveals that the portion retained by the dictator remains stable. However, the decisions become more deterministic at low temperatures, whereas allocation diversity increases at high temperatures, reflecting a more random exploration of available options.
Preference Alignment
We define four preferences for the dictator, each corresponding to a distinct form of social welfare:
- Egoism maximizes the dictator’s income.
- Altruism maximizes the recipient’s income.
- Utilitarianism maximizes total income.
- Egalitarianism maximizes the minimum income between the players.
We consider four allocation options where part of the money is lost in the division process, each corresponding to one of the four preferences (a strategy sketch follows the list):
- The dictator keeps $500, the recipient receives $100, and a total of $400 is lost (egoistic).
- The dictator keeps $100, the recipient receives $500, and $400 is lost (altruistic).
- The dictator keeps $400, the recipient receives $300, resulting in a loss of $300 (utilitarian).
- The dictator keeps $325, the other player receives $325, and $350 is lost (egalitarian).
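A dictator strategy that aligns with these preferences amounts to maximising the corresponding welfare criterion over the four options. The sketch below is an assumed illustration, not the code generated by any of the evaluated models.

```python
# Illustrative preference-aligned dictator strategy over the four options above.
OPTIONS = {
    "egoistic":    (500, 100),  # (dictator, recipient), $400 lost
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

def choose_allocation(preference):
    """Pick the allocation maximising the social welfare named by `preference`."""
    if preference == "egoistic":
        return max(OPTIONS.values(), key=lambda o: o[0])        # dictator's income
    if preference == "altruistic":
        return max(OPTIONS.values(), key=lambda o: o[1])        # recipient's income
    if preference == "utilitarian":
        return max(OPTIONS.values(), key=lambda o: o[0] + o[1])  # total income
    if preference == "egalitarian":
        return max(OPTIONS.values(), key=lambda o: min(o))       # minimum income
    raise ValueError(f"unknown preference: {preference}")
```

For instance, `choose_allocation("utilitarian")` returns the $400/$300 split, the option with the largest total income despite the $300 loss.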
Table below evaluates the ability of the models to align with different preferences.
- When generating strategies, the models align perfectly with the preferences, except for DeepSeek-R1 and Mixtral:8x7b, which do not generate valid code.
- When generating actions,
- GPT-4.5 aligns well with preferences but struggles with utilitarianism.
- Llama3 aligns well with egoistic and altruistic preferences but shows lower adherence to utilitarian and egalitarian choices.
- Mistral-Small aligns better with altruistic preferences and performs moderately on utilitarianism but struggles with egoistic and egalitarian preferences.
- DeepSeek-R1 primarily aligns with utilitarianism but has low accuracy for the other preferences.

While a larger LLM typically aligns better with preferences, a model like Mixtral-8x7B may occasionally underperform compared to its smaller counterpart, Mistral-Small, because of its architectural complexity. Mixture-of-Experts (MoE) models such as Mixtral dynamically activate only a subset of their parameters; if the routing mechanism is not well tuned, it may select suboptimal experts, leading to degraded performance.
Model | Generation | Egoistic | Altruistic | Utilitarian | Egalitarian |
---|---|---|---|---|---|
GPT-4.5 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Llama3.3:latest | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Llama3 | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
Mixtral:8x7b | Strategy | - | - | - | - |
Mistral-Small | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
DeepSeek-R1:7b | Strategy | 1.00 | 1.00 | 1.00 | 1.00 |
DeepSeek-R1 | Strategy | - | - | - | - |
GPT-4.5 | Actions | 1.00 | 1.00 | 0.50 | 1.00 |
Llama3.3:latest | Actions | 1.00 | 1.00 | 0.43 | 0.96 |
Llama3 | Actions | 1.00 | 0.90 | 0.40 | 0.73 |
Mixtral:8x7b | Actions | 0.00 | 0.00 | 0.30 | 1.00 |
Mistral-Small | Actions | 0.40 | 0.94 | 0.76 | 0.16 |
DeepSeek-R1:7b | Actions | 0.46 | 0.56 | 0.66 | 0.90 |
DeepSeek-R1 | Actions | 0.06 | 0.20 | 0.76 | 0.03 |
Errors in action selection may stem either from arithmetic miscalculations (e.g., the model incorrectly assumes that 500 + 100 > 400 + 300) or from misinterpretations of preferences. For example, DeepSeek-R1, adopting utilitarian preferences, justifies its choice by stating, "I think fairness is key here".
In summary, our results indicate that GPT-4.5, Llama3, and Mistral-Small generally align well with the preferences but have more difficulty generating individual actions than algorithmic strategies. In contrast, DeepSeek-R1 does not generate valid strategies and performs poorly when generating specific actions.
Rationality
An autonomous agent is rational if it chooses the optimal action based on its beliefs. It satisfies second-order rationality if it is rational and believes that the other agents are rational. In other words, a second-order rational agent not only considers the best choice for itself but also anticipates how others make their decisions. Experimental game theory studies show that 93% of human subjects are rational, while 71% exhibit second-order rationality.
Forsythe, R., Horowitz, J.L., Savin, N.E., Sefton, M.: Fairness in Simple Bargaining Experiments. Games and Economic Behavior 6(3), 347–369 (1994). https://doi.org/10.1006/game.1994.1021
To evaluate the first- and second-order rationality of generative autonomous agents, we consider a simplified version of the ring-network game, which involves two players seeking to maximize their own payoff. Each player has two available actions, and the payoff matrix is presented below.
Player 1 \ Player 2 | Strategy A | Strategy B |
---|---|---|
Strategy X | (15,10) | (5,5) |
Strategy Y | (0,5) | (10,0) |
If Player 2 is rational, they must choose A because B is strictly dominated. If Player 1 is rational, they may choose either X or Y: X is the best response if Player 1 believes that Player 2 will choose A, while Y is the best response if Player 1 believes that Player 2 will choose B. If Player 1 satisfies second-order rationality, they must play X. To neutralize biases in large language models (LLMs) related to the naming of actions, we reverse the action names in half of the experiments.
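The dominance argument above can be made concrete with a short Python sketch; the payoffs are those of the matrix above, and the helper names are illustrative rather than the project's code.

```python
# Minimal sketch of first- and second-order rationality in the simplified ring-network game.
# Payoffs are (Player 1, Player 2), taken from the matrix above.
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}
P1_ACTIONS, P2_ACTIONS = ["X", "Y"], ["A", "B"]

def undominated_p2_actions():
    """First-order rationality: discard Player 2 actions that are strictly dominated."""
    return [a for a in P2_ACTIONS
            if not any(all(PAYOFFS[(x, b)][1] > PAYOFFS[(x, a)][1] for x in P1_ACTIONS)
                       for b in P2_ACTIONS if b != a)]

def second_order_rational_p1():
    """Second-order rationality: Player 1's best response, assuming Player 2 is rational."""
    expected = undominated_p2_actions()[0]          # here: "A", since B is strictly dominated
    return max(P1_ACTIONS, key=lambda x: PAYOFFS[(x, expected)][0])  # here: "X"
```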
We consider three types of beliefs:
- an implicit belief, where the optimal action must be deduced from the natural language description of the payoff matrix;
- an explicit belief, based on the analysis of Player 2's actions, i.e. the fact that B is strictly dominated by A is provided in the prompt;
- a given belief, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.
First-Order Rationality
Table below evaluates the models’ ability to generate rational behaviour for Player 2.
Model | Generation | Given | Explicit | Implicit |
---|---|---|---|---|
gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 |
mixtral:8x7b | strategy | 1.00 | 1.00 | 1.00 |
mistral-small | strategy | 1.00 | 1.00 | 1.00 |
llama3.3:latest | strategy | 1.00 | 1.00 | 0.50 |
llama3 | strategy | 0.50 | 0.50 | 0.50 |
deepseek-r1:7b | strategy | - | - | - |
deepseek-r1 | strategy | - | - | - |
gpt-4.5 | actions | 1.00 | 1.00 | 1.00 |
mixtral:8x7b | actions | 1.00 | 1.00 | 1.00 |
mistral-small | actions | 1.00 | 1.00 | 0.87 |
llama3.3:latest | actions | 1.00 | 1.00 | 1.00 |
llama3 | actions | 1.00 | 0.90 | 0.17 |
deepseek-r1:7b | actions | 1.00 | 1.00 | 1.00 |
deepseek-r1 | actions | 0.83 | 0.57 | 0.60 |
When generating strategies, GPT-4.5, Mixtral-8x7B, and Mistral-Small exhibit rational behaviour, whereas Llama3 chooses at random. Llama3.3:latest shows the same random behaviour, but only with implicit beliefs. DeepSeek-R1:7b and DeepSeek-R1 fail to generate valid strategies. When generating actions, GPT-4.5, Mixtral-8x7B, DeepSeek-R1:7b, and Llama3.3:latest demonstrate strong rational decision-making, even with implicit beliefs. Mistral-Small performs well but lags slightly in handling implicit reasoning. Llama3 struggles with implicit reasoning, while DeepSeek-R1 shows inconsistent performance. Overall, GPT-4.5 and Mixtral-8x7B are the most reliable models for generating rational behaviour.
Second-Order Rationality
To adjust the difficulty of optimal decision-making, we define four variants of the payoff matrix for Player 1 in the table below: (a) the original configuration, (b) the reduction of the gap between the gains, (c) the decrease in the gain for the good choice X, and (d) the increase in the gain for the bad choice Y.
Version | a | | b | | c | | d | |
---|---|---|---|---|---|---|---|---|
Player 1 \ Player 2 | A | B | A | B | A | B | A | B |
X | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
Y | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
Table below evaluates the models' ability to generate second-order rational behaviour for player 1.
When generating strategies, GPT-4.5 consistently exhibits second-order rational behavior in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, showing no strong pattern of rational behavior. In contrast, Mistral-Small and Mixtral-8x7B demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior. Llama3.3:latest performs well with given and explicit beliefs but struggles with implicit beliefs. DeepSeek-R1 does not produce valid responses in strategy generation.
When generating actions, Llama3.3:latest adapts well to different types of beliefs and adjustments in the payoff matrix but struggles with implicit beliefs, particularly in configuration (d). GPT-4.5 performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. Mixtral-8x7B generally performs well but shows reduced accuracy for implicit beliefs in configurations (b) and (d). Mistral-Small performs well with given or explicit beliefs but struggles with implicit beliefs, particularly in configuration (d). DeepSeek-R1:7b, in contrast to its smaller counterpart, performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d). Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for implicit beliefs.
Version | | a | | | b | | | c | | | d | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Model | Generation | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit |
gpt-4.5 | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
llama3.3:latest | strategy | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 |
llama3 | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
mixtral:8x7b | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
mistral-small | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
deepseek-r1:7b | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
deepseek-r1 | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
gpt-4.5 | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
llama3.3:latest | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.20 | 1.00 | 1.00 | 0.00 |
llama3 | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
mixtral:8x7b | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.73 |
mistral-small | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
deepseek-r1:7b | actions | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.93 | 0.96 | 1.00 | 0.92 | 0.96 | 1.00 | 0.79 |
deepseek-r1 | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
Irrational decisions are explained by inference errors based on the natural language description of the payoff matrix. For example, in variant (d), the Mistral-Small model with given beliefs justifies its poor decision as follows: "Since player 2 is rational and A strictly dominates B, player 2 will choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y (40). Therefore, choosing Y maximizes my gain."
In summary, Mixtral-8x7B and GPT-4.5 demonstrate the strongest performance in both first- and second-order rationality, though GPT-4.5 struggles with near-optimal decisions and Mixtral-8x7B has reduced accuracy with implicit beliefs. Mistral-Small also performs well but faces difficulties with implicit beliefs, particularly in second-order reasoning. Llama3.3:latest succeeds when given explicit or given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex decision-making. DeepSeek-R1:7b shows strong first-order rationality but its performance declines with implicit beliefs, especially in second-order rationality tasks. In contrast, DeepSeek-R1 and Llama3 exhibit inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
Beliefs
Beliefs — whether implicit, explicit, or given — are crucial for an autonomous agent's decision-making process. They allow for anticipating the actions of other agents.
Refine Beliefs
To assess the agents' ability to refine their beliefs in predicting their interlocutor's next action, we consider a simplified version of the Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repetition model;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for N rounds, and the player's accuracy is evaluated at each round.
For our experiments, we consider three simple models for the opponent where:
- the actions remain constant in the form of R, S, or P, respectively;
- the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
- the opponent's actions follow a three-step loop model (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by calculating the average number of points earned per round. A sketch of these opponent models and of this scoring is given below.
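For illustration, the three opponent models and the per-round scoring can be written as a short Python sketch; the names and the `predict(history)` interface are assumptions for exposition, not the prompts or evaluation code used in the experiments.

```python
import itertools

# Three hidden opponent models: constant, two-step loop, and three-step loop.
def constant_opponent(move="R"):
    return itertools.repeat(move)

def two_step_opponent(loop=("R", "P")):
    return itertools.cycle(loop)

def three_step_opponent(loop=("R", "P", "S")):
    return itertools.cycle(loop)

def evaluate(predict, opponent_moves, rounds=10):
    """Average points per round: 1 point for correctly predicting the opponent's move."""
    history, score = [], 0
    for _, move in zip(range(rounds), opponent_moves):
        if predict(history) == move:   # the predictor only sees past opponent moves
            score += 1
        history.append(move)
    return score / rounds
```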
The figures below present the average points earned per round and the 95% confidence interval for each LLM against the three opponent behaviour models in this simplified version of the Rock-Paper-Scissors (RPS) game, whether the LLM generates a strategy or one-shot actions. Neither Llama3 nor DeepSeek-R1 was able to generate a valid strategy. DeepSeek-R1:7b was unable to generate either a valid strategy or consistently valid actions. The strategies generated by the GPT-4.5 and Mistral-Small models attempt to predict the opponent's next move based on previous rounds by identifying the most frequently played move. While these strategies are effective against an opponent with constant behaviour, they fail to predict the opponent's next move when the latter adopts a more complex model.
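Such a frequency-based predictor can be sketched in a few lines (an illustrative reconstruction, not any model's verbatim output). Plugged into the evaluation loop above, it scores well against a constant opponent but does not anticipate the cyclic patterns.

```python
from collections import Counter

def most_frequent_predictor(history):
    """Predict that the opponent will repeat their most frequently played move so far."""
    if not history:
        return "R"  # arbitrary opening guess
    return Counter(history).most_common(1)[0][0]
```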
We observe that the performance of most LLMs in action generation, except for Llama3.3:latest, Mixtral:8x7b, and Mistral-Small when facing a constant strategy, is barely better than a random strategy.
Assess Beliefs
To assess the agents' ability to factor the prediction of their opponent's next move into their decision-making, we analyse the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.
The figure below illustrates the average points earned per round along with the 95% confidence interval for each LLM facing constant strategies when the model generates one-shot actions. Even though Mixtral:8x7b and Mistral-Small accurately predict their opponent's move, they fail to integrate this belief into their decision-making process. Only Llama3.3:latest is capable of inferring the opponent's behaviour to choose the winning move.
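Turning an accurate prediction into a win only requires mapping the predicted move to the move that beats it, which is precisely the step most models fail to perform. A minimal sketch with illustrative names:

```python
BEATS = {"R": "P", "P": "S", "S": "R"}  # the move that beats each key

def best_response(predicted_move):
    """Play the move that beats the predicted opponent move."""
    return BEATS[predicted_move]

def payoff(my_move, opponent_move):
    """Score a round: 2 for a win, 1 for a draw, 0 for a loss."""
    if my_move == opponent_move:
        return 1
    return 2 if BEATS[opponent_move] == my_move else 0
```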
In summary, generative autonomous agents struggle to anticipate or effectively incorporate other agents’ actions into their decision-making.
Synthesis
Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of decision-making. GPT-4.5, Llama3.3:latest, and DeepSeek-R1:7b demonstrate the highest level of consistency in economic decision-making, followed by Mistral-Small and Mixtral:8x7b, while Llama3 shows moderate adherence and DeepSeek-R1 displays considerable inconsistency.
GPT-4.5, Llama3, and Mistral-Small generally align well with declared preferences, particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency. In contrast, DeepSeek-R1 fails to generate valid strategies and performs poorly in aligning actions with specified preferences. GPT-4.5 and Mistral-Small consistently display rational behavior at both first- and second-order levels. Llama3, although prone to random behavior when generating strategies, adapts more effectively in one-shot decision-making tasks. DeepSeek-R1 underperforms significantly in both strategic and one-shot formats, rarely exhibiting coherent rationality.
All models, regardless of size or architecture, struggle to anticipate or incorporate the behaviours of other agents into their own decisions. Although some are able to identify patterns, most fail to translate these beliefs into optimal responses. Only Llama3.3:latest shows a reliable ability to infer and act on an opponent's simple behaviour.
Authors
Maxime MORGE
License
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.