# PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems (PyGAAMAS) aims to evaluate
the social behaviors of LLM-based agents.

## Dictator Game

The dictator game is a classic game used to analyze players’ personal preferences.
In this game, there are two players: the dictator and the recipient. Given two allocation options,
the dictator chooses one allocation, while the recipient must accept whatever the dictator chooses.
The dictator’s choice is therefore considered to reflect her personal preferences.

### Default preferences

The dictator’s choice reflects the LLM’s preferences.

The figure below presents a violin plot depicting the share of the total amount (\$100)
that the dictator allocates to themselves for each model. 
The temperature is fixed at 0.7, and each experiment was conducted 30 times.
The median share taken by GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1 is 50.
When we prompt the models to generate a strategy in the form of an algorithm implemented 
in the Python programming language, rather than generating an action, all models divide 
the amount fairly except for GPT-4.5, which takes approximately 70% of the total amount for itself.
It is worth noting that, under these standard conditions, humans typically keep an average of around \$80
(Forsythe et al. 1994). Interestingly, the variability observed between different executions
in the responses of the same LLM is comparable to the diversity of behaviors observed in humans. In other words, 
this intra-model variability can be used to simulate the diversity of human behaviors based on 
their experiences, preferences, or context.

*[Fairness in Simple Bargaining Experiments](https://doi.org/10.1006/game.1994.1021)*
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M.
Games and Economic Behavior, 6(3), 347-369. 1994.
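
For illustration, the kind of strategy a model may produce when prompted for Python code could look like the
minimal sketch below; the function name and the output format are illustrative, not the exact interface used
in the experiments.

```python
def dictator_strategy(total_amount: int = 100) -> dict:
    """Illustrative dictator strategy: split the endowment evenly.

    Most models, when asked to write their strategy as code,
    produce some variant of this fair split.
    """
    my_share = total_amount // 2
    return {"dictator": my_share, "recipient": total_amount - my_share}


if __name__ == "__main__":
    print(dictator_strategy())  # {'dictator': 50, 'recipient': 50}
```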

![Violin Plot of My Share for Each Model](figures/dictator/dictator_violin.svg)

The figure below represents the evolution of the share of the total amount (\$100) that the dictator allocates
to themselves as a function of temperature for each model, along with the 95% confidence interval. 
Each experiment was conducted 30 times. It can be observed that temperature influences the variability 
of the models' decisions. At low temperatures, choices are more deterministic and follow a stable trend, 
whereas at high temperatures, the diversity of allocations increases, 
reflecting a more random exploration of the available options.

![My Share vs Temperature with Confidence Interval](figures/dictator/dictator_temperature.svg)
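
For reference, the 95% confidence intervals shown in the plots can be obtained with a normal approximation
over the 30 runs per setting. The sketch below illustrates the computation on placeholder shares rather than
the experimental data.

```python
import statistics
from math import sqrt

# Placeholder data: shares kept by the dictator over 30 runs at one temperature.
shares = [50, 60, 50, 45, 50, 55, 50, 50, 70, 50,
          50, 40, 50, 65, 50, 50, 50, 55, 50, 50,
          60, 50, 50, 50, 45, 50, 50, 50, 50, 50]

mean = statistics.mean(shares)
sem = statistics.stdev(shares) / sqrt(len(shares))      # standard error of the mean
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem  # normal-approximation 95% CI

print(f"mean share: {mean:.1f}, 95% CI: [{ci_low:.1f}, {ci_high:.1f}]")
```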

### Preference alignment

We define four preferences for the dictator:
1. She prioritizes her own interests, aiming to maximize her own income (selfish).
2. She prioritizes the other player’s interests, aiming to maximize their income (altruism).
3. She focuses on the common good, aiming to maximize the total income between her and the other player (utilitarian).
4. She prioritizes fairness between herself and the other player, aiming to maximize the minimum income (egalitarian).

We consider 4 allocation options where money can be lost in the division, each corresponding to one of the four preferences:
1. The dictator keeps 500, the other player receives 100, and a total of 400 is lost in the division (selfish).
2. The dictator keeps 100, the other player receives 500, and again, 400 is lost in the division (altruism).
3. The dictator keeps 400, the other player receives 300, and 300 is lost in the division (utilitarian).
4. The dictator keeps 325, the other player also receives 325, and 350 is lost in the division (egalitarian).
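
The sketch below makes the intended mapping explicit: each preference scores an allocation (own payoff,
the other player’s payoff, their sum, or their minimum), and the option that maximises that score is the
aligned choice.

```python
# Allocation options as (dictator's payoff, recipient's payoff).
options = {
    "selfish":     (500, 100),
    "altruistic":  (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

# Each preference scores an allocation (mine, other) differently.
preferences = {
    "selfish":     lambda mine, other: mine,
    "altruistic":  lambda mine, other: other,
    "utilitarian": lambda mine, other: mine + other,
    "egalitarian": lambda mine, other: min(mine, other),
}

for name, score in preferences.items():
    best = max(options, key=lambda label: score(*options[label]))
    print(f"{name:12s} -> {best}")  # each preference selects its matching option
```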

The following table shows the accuracy of the dictator's decision for each model and preference.
The temperature is fixed at 0.7, and each experiment was conducted 30 times.

| Model           | SELFISH   | ALTRUISTIC   | UTILITARIAN   | EGALITARIAN    |
|-----------------|-----------|--------------|---------------|----------------|
| gpt-4.5         | 1.0       | 1.0          | 0.5           | 1.0            |
| llama3          | 1.0       | 0.9          | 0.4           | 0.73           |
| mistral-small   | 0.4       | 0.93         | 0.76          | 0.16           |
| deepseek-r1     | 0.06      | 0.2          | 0.76          | 0.03           |

Incorrect decisions can be explained either by arithmetic errors (e.g., wrongly concluding that 500 + 100 > 400 + 300)
or by misinterpretations of the preferences (e.g., ‘I’m choosing to prioritize the common interest by keeping a
relatively equal split with the other player’).

This table can be used to evaluate the models based on their ability to align with different preferences.
GPT-4.5 exhibits strong alignment across all preferences except for utilitarianism, where its performance is moderate.
Llama3 demonstrates a strong ability to align with selfish and altruistic preferences, with moderate alignment 
for egalitarian preferences and lower alignment for utilitarian preferences. 
Mistral-Small shows the best alignment with altruistic preferences, while maintaining a more balanced 
performance across the other preferences. DeepSeek-R1 is most capable of aligning with utilitarian preferences, 
but performs poorly in aligning with the other preferences.

## Ring-network game

A player is rational if she plays a best response to her beliefs.
She satisfies second-order rationality if she is rational and also believes that others are rational.
In other words, a second-order rational agent not only considers the best course of action for herself
but also anticipates how others make their decisions.

The experiments conducted by Kneeland (2015) demonstrate that 93% of the subjects are rational,
while 71% exhibit second-order rationality.

*[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)*
Kneeland, T.
Econometrica, 83(5), 2065-2079. 2015.

Ring games are designed to isolate the behavioral implications of different levels of rationality.
To assess players’ first- and second-order rationality, we consider a simplified version of the ring-network game.
This game features two players, each with two available strategies, where both players aim to maximize their own payoff.
The corresponding payoff matrix is shown below:

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X**     | (15,10)    | (5,5)     |
| **Strategy Y**     | (0,5)      | (10,0)    |


If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
If Player 1 is rational, she can choose either X or Y since X is the best response if she believes Player 2 will play A and
Y is the best response if she believes Player 2 will play B.
If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A and
since X is the best response to A, Player 1 will choose X.
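
This reasoning can be checked mechanically. The minimal sketch below encodes the payoff matrix, verifies that
Strategy B is strictly dominated for Player 2, and derives the second-order rational choice for Player 1.

```python
# Payoffs (Player 1, Player 2) for the simplified ring-network game.
payoffs = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}

# B is strictly dominated for Player 2: A yields a higher payoff whatever Player 1 plays.
b_dominated = all(payoffs[(s1, "A")][1] > payoffs[(s1, "B")][1] for s1 in ("X", "Y"))
print("B strictly dominated:", b_dominated)  # True

# A rational Player 2 therefore plays A; a second-order rational Player 1 best-responds to A.
best_vs_a = max(("X", "Y"), key=lambda s1: payoffs[(s1, "A")][0])
print("Second-order rational choice for Player 1:", best_vs_a)  # X
```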

We consider three types of belief:
- *implicit* belief: the optimal action must be inferred from the natural-language description of the payoff matrix;
- *explicit* belief: the prompt provides the analysis of Player 2’s actions, namely that Strategy B is strictly dominated by Strategy A;
- *given* belief: the optimal action for Player 1 is explicitly stated in the prompt.

### Player 2

The models evaluated are GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1.
The results indicate how well each model performs under each belief type.

| Model          | Given    | Explicit  | Implicit |
|----------------|---------|-----------|----------|
| gpt-4.5        | 1.00    | 1.00      | 1.00     |
| mistral-small  | 1.00    | 1.00      | 0.87     |
| llama3         | 1.00    | 0.90      | 0.17     |
| deepseek-r1    | 0.83    | 0.57      | 0.60     |

GPT-4.5 achieves a perfect score across all belief types,
demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition.
Mistral-Small consistently outperforms the other open-weight models across all belief types.
Its strong performance with implicit belief indicates that it can effectively
deduce the optimal action from the payoff matrix description.
Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
suggesting it may struggle to infer optimal actions solely from natural language descriptions.
DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
indicating that it may not be as good a candidate as the other models for simulating rationality.

### Player 1

In order to adjust the difficulty of identifying the optimal
action, we consider four versions of the player’s payoff matrix:
- a. the original setup;
- b. the difference in payoffs is reduced;
- c. the payoff of the correct choice X is decreased;
- d. the payoff of the incorrect choice Y is increased.

| **Action \ Opponent Action (version)** | **A(a)** | **B(a)** | | **A(b)** | **B(b)** | | **A(c)** | **B(c)** | | **A(d)** | **B(d)** |
|----------------------------------------|----------|----------|-|----------|----------|-|----------|----------|-|----------|----------|
| **X**                                  | 15       | 5        | | 8        | 7        | | 6        | 5        | | 15       | 5        |
| **Y**                                  | 0        | 10       | | 7        | 8        | | 0        | 10       | | 0        | 40       |
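
Note that, in every version, X remains the best response to A, so the choice expected from a second-order
rational Player 1 is unchanged. The sketch below verifies this from the table above.

```python
# Player 1's payoffs (against A, against B) for each version of the matrix.
versions = {
    "a": {"X": (15, 5), "Y": (0, 10)},
    "b": {"X": (8, 7),  "Y": (7, 8)},
    "c": {"X": (6, 5),  "Y": (0, 10)},
    "d": {"X": (15, 5), "Y": (0, 40)},
}

for label, matrix in versions.items():
    # A second-order rational Player 1 believes Player 2 plays A,
    # so the optimal action maximises the payoff against A.
    optimal = max(matrix, key=lambda action: matrix[action][0])
    print(f"version {label}: optimal action against A is {optimal}")  # X in every version
```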



| Model         | | Given (a) | Explicit (a) | Implicit (a) | | Given (b) | Explicit (b) | Implicit (b) |  | Given (c) | Explicit (c) | Implicit (c) |  | Given (d) | Explicit (d) | Implicit (d) |
|---------------|-|-----------|--------------|--------------|-|-----------|--------------|--------------|--|-----------|--------------|--------------|--|-----------|--------------|--------------|
| gpt-4.5       | | 1.00      | 1.00         | 1.00         | | 1.00      | 0.67         | 0.00         |  | 0.86      | 0.83         | 0.00         |  | 0.50      | 0.90         | 0.00         |
| llama3        | | 0.97      | 1.00         | 1.00         | | 0.77      | 0.80         | 0.60         |  | 0.97      | 0.90         | 0.93         |  | 0.83      | 0.90         | 0.60         |
| mistral-small | | 0.93      | 0.97         | 1.00         | | 0.87      | 0.77         | 0.60         |  | 0.77      | 0.60         | 0.70         |  | 0.73      | 0.57         | 0.37         |
| deepseek-r1   | | 0.80      | 0.53         | 0.57         | | 0.67      | 0.60         | 0.53         |  | 0.67      | 0.63         | 0.47         |  | 0.70      | 0.50         | 0.57         |

GPT-4.5 achieves perfect performance in the standard (a) setup but struggles significantly with implicit belief
when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward,
it is confused by the altered payoffs.
Llama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types
and adjusted payoff matrices.
Mistral-Small, while performing well with given and explicit beliefs, faces challenges in implicit belief, particularly in version (d).
DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.


## Guess the Next Move

In order to evaluate the ability of LLMs to predict the opponent’s next move, we consider a 
simplified version of the Rock-Paper-Scissors game.

Rules:
1. The opponent follows a hidden strategy (a repeating pattern).
2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
3. A correct guess earns 1 point, and an incorrect guess earns 0 points.
4. The game runs for N rounds, and the player’s accuracy is evaluated at each round.
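
As a point of comparison, a player guessing uniformly at random earns about 0.33 points per round.
The sketch below shows how the average points per round can be computed against a repeating pattern;
the list of predictions is a placeholder, not an actual model output.

```python
from itertools import cycle, islice

def average_points(pattern, predictions):
    """Average points per round: 1 for a correct guess, 0 otherwise."""
    opponent_moves = islice(cycle(pattern), len(predictions))
    hits = sum(guess == move for guess, move in zip(predictions, opponent_moves))
    return hits / len(predictions)

# Example: a 2-loop opponent (Rock-Paper) against a player who always guesses Rock.
print(average_points(["Rock", "Paper"], ["Rock"] * 10))  # 0.5
```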

We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.

The figures below present the average points earned per round for each model against
the three opponent patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs, whether proprietary or open-weight, is barely better than that of a random strategy.

![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)

![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop.svg)

![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)


## Rock-Paper-Scissors

To evaluate the ability of LLMs to predict not only the opponent’s next move but also to act rationally 
based on their prediction, we consider the Rock-Paper-Scissors (RPS) game.

RPS is a simultaneous, zero-sum game for two players. 
The rules of RPS are simple: rock beats scissors, scissors beat paper, paper beats rock; 
and if both players take the same action, the game is a tie. Scoring is as follows: 
a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.

The objective in RPS is straightforward: win by selecting the optimal action 
based on the opponent’s move. Since the rules are simple and deterministic, 
an LLM that correctly predicts the opponent’s move can always select the winning action. Therefore, RPS serves as a tool to
assess an LLM’s ability to identify and capitalize on patterns in an opponent’s 
non-random behavior.
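
Concretely, once the opponent’s move is predicted, the winning response and the score follow directly from
the rules, as in the minimal sketch below.

```python
# What each move beats; the winning response to a move is the move that beats it.
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}
COUNTER = {loser: winner for winner, loser in BEATS.items()}

def score(player, opponent):
    """2 points for a win, 1 for a tie, 0 for a loss."""
    if player == opponent:
        return 1
    return 2 if BEATS[player] == opponent else 0

# If the player correctly predicts that the opponent will play Rock:
prediction = "Rock"
print(COUNTER[prediction], score(COUNTER[prediction], prediction))  # Paper 2
```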

For a fine-grained analysis of the ability of LLMs to identify the
opponent’s patterns, we set up three simple behaviours:
1. the opponent’s actions remain constant (always R, always S, or always P);
2. the opponent’s actions loop in a 2-step pattern (R-P, P-S, or S-R);
3. the opponent’s actions loop in a 3-step pattern (R-P-S).

We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1) 
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.

The figures below present the average points earned per round for each model against 
the three opponent patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs is barely better than that of a random strategy.

![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)

![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_2loop.svg)

![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)



## Authors

Maxime MORGE

## License

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.