Commit 67d325a6 authored by Maxime MORGE

Test GPT-4.5 with the game "Guess the Next Move"

parent e06b2275
<?xml version="1.0" encoding="UTF-8"?>
<module version="4">
  <component name="PyDocumentationSettings">
    <option name="format" value="PLAIN" />
    <option name="myDocStringFormat" value="Plain" />
  </component>
</module>
\ No newline at end of file
@@ -3,13 +3,6 @@
  <component name="CsvFileAttributes">
    <option name="attributeMap">
      <map>
-        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
-          <value>
-            <Attribute>
-              <option name="separator" value="," />
-            </Attribute>
-          </value>
-        </entry>
        <entry key="$PROJECT_DIR$/data/dictator/dictator_setup.csv">
          <value>
            <Attribute>
@@ -17,14 +10,14 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.old.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
@@ -38,13 +31,6 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.b.old.csv">
-          <value>
-            <Attribute>
-              <option name="separator" value="," />
-            </Attribute>
-          </value>
-        </entry>
        <entry key="$PROJECT_DIR$/data/ring/ring.1.c.csv">
          <value>
            <Attribute>
@@ -52,77 +38,77 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.c.old.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.2.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.old.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.a.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.b.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.c.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.old.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.d.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.a.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.2.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.b.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.1.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.c.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.2.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.d.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.2.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.old.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
...
Mistral-small shows the best alignment with altruistic preferences, while maintaining
performance across the other preferences. Deepseek-r1 is most capable of aligning with utilitarian preferences,
but performs poorly in aligning with other preferences.
## Ring-network game

A player is rational if she plays a best response to her beliefs.
She satisfies second-order rationality if she is rational and also believes that the other players are rational.
In other words, a second-order rational agent not only considers the best course of action for herself
but also anticipates how others make their decisions.
The experiments conducted by Kneeland (2015) demonstrate that 93% of the subjects are rational,
while 71% exhibit second-order rationality.

**[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)**
...

The corresponding payoff matrix is shown below:

...
If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
If Player 1 is rational, she can choose either X or Y, since X is the best response if she believes Player 2 will play A, and
Y is the best response if she believes Player 2 will play B.
If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A, and
since X is the best response to A, Player 1 will choose X.
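
To make the argument concrete, here is a minimal Python sketch of the dominance check with hypothetical payoff values (the actual matrix is not reproduced in this excerpt, so the numbers are placeholders chosen to match the reasoning above):

```python
# Payoffs indexed by (Player 1 action, Player 2 action); values are hypothetical.
p2_payoff = {("X", "A"): 2, ("X", "B"): 0,   # Player 2's payoffs
             ("Y", "A"): 2, ("Y", "B"): 0}
p1_payoff = {("X", "A"): 3, ("X", "B"): 0,   # Player 1's payoffs: X best vs A,
             ("Y", "A"): 0, ("Y", "B"): 1}   # Y best vs B

# B is strictly dominated for Player 2 if A pays more against every Player 1 action.
b_dominated = all(p2_payoff[(x, "A")] > p2_payoff[(x, "B")] for x in ("X", "Y"))

# A second-order rational Player 1 therefore best-responds to A.
best_vs_A = max(("X", "Y"), key=lambda x: p1_payoff[(x, "A")])

print(b_dominated, best_vs_A)  # True, 'X' under these hypothetical payoffs
```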
We establish three types of belief:

- *implicit* belief, where the optimal action must be deduced from the description of the payoff matrix in natural language;
- *explicit* belief, where the prompt points out that action B of Player 2 is strictly dominated by A;
- *given* belief, where the optimal action of Player 1 is explicitly stated in the prompt (illustrated by the hypothetical prompt sketches below).
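
The three conditions can be thought of as three prompt variants; the templates below are hypothetical illustrations of the distinction, not the exact prompts used in the experiments:

```python
# Hypothetical prompt templates, illustrative wording only.
BELIEF_PROMPTS = {
    "implicit": "Here is the payoff matrix of the game: {matrix}. "
                "Which action do you choose, X or Y?",
    "explicit": "Player 2's action B is strictly dominated by A. "
                "Which action do you choose, X or Y?",
    "given":    "Your optimal action is X. Which action do you choose, X or Y?",
}
```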
### Player 2

The models evaluated include GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1.
The results indicate how well each model performs under each belief type.
| Model | Given | Explicit | Implicit |
|-------|-------|----------|----------|
| gpt-4.5-preview-2025-02-27 | 1.00 | 1.00 | 1.00 |
| llama3 | 1.00 | 0.90 | 0.17 |
| mistral-small | 1.00 | 1.00 | 0.87 |
| deepseek-r1 | 0.83 | 0.57 | 0.60 |
GPT-4.5 achieves a perfect score across all belief types,
demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition.
Mistral-Small consistently outperforms the other open-weight models across all belief types.
Its strong performance with implicit belief indicates that it can effectively
deduce the optimal action from the payoff matrix description.
Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
suggesting it may struggle to infer optimal actions solely from natural language descriptions.
DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
indicating it may not be as good a candidate as the other models for simulating rationality.
### Player 1

In order to adjust the difficulty of taking the optimal
action, we consider 4 versions of the player’s payoff matrix:
- a. the original setup;
- b. we reduce the difference in payoffs;
- c. we increase the expected payoff for the incorrect choice Y;
- d. we decrease the expected payoff for the correct choice X.
| Model | | a: Given | a: Explicit | a: Implicit | | b: Given | b: Explicit | b: Implicit | | c: Given | c: Explicit | c: Implicit | | d: Given | d: Explicit | d: Implicit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| … | | … | … | … | | … | … | … | | … | … | … | | … | … | … |
| mistral-small | | 0.93 | 0.97 | 1.00 | | 0.87 | 0.77 | 0.60 | | 0.77 | 0.60 | 0.70 | | 0.73 | 0.57 | 0.37 |
| deepseek-r1 | | 0.80 | 0.53 | 0.57 | | 0.67 | 0.60 | 0.53 | | 0.67 | 0.63 | 0.47 | | 0.70 | 0.50 | 0.57 |
GPT-4.5 achieves perfect performance in the standard (a) setup but struggles significantly with implicit belief
when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward,
it is confused by the altered payoffs.
Llama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types
and adjusted payoff matrices.
Mistral-Small, while performing well with given and explicit beliefs, faces challenges in implicit belief, particularly in version (d).
DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.
## Guess the Next Move

In order to evaluate the ability of LLMs to predict the opponent’s next move, we consider a
simplified version of the Rock-Paper-Scissors game.
Rules:
1. The opponent follows a hidden strategy (a repeating pattern).
2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
3. A correct guess earns 1 point; an incorrect guess earns 0 points.
4. The game runs for N rounds, and the player’s accuracy is evaluated at each round (see the sketch below).
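
A minimal sketch of one such game in Python: the pattern lambdas mirror the `opponent_strategies` defined in guess.py, while `play_game` and `random_predict` are hypothetical helpers added here for illustration.

```python
import random

# Deterministic opponents, modeled on the lambdas in guess.py.
OPPONENT_PATTERNS = {
    "constant": lambda history: "Rock",                              # always the same move
    "2-loop":   lambda history: ["Rock", "Paper"][len(history) % 2],
    "3-loop":   lambda history: ["Rock", "Paper", "Scissors"][len(history) % 3],
}

def play_game(predict, pattern="3-loop", n_rounds=10):
    """Play one game; return per-round points (1 = correct guess, 0 = incorrect)."""
    history, points = [], []
    opponent = OPPONENT_PATTERNS[pattern]
    for _ in range(n_rounds):
        move = opponent(history)                  # opponent's hidden, deterministic move
        points.append(1 if predict(history) == move else 0)
        history.append(move)
    return points

# Random baseline; an LLM-backed predictor would replace this function.
def random_predict(history):
    return random.choice(["Rock", "Paper", "Scissors"])

print(play_game(random_predict))  # e.g. [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
```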
We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns. The 95% confidence interval is also shown.
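
The 95% intervals can be derived from the 30 repetitions at each round; below is a minimal sketch assuming a normal approximation (the repository may compute them differently):

```python
import statistics

def mean_ci95(samples):
    """Mean and 95% confidence half-width (normal approximation, 1.96 * standard error)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width

# Hypothetical points earned at one given round across 30 games.
round_points = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0] * 3
print(mean_ci95(round_points))  # (0.5, ~0.18)
```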
We observe that the performance of LLMs, whether proprietary or open-weight, is barely better than that of a random strategy.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)
![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)
## Rock-Paper-Scissors

To evaluate the ability of LLMs not only to predict the opponent’s next move but also to act rationally
on that prediction, we consider the Rock-Paper-Scissors (RPS) game.
RPS is a simultaneous, zero-sum game for two players.
The rules of RPS are simple: rock beats scissors, scissors beat paper, and paper beats rock;
if both players take the same action, the game is a tie. Scoring is as follows:
a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the optimal action
based on the opponent’s move. Since the rules are simple and deterministic,
an LLM that correctly predicts its opponent’s move can always make the correct choice. Therefore, RPS serves as a tool to
assess an LLM’s ability to identify and capitalize on patterns in an opponent’s
non-random behavior, as the sketch below illustrates.
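
A minimal sketch of the scoring and of this deterministic counter-move logic; `BEATS`, `COUNTER`, and `score` are hypothetical names introduced here, not repository code:

```python
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}  # what each move beats
COUNTER = {beaten: winner for winner, beaten in BEATS.items()}      # move that beats each move

def score(player, opponent):
    """Win earns 2 points, a tie 1 point, a loss 0 points."""
    if player == opponent:
        return 1
    return 2 if BEATS[player] == opponent else 0

predicted = "Rock"  # e.g. the predicted next move of an always-Rock opponent
best = COUNTER[predicted]
print(best, score(best, predicted))  # Paper 2
```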
For a fine-grained analysis of the ability of LLMs to identify an
opponent’s patterns, we set up three simple patterns:
1. the opponent’s actions remain constant (always R, always S, or always P);
2. the opponent’s actions loop in a 2-step pattern (R-P, P-S, or S-R);
3. the opponent’s actions loop in a 3-step pattern (R-P-S).
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs is barely better than that of a random strategy.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)
![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_2loop.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)
## Authors

Maxime MORGE
Model,Given,Explicit,Implicit
deepseek-r1,0.8333333333333334,0.5666666666666667,0.6
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,1.0,0.9,0.16666666666666666
mistral-small,1.0,1.0,0.8666666666666667
@@ -8,7 +8,7 @@ CSV_FILE_PATH = "../../data/guess/guess.csv"
 # Define RPS Constant Experiment class
 class GuessExperiment:
     def __init__(self):
-        self.models = ["random", "llama3", "mistral-small", "deepseek-r1"] # You can also add "gpt-4.5-preview-2025-02-27" ,
+        self.models = ["random", "llama3", "mistral-small", "deepseek-r1"] # You can also add "gpt-4.5-preview-2025-02-27"
         self.opponent_strategies = {
             "always_rock": lambda history: "Rock",
             "always_paper": lambda history: "Paper",
...