Commit 67d325a6 authored by Maxime MORGE

Test GPT-4.5 with the game "Guess the Next Move"

parent e06b2275
<?xml version="1.0" encoding="UTF-8"?>
<module version="4">
  <component name="PyDocumentationSettings">
    <option name="format" value="PLAIN" />
    <option name="myDocStringFormat" value="Plain" />
  </component>
</module>
\ No newline at end of file
@@ -3,13 +3,6 @@
  <component name="CsvFileAttributes">
    <option name="attributeMap">
      <map>
-        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
-          <value>
-            <Attribute>
-              <option name="separator" value="," />
-            </Attribute>
-          </value>
-        </entry>
        <entry key="$PROJECT_DIR$/data/dictator/dictator_setup.csv">
          <value>
            <Attribute>
@@ -17,14 +10,14 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.old.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.1.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
@@ -38,13 +31,6 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.b.old.csv">
-          <value>
-            <Attribute>
-              <option name="separator" value="," />
-            </Attribute>
-          </value>
-        </entry>
        <entry key="$PROJECT_DIR$/data/ring/ring.1.c.csv">
          <value>
            <Attribute>
@@ -52,77 +38,77 @@
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.c.old.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.csv">
+        <entry key="$PROJECT_DIR$/data/ring/ring.2.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.1.d.old.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.a.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.a.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.b.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.c.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/data/ring/ring.2.old.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.d.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.a.csv">
+        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.2.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.b.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.1.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.c.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.2.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.1.d.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
            </Attribute>
          </value>
        </entry>
-        <entry key="$PROJECT_DIR$/figures/ring/ring_accuracy.2.csv">
+        <entry key="$PROJECT_DIR$/data/guess/guess.old.csv">
          <value>
            <Attribute>
              <option name="separator" value="," />
...
Mistral-small shows the best alignment with altruistic preferences, while maintaining
performance across the other preferences. Deepseek-r1 is most capable of aligning with utilitarian preferences,
but performs poorly in aligning with other preferences.
## Ring-network game

A player is rational if she plays a best response to her beliefs.
She satisfies second-order rationality if she is rational and also believes that the other players are rational.
In other words, a second-order rational agent not only considers the best course of action for herself
but also anticipates how others make their decisions.
The experiments conducted by Kneeland (2015) demonstrate that 93% of the subjects are rational,
while 71% exhibit second-order rationality.

**[Identifying Higher-Order Rationality](https://doi.org/10.3982/ECTA11983)**
...

The corresponding payoff matrix is shown below:

...
If Player 2 is rational, she must choose A, as B is strictly dominated (i.e., B is never a best response to any beliefs Player 2 may hold).
If Player 1 is rational, she can choose either X or Y, since X is the best response if she believes Player 2 will play A, and
Y is the best response if she believes Player 2 will play B.
If Player 1 satisfies second-order rationality (i.e., she is rational and believes Player 2 is rational), then she must play Strategy X.
This is because Player 1, believing that Player 2 is rational, must also believe Player 2 will play A, and
since X is the best response to A, Player 1 will choose X.
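
To make the argument concrete, here is a minimal Python sketch of the dominance check with hypothetical payoff values (the actual matrix is not reproduced in this excerpt, so the numbers are placeholders chosen to match the reasoning above):

```python
# Payoffs indexed by (Player 1 action, Player 2 action); values are hypothetical.
p2_payoff = {("X", "A"): 2, ("X", "B"): 0,   # Player 2's payoffs
             ("Y", "A"): 2, ("Y", "B"): 0}
p1_payoff = {("X", "A"): 3, ("X", "B"): 0,   # Player 1's payoffs: X best vs A,
             ("Y", "A"): 0, ("Y", "B"): 1}   # Y best vs B

# B is strictly dominated for Player 2 if A pays more against every Player 1 action.
b_dominated = all(p2_payoff[(x, "A")] > p2_payoff[(x, "B")] for x in ("X", "Y"))

# A second-order rational Player 1 therefore best-responds to A.
best_vs_A = max(("X", "Y"), key=lambda x: p1_payoff[(x, "A")])

print(b_dominated, best_vs_A)  # True, 'X' under these hypothetical payoffs
```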
We establish three types of belief:

- *implicit* belief, where the optimal action must be deduced from the description of the payoff matrix in natural language;
- *explicit* belief, where the prompt points out that action B of Player 2 is strictly dominated by A;
- *given* belief, where the optimal action of Player 1 is explicitly stated in the prompt (illustrated by the hypothetical prompt sketches below).
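
The three conditions can be thought of as three prompt variants; the templates below are hypothetical illustrations of the distinction, not the exact prompts used in the experiments:

```python
# Hypothetical prompt templates, illustrative wording only.
BELIEF_PROMPTS = {
    "implicit": "Here is the payoff matrix of the game: {matrix}. "
                "Which action do you choose, X or Y?",
    "explicit": "Player 2's action B is strictly dominated by A. "
                "Which action do you choose, X or Y?",
    "given":    "Your optimal action is X. Which action do you choose, X or Y?",
}
```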
### Player 2

The models evaluated include GPT-4.5 (gpt-4.5-preview-2025-02-27), Mistral-Small, Llama3, and DeepSeek-R1.
The results indicate how well each model performs under each belief type.
| Model | Given | Explicit | Implicit |
|-------|-------|----------|----------|
| gpt-4.5-preview-2025-02-27 | 1.00 | 1.00 | 1.00 |
| llama3 | 1.00 | 0.90 | 0.17 |
| mistral-small | 1.00 | 1.00 | 0.87 |
| deepseek-r1 | 0.83 | 0.57 | 0.60 |
GPT-4.5 achieves a perfect score across all belief types,
demonstrating an exceptional ability to make rational decisions, even in the implicit belief condition.
Mistral-Small consistently outperforms the other open-weight models across all belief types.
Its strong performance with implicit belief indicates that it can effectively
deduce the optimal action from the payoff matrix description.
Llama3 performs well with a given belief, but significantly underperforms with an implicit belief,
suggesting it may struggle to infer optimal actions solely from natural language descriptions.
DeepSeek-R1 shows the weakest performance, particularly with explicit beliefs,
indicating it may not be as good a candidate as the other models for simulating rationality.
### Player 1

In order to adjust the difficulty of taking the optimal
action, we consider 4 versions of the player’s payoff matrix:
- a. the original setup;
- b. we reduce the difference in payoffs;
- c. we increase the expected payoff for the incorrect choice Y;
- d. we decrease the expected payoff for the correct choice X.
| Model | | a: Given | a: Explicit | a: Implicit | | b: Given | b: Explicit | b: Implicit | | c: Given | c: Explicit | c: Implicit | | d: Given | d: Explicit | d: Implicit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| … | | … | … | … | | … | … | … | | … | … | … | | … | … | … |
| mistral-small | | 0.93 | 0.97 | 1.00 | | 0.87 | 0.77 | 0.60 | | 0.77 | 0.60 | 0.70 | | 0.73 | 0.57 | 0.37 |
| deepseek-r1 | | 0.80 | 0.53 | 0.57 | | 0.67 | 0.60 | 0.53 | | 0.67 | 0.63 | 0.47 | | 0.70 | 0.50 | 0.57 |
GPT-4.5 achieves perfect performance in the standard (a) setup but struggles significantly with implicit belief
when the payoff structure changes (b, c, d). This suggests that while it excels when conditions are straightforward,
it is confused by the altered payoffs.
Llama3 demonstrates the most consistent and robust performance, capable of adapting to various belief types
and adjusted payoff matrices.
Mistral-Small, while performing well with given and explicit beliefs, faces challenges in implicit belief, particularly in version (d).
DeepSeek-R1 appears to be the least capable, suggesting it may not be an ideal candidate for modeling second-order rationality.
## Guess the Next Move

In order to evaluate the ability of LLMs to predict the opponent’s next move, we consider a
simplified version of the Rock-Paper-Scissors game.
Rules:
1. The opponent follows a hidden strategy (a repeating pattern).
2. The player must predict the opponent’s next move (Rock, Paper, or Scissors).
3. A correct guess earns 1 point; an incorrect guess earns 0 points.
4. The game runs for N rounds, and the player’s accuracy is evaluated at each round (see the sketch below).
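
A minimal sketch of one such game in Python: the pattern lambdas mirror the `opponent_strategies` defined in guess.py, while `play_game` and `random_predict` are hypothetical helpers added here for illustration.

```python
import random

# Deterministic opponents, modeled on the lambdas in guess.py.
OPPONENT_PATTERNS = {
    "constant": lambda history: "Rock",                              # always the same move
    "2-loop":   lambda history: ["Rock", "Paper"][len(history) % 2],
    "3-loop":   lambda history: ["Rock", "Paper", "Scissors"][len(history) % 3],
}

def play_game(predict, pattern="3-loop", n_rounds=10):
    """Play one game; return per-round points (1 = correct guess, 0 = incorrect)."""
    history, points = [], []
    opponent = OPPONENT_PATTERNS[pattern]
    for _ in range(n_rounds):
        move = opponent(history)                  # opponent's hidden, deterministic move
        points.append(1 if predict(history) == move else 0)
        history.append(move)
    return points

# Random baseline; an LLM-backed predictor would replace this function.
def random_predict(history):
    return random.choice(["Rock", "Paper", "Scissors"])

print(play_game(random_predict))  # e.g. [0, 1, 0, 1, 0, 0, 1, 0, 0, 1]
```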
We evaluate the performance of the models (GPT-4.5, Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns. The 95% confidence interval is also shown.
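
The 95% intervals can be derived from the 30 repetitions at each round; below is a minimal sketch assuming a normal approximation (the repository may compute them differently):

```python
import statistics

def mean_ci95(samples):
    """Mean and 95% confidence half-width (normal approximation, 1.96 * standard error)."""
    mean = statistics.mean(samples)
    half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
    return mean, half_width

# Hypothetical points earned at one given round across 30 games.
round_points = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0] * 3
print(mean_ci95(round_points))  # (0.5, ~0.18)
```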
We observe that the performance of LLMs, whether proprietary or open-weight, is barely better than that of a random strategy.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)
![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)
## Rock-Paper-Scissors

To evaluate the ability of LLMs not only to predict the opponent’s next move but also to act rationally
on that prediction, we consider the Rock-Paper-Scissors (RPS) game.
RPS is a simultaneous, zero-sum game for two players.
The rules of RPS are simple: rock beats scissors, scissors beat paper, and paper beats rock;
if both players take the same action, the game is a tie. Scoring is as follows:
a win earns 2 points, a tie earns 1 point, and a loss earns 0 points.
The objective in RPS is straightforward: win by selecting the optimal action
based on the opponent’s move. Since the rules are simple and deterministic,
an LLM that correctly predicts its opponent’s move can always make the correct choice. Therefore, RPS serves as a tool to
assess an LLM’s ability to identify and capitalize on patterns in an opponent’s
non-random behavior, as the sketch below illustrates.
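
A minimal sketch of the scoring and of this deterministic counter-move logic; `BEATS`, `COUNTER`, and `score` are hypothetical names introduced here, not repository code:

```python
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}  # what each move beats
COUNTER = {beaten: winner for winner, beaten in BEATS.items()}      # move that beats each move

def score(player, opponent):
    """Win earns 2 points, a tie 1 point, a loss 0 points."""
    if player == opponent:
        return 1
    return 2 if BEATS[player] == opponent else 0

predicted = "Rock"  # e.g. the predicted next move of an always-Rock opponent
best = COUNTER[predicted]
print(best, score(best, predicted))  # Paper 2
```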
For a fine-grained analysis of the ability of LLMs to identify an
opponent’s patterns, we set up three simple patterns:
1. the opponent’s actions remain constant (always R, always S, or always P);
2. the opponent’s actions loop in a 2-step pattern (R-P, P-S, or S-R);
3. the opponent’s actions loop in a 3-step pattern (R-P-S).
We evaluate the performance of the models (Llama3, Mistral-Small, and DeepSeek-R1)
in identifying these patterns by calculating the average points earned per round.
The temperature is fixed at 0.7, and each game of 10 rounds is played 30 times.
The figures below present the average points earned per round for each model against
the three opponent patterns. The 95% confidence interval is also shown.
We observe that the performance of LLMs is barely better than that of a random strategy.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)
![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_2loop.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)
## Authors

Maxime MORGE
Model,Given,Explicit,Implicit
deepseek-r1,0.8333333333333334,0.5666666666666667,0.6
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,1.0,0.9,0.16666666666666666
mistral-small,1.0,1.0,0.8666666666666667
@@ -8,7 +8,7 @@ CSV_FILE_PATH = "../../data/guess/guess.csv"
 # Define RPS Constant Experiment class
 class GuessExperiment:
     def __init__(self):
-        self.models = ["random", "llama3", "mistral-small", "deepseek-r1"] # You can also add "gpt-4.5-preview-2025-02-27" ,
+        self.models = ["random", "llama3", "mistral-small", "deepseek-r1"] # You can also add "gpt-4.5-preview-2025-02-27"
         self.opponent_strategies = {
             "always_rock": lambda history: "Rock",
             "always_paper": lambda history: "Paper",
...