Commit b0169be4 authored by Maxime Morge

PyGAAMAS: evaluate CR except for GPT-4.5

parent 07028013
Showing 5117 additions and 3171 deletions
<?xml version="1.0" encoding="UTF-8"?>
<module version="4">
  <component name="PyDocumentationSettings">
    <option name="format" value="PLAIN" />
    <option name="myDocStringFormat" value="Plain" />
  </component>
</module>
...@@ -259,24 +259,27 @@ the good choice X.
| **X** | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
| **Y** | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
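The second-order reasoning being evaluated can be sketched in a few lines of Python: player 1 anticipates player 2's best response to each of its own actions, then picks the action that maximises its own payoff under those anticipated responses. Player 1's payoffs below follow variant (a); player 2's payoffs are hypothetical, since they do not appear in this excerpt.

```python
# Minimal sketch of second-order rationality in a 2x2 game.
# P1 uses variant (a) above; P2 is assumed (not shown in this excerpt).
P1 = {("X", "A"): 15, ("X", "B"): 5, ("Y", "A"): 0, ("Y", "B"): 10}
P2 = {("X", "A"): 10, ("X", "B"): 0, ("Y", "A"): 5, ("Y", "B"): 2}  # hypothetical

def best_response_p2(a1: str) -> str:
    """Player 2's payoff-maximising reply to player 1's action a1."""
    return max(("A", "B"), key=lambda a2: P2[(a1, a2)])

def second_order_choice_p1() -> str:
    """Player 1 anticipates player 2's best response to each of its actions."""
    return max(("X", "Y"), key=lambda a1: P1[(a1, best_response_p2(a1))])

print(second_order_choice_p1())  # X, the good choice
```

With these assumed payoffs for player 2, the computation recovers X as the second-order rational action, matching the intended good choice.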
We introduce a prompt-engineering method based on Conditional Reasoning (CR): the model is prompted to evaluate
the opponent's optimal response to each of its own possible actions, encouraging strategic foresight and
informed decision-making.
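Such a CR prompt can be assembled mechanically from the action set. The helper below is an illustrative sketch, not the exact PyGAAMAS prompt wording:

```python
# Illustrative CR prompt builder (the wording is assumed, not PyGAAMAS's own).
def build_cr_prompt(actions: list[str], payoff_description: str) -> str:
    # One conditional-reasoning step per available action.
    steps = "\n".join(
        f"- If you play {a}, what is your opponent's optimal response, "
        f"and what payoff do you then receive?"
        for a in actions
    )
    return (
        f"{payoff_description}\n"
        "Before choosing, reason conditionally:\n"
        f"{steps}\n"
        "Then select the action that maximises your payoff "
        "given the opponent's optimal responses."
    )

prompt = build_cr_prompt(["X", "Y"], "Payoffs: ...")
print(prompt)
```

The point of the construction is that the model is forced to enumerate the opponent's best reply to each of its own actions before committing to one.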
The table below evaluates the models' ability to generate second-order rational behaviour for player 1. The
configurations where CR improves second-order rationality are in bold, and those where CR degrades it are in italics.
When the models generate strategies, <tt>GPT-4.5</tt> exhibits second-order
rational behaviour in configurations (a), (c), and (d), but fails in
configuration (b) to distinguish the optimal action from a nearly optimal one.
<tt>Llama3</tt> makes its decision randomly, while <tt>Mistral-Small</tt> shows strong
capabilities in generating second-order rational behaviour. <tt>DeepSeek-R1</tt>
does not produce valid responses.
When generating actions, <tt>Llama3</tt> adapts to different types of beliefs
and to adjustments in the payoff matrix. <tt>GPT-4.5</tt> performs well in the
initial configuration (a), but encounters significant difficulties when the
payoff structure changes (b, c, d), particularly with implicit beliefs. Although
<tt>Mistral-Small</tt> works well with given or explicit beliefs, it faces
difficulties with implicit beliefs, especially in variant (d).
<tt>DeepSeek-R1</tt> does not appear to be a good candidate for simulating
second-order rationality.
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations
...@@ -295,6 +298,13 @@ implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in
performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> struggles with lower accuracy overall, particularly for implicit beliefs.
It is worth noticing that CR is not universally beneficial: while it notably improves reasoning in smaller models
(like <tt>Mistral-Small</tt> and <tt>DeepSeek-R1</tt>), especially under implicit and explicit belief conditions,
it often harms performance in larger models (e.g., <tt>Llama3.3</tt>, <tt>Mixtral:8x7b</tt>),
where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting
that its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively,
particularly in less confident or under-specified contexts.
| **Model** | **Generation** | **a** | | | **b** | | | **c** | | | **d** | | |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| | | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit |
...@@ -307,12 +317,22 @@ Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for i
| **deepseek-r1:7b** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| | actions + CR | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **llama3.3:latest** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.20 | 1.00 | 1.00 | 0.00 |
| | actions + CR | 1.00 | 1.00 | *0.96* | *0.96* | 1.00 | **0.96** | 1.00 | 1.00 | **0.80** | 1.00 | 1.00 | **0.90** |
| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| | actions + CR | *0.90* | *0.90* | *0.86* | *0.50* | *0.50* | *0.50* | *0.76* | 0.96 | *0.70* | *0.67* | *0.83* | 0.67 |
| **mixtral:8x7b** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.73 |
| | actions + CR | 1.00 | *0.96* | 1.00 | 1.00 | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | *0.28* |
| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| | actions + CR | **1.00** | *0.93* | 1.00 | **0.95** | **0.96** | **0.90** | **0.90** | **0.76** | *0.43* | *0.67* | *0.40* | 0.37 |
| **deepseek-r1:7b** | actions | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.93 | 0.96 | 1.00 | 0.92 | 0.96 | 1.00 | 0.79 |
| | actions + CR | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | **1.00** | *0.90* | 1.00 | **1.00** | **1.00** | 1.00 | **1.00** |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.56 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
| | actions + CR | 0.80 | **0.63** | **0.60** | 0.67 | **0.63** | **0.70** | 0.67 | **0.70** | **0.50** | *0.63* | **0.76** | **0.70** |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
...
Model,Given,Explicit,Implicit
deepseek-r1,0.8,0.6333333333333333,0.6
deepseek-r1:7b,1.0,1.0,1.0
llama3,0.9,0.9,0.8666666666666667
llama3.3:latest,0.975,1.0,0.975
mistral-small,1.0,0.9333333333333333,1.0
mixtral:8x7b,1.0,0.975,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6333333333333333,0.7
deepseek-r1:7b,1.0,1.0,1.0
llama3,0.5,0.5,0.5
llama3.3:latest,0.9666666666666667,1.0,0.9666666666666667
mistral-small,0.9666666666666667,0.9666666666666667,0.9
mixtral:8x7b,1.0,1.0,1.0
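Assuming the two CSV tables above report per-model accuracy by belief type under two conditions (their filenames are collapsed in this diff view), comparing them row by row is straightforward. A minimal sketch, shown on the `llama3` rows from each table:

```python
import csv
import io

def load(text: str) -> dict:
    """Parse a Model,Given,Explicit,Implicit CSV into nested dicts of floats."""
    return {
        row["Model"]: {k: float(v) for k, v in row.items() if k != "Model"}
        for row in csv.DictReader(io.StringIO(text))
    }

# One row from each of the two tables above (full tables omitted for brevity).
a = load("Model,Given,Explicit,Implicit\nllama3,0.9,0.9,0.8666666666666667\n")
b = load("Model,Given,Explicit,Implicit\nllama3,0.5,0.5,0.5\n")

# Per-model, per-belief-type accuracy difference between the two tables.
delta = {m: {k: a[m][k] - b[m][k] for k in a[m]} for m in a if m in b}
print(delta["llama3"]["Given"])  # 0.4 for this pair of rows
```

The same `delta` dictionary, built over the full tables, makes the per-belief-type gap between the two conditions immediately visible for every model.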