Commit b0169be4 authored by Maxime Morge

PyGAAMAS: evaluate CR except for GPT-4.5

parent 07028013
Showing 5117 additions and 3171 deletions
<?xml version="1.0" encoding="UTF-8"?>
<module version="4">
  <component name="PyDocumentationSettings">
    <option name="format" value="PLAIN" />
    <option name="myDocStringFormat" value="Plain" />
  </component>
</module>
...@@ -259,24 +259,27 @@ the good choice X.
| **X** | 15 | 5 | 8 | 7 | 6 | 5 | 15 | 5 |
| **Y** | 0 | 10 | 7 | 8 | 0 | 10 | 0 | 40 |
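The second-order reasoning being evaluated can be sketched in a few lines of Python: player 1 anticipates player 2's best response to each of its own actions, then picks the action that maximises its own payoff under those anticipated responses. Player 1's payoffs below follow variant (a); player 2's payoffs are hypothetical, since they do not appear in this excerpt.

```python
# Minimal sketch of second-order rationality in a 2x2 game.
# P1 uses variant (a) above; P2 is assumed (not shown in this excerpt).
P1 = {("X", "A"): 15, ("X", "B"): 5, ("Y", "A"): 0, ("Y", "B"): 10}
P2 = {("X", "A"): 10, ("X", "B"): 0, ("Y", "A"): 5, ("Y", "B"): 2}  # hypothetical

def best_response_p2(a1: str) -> str:
    """Player 2's payoff-maximising reply to player 1's action a1."""
    return max(("A", "B"), key=lambda a2: P2[(a1, a2)])

def second_order_choice_p1() -> str:
    """Player 1 anticipates player 2's best response to each of its actions."""
    return max(("X", "Y"), key=lambda a1: P1[(a1, best_response_p2(a1))])

print(second_order_choice_p1())  # X, the good choice
```

With these assumed payoffs for player 2, the computation recovers X as the second-order rational action, matching the intended good choice.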
We introduce a prompt-engineering method based on Conditional Reasoning (CR): the model is prompted to evaluate
the opponent's optimal response to each of its own possible actions, encouraging strategic foresight and
informed decision-making.
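Such a CR prompt can be assembled mechanically from the action set. The helper below is an illustrative sketch, not the exact PyGAAMAS prompt wording:

```python
# Illustrative CR prompt builder (the wording is assumed, not PyGAAMAS's own).
def build_cr_prompt(actions: list[str], payoff_description: str) -> str:
    # One conditional-reasoning step per available action.
    steps = "\n".join(
        f"- If you play {a}, what is your opponent's optimal response, "
        f"and what payoff do you then receive?"
        for a in actions
    )
    return (
        f"{payoff_description}\n"
        "Before choosing, reason conditionally:\n"
        f"{steps}\n"
        "Then select the action that maximises your payoff "
        "given the opponent's optimal responses."
    )

prompt = build_cr_prompt(["X", "Y"], "Payoffs: ...")
print(prompt)
```

The point of the construction is that the model is forced to enumerate the opponent's best reply to each of its own actions before committing to one.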
The table below evaluates the models' ability to generate second-order rational behaviour for player 1. The
configurations where CR improves second-order rationality are in bold, and those where CR degrades it are in italics.
When the models generate strategies, <tt>GPT-4.5</tt> exhibits second-order
rational behaviour in configurations (a), (c), and (d), but fails in
configuration (b) to distinguish the optimal action from a nearly optimal one.
<tt>Llama3</tt> makes its decision randomly, while <tt>Mistral-Small</tt> shows strong
capabilities in generating second-order rational behaviour. <tt>DeepSeek-R1</tt>
does not produce valid responses.
When generating actions, <tt>Llama3</tt> adapts to different types of beliefs
and to adjustments in the payoff matrix. <tt>GPT-4.5</tt> performs well in the
initial configuration (a), but encounters significant difficulties when the
payoff structure changes (b, c, d), particularly with implicit beliefs. Although
<tt>Mistral-Small</tt> works well with given or explicit beliefs, it faces
difficulties with implicit beliefs, especially in variant (d).
<tt>DeepSeek-R1</tt> does not appear to be a good candidate for simulating
second-order rationality.
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations
...@@ -295,6 +298,13 @@ implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in
performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> struggles with lower accuracy overall, particularly for implicit beliefs.
It is worth noticing that CR is not universally beneficial: while it notably improves reasoning in smaller models
(like <tt>Mistral-Small</tt> and <tt>DeepSeek-R1</tt>), especially under implicit and explicit belief conditions,
it often harms performance in larger models (e.g., <tt>Llama3.3</tt>, <tt>Mixtral:8x7b</tt>),
where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting
that its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively,
particularly in less confident or under-specified contexts.
| **Model** | **Generation** | **a** | | | **b** | | | **c** | | | **d** | | |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| | | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit | Given | Explicit | Implicit |
...@@ -307,12 +317,22 @@ Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for i
| **deepseek-r1:7b** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| | actions + CR | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **llama3.3:latest** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.20 | 1.00 | 1.00 | 0.00 |
| | actions + CR | 1.00 | 1.00 | *0.96* | *0.96* | 1.00 | **0.96** | 1.00 | 1.00 | **0.80** | 1.00 | 1.00 | **0.90** |
| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| | actions + CR | *0.90* | *0.90* | *0.86* | *0.50* | *0.50* | *0.50* | *0.76* | 0.96 | *0.70* | *0.67* | *0.83* | 0.67 |
| **mixtral:8x7b** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.73 |
| | actions + CR | 1.00 | *0.96* | 1.00 | 1.00 | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | *0.28* |
| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| | actions + CR | **1.00** | *0.93* | 1.00 | **0.95** | **0.96** | **0.90** | **0.90** | **0.76** | *0.43* | *0.67* | *0.40* | 0.37 |
| **deepseek-r1:7b** | actions | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.93 | 0.96 | 1.00 | 0.92 | 0.96 | 1.00 | 0.79 |
| | actions + CR | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | **1.00** | *0.90* | 1.00 | **1.00** | **1.00** | 1.00 | **1.00** |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.56 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
| | actions + CR | 0.80 | **0.63** | **0.60** | 0.67 | **0.63** | **0.70** | 0.67 | **0.70** | **0.50** | *0.63* | **0.76** | **0.70** |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
...
Model,Given,Explicit,Implicit
deepseek-r1,0.8,0.6333333333333333,0.6
deepseek-r1:7b,1.0,1.0,1.0
llama3,0.9,0.9,0.8666666666666667
llama3.3:latest,0.975,1.0,0.975
mistral-small,1.0,0.9333333333333333,1.0
mixtral:8x7b,1.0,0.975,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6333333333333333,0.7
deepseek-r1:7b,1.0,1.0,1.0
llama3,0.5,0.5,0.5
llama3.3:latest,0.9666666666666667,1.0,0.9666666666666667
mistral-small,0.9666666666666667,0.9666666666666667,0.9
mixtral:8x7b,1.0,1.0,1.0
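Assuming the two CSV tables above report per-model accuracy by belief type under two conditions (their filenames are collapsed in this diff view), comparing them row by row is straightforward. A minimal sketch, shown on the `llama3` rows from each table:

```python
import csv
import io

def load(text: str) -> dict:
    """Parse a Model,Given,Explicit,Implicit CSV into nested dicts of floats."""
    return {
        row["Model"]: {k: float(v) for k, v in row.items() if k != "Model"}
        for row in csv.DictReader(io.StringIO(text))
    }

# One row from each of the two tables above (full tables omitted for brevity).
a = load("Model,Given,Explicit,Implicit\nllama3,0.9,0.9,0.8666666666666667\n")
b = load("Model,Given,Explicit,Implicit\nllama3,0.5,0.5,0.5\n")

# Per-model, per-belief-type accuracy difference between the two tables.
delta = {m: {k: a[m][k] - b[m][k] for k in a[m]} for m in a if m in b}
print(delta["llama3"]["Given"])  # 0.4 for this pair of rows
```

The same `delta` dictionary, built over the full tables, makes the per-belief-type gap between the two conditions immediately visible for every model.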