Commit d5d9cec4 authored by Maxime Morge

PyGAAMAS: Add Qwen3 for strategic reasoning

parent 0f771e33
Showing 867 additions and 37 deletions
@@ -220,32 +220,34 @@ We first evaluate the rationality of the agents and then their second-order rationality.
The table below evaluates each model's ability to generate rational
behaviour for Player 2.

| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** |
|--------------------------|--------------|-----------|--------------|--------------|
| <tt>gpt-4.5</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>mixtral:8x7b</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>mistral-small</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>llama3.3:latest</tt> | strategy | 1.00 | 1.00 | 0.50 |
| <tt>llama3</tt> | strategy | 0.50 | 0.50 | 0.50 |
| <tt>deepseek-r1:7b</tt> | strategy | - | - | - |
| <tt>deepseek-r1</tt> | strategy | - | - | - |
| <tt>qwen3</tt> | strategy | 0.00 | 0.00 | 0.00 |
| **—** | **—** | **—** | **—** | **—** |
| <tt>gpt-4.5</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>mixtral:8x7b</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>mistral-small</tt> | actions | 1.00 | 1.00 | 0.87 |
| <tt>llama3.3:latest</tt> | actions      | 1.00      | 1.00         | 1.00         |
| <tt>llama3</tt>          | actions      | 1.00      | 0.90         | 0.17         |
| <tt>deepseek-r1:7b</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>deepseek-r1</tt> | actions | 0.83 | 0.57 | 0.60 |
| <tt>qwen3</tt> | actions | 1.00 | 0.93 | 0.50 |

When generating strategies, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, and <tt>Mistral-Small</tt>
exhibit rational behaviour, whereas <tt>Llama3</tt> behaves randomly and <tt>Qwen3</tt> is irrational.
<tt>Llama3.3:latest</tt> behaves the same way with implicit beliefs.
<tt>DeepSeek-R1:7b</tt> and <tt>DeepSeek-R1</tt> fail to generate valid strategies.
When generating actions, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, <tt>DeepSeek-R1:7b</tt>,
and <tt>Llama3.3:latest<</tt> demonstrate strong rational decision-making, even with implicit beliefs.
<tt>Mistral-Small</tt> and <tt>Qwen3</tt> perform well but lag in handling implicit reasoning.
<tt>Llama3</tt> struggles with implicit reasoning, while <tt>DeepSeek-R1</tt>
shows inconsistent performance.
Overall, <tt>GPT-4.5</tt> and <tt>Mixtral-8x7B</tt> are the most reliable models for generating rational behavior.
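
Each cell can be read as an accuracy over repeated generations. As a minimal sketch of this scoring (the helper name and the 30-run budget are our assumptions; only the resulting fractions, e.g. 26/30 = 0.8667, appear in the CSV files added by this commit):

```python
# Sketch: score one (model, belief) cell as the fraction of runs in which
# the agent picked the payoff-dominant action.
from collections import Counter

def rationality_score(actions: list[str], dominant: str = "A") -> float:
    """Fraction of runs choosing the dominant action; 0.0 if no valid runs."""
    if not actions:
        return 0.0
    return Counter(actions)[dominant] / len(actions)

# Example: 26 rational choices out of 30 runs.
print(rationality_score(["A"] * 26 + ["B"] * 4))  # 0.8666666666666667
```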
@@ -293,19 +295,23 @@ except (b), where it fails to distinguish the optimal action from a nearly optimal one,
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt>
demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior.
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
<tt>Qwen3</tt> generates irrational strategies. <tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.
When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d),
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to its smallest version,
performs well across most belief structures but exhibits a slight decline in implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> shows lower accuracy overall, particularly for implicit beliefs.
<tt>Qwen3</tt> performs robustly across most belief types, especially in configurations (a) and (b), maintaining
strong scores on both explicit and implicit conditions. However, like other models, it experiences a noticeable
drop in accuracy under implicit beliefs in configuration (d), suggesting sensitivity to deeper inferential reasoning.
It is worth noting that conditional reasoning (CR) is not universally beneficial: while it notably improves reasoning in smaller models
(like <tt>Mistral-Small</tt>, <tt>DeepSeek-R1</tt>, and <tt>Qwen3</tt>), especially under implicit and explicit conditions,
it often harms performance in larger models (e.g., <tt>Llama3.3</tt>, <tt>Mixtral:8x7b</tt>),
where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting
its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively —
@@ -322,6 +328,7 @@ particularly in less confident or under-specified contexts.
| **mistral-small** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **deepseek-r1:7b** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **qwen3** | strategy | - | - | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| | actions + CR | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **llama3.3:latest** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.20 | 1.00 | 1.00 | 0.00 |
@@ -336,9 +343,8 @@ particularly in less confident or under-specified contexts.
| | actions + CR | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | **1.00** | *0.90* | 1.00 | **1.00** | **1.00** | 1.00 | **1.00** |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.56 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
| | actions + CR | 0.80 | **0.63** | **0.60** | 0.67 | **0.63** | **0.70** | 0.67 | **0.70** | **0.50** | *0.63* | **0.76** | **0.70** |
| **qwen3** | actions | 1.00 | 1.00 | 1.00 | 0.90 | 0.96 | 1.00 | 1.00 | 0.96 | 0.70 | 1.00 | 0.96 | 0.46 |
| | actions + CR | 1.00 | 1.00 | 1.00 | **1.00** | **1.00** | 1.00 | *0.96* | **1.00** | **1.00** | *0.96* | 0.96 | **0.83** |
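
The "actions + CR" rows above come from the same agent with conditional reasoning switched on. A minimal sketch of such a comparison, assuming `Ring` and `Belief` are importable from this repository's module and reusing the constructor signature visible at the end of this diff (the specific player, belief type, and model are illustrative):

```python
# Sketch: rerun one cell with and without conditional reasoning (CR).
# Constructor arguments mirror the __main__ example later in this diff;
# Ring and Belief are assumed importable from this repository's module.
import asyncio

for use_cr in (False, True):
    agent = Ring(1, Belief.IMPLICIT, use_conditional_reasoning=use_cr,
                 swap=False, version="d", model="qwen3",
                 temperature=0.7, strategy=False)
    print(f"CR={use_cr}:", asyncio.run(agent.run()))
```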
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
@@ -347,14 +353,6 @@ follows: "Since player 2 is rational and A strictly dominates B, player 2 will
choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
(40). Therefore, choosing Y maximizes my gain."
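
The quoted chain is locally coherent best-response reasoning; the failure lies in the premises extracted from the natural-language payoff description. A toy check (only the payoffs 15 and 40 against A come from the quote; the values against B are placeholders, since variant (d)'s matrix is not shown in this hunk):

```python
# Toy illustration of the quoted best-response reasoning. Payoffs of 15
# (X vs A) and 40 (Y vs A) come from the quote; the values against B are
# placeholders.
payoffs_p1 = {("X", "A"): 15, ("Y", "A"): 40,
              ("X", "B"): 50, ("Y", "B"): 10}

def best_response(predicted_p2: str) -> str:
    """Player 1's payoff-maximising reply to a predicted Player 2 action."""
    return max(("X", "Y"), key=lambda a: payoffs_p1[(a, predicted_p2)])

# The conclusion Y follows validly from these premises; the text above
# attributes the irrationality to misreading the matrix, not to this step.
print(best_response("A"))  # Y
```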
In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with
@@ -363,6 +361,10 @@ given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality but its performance declines with
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> and <tt>Llama3</tt> exhibit
inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
<tt>Qwen3</tt> struggles to generate valid strategies, reflecting limited high-level planning. However, it shows strong
first-order rationality when producing actions, especially under explicit or guided conditions,
and benefits from conditional reasoning. Its performance declines with implicit beliefs, highlighting limitations
in deeper inference.
## Beliefs
@@ -5,3 +5,4 @@ llama3,0.9,0.9,0.8666666666666667
llama3.3:latest,0.975,1.0,0.975
mistral-small,1.0,0.9333333333333333,1.0
mixtral:8x7b,1.0,0.975,1.0
qwen3,1.0,1.0,1.0
@@ -5,3 +5,4 @@ llama3,0.5,0.5,0.5
llama3.3:latest,0.9666666666666667,1.0,0.9666666666666667
mistral-small,0.9666666666666667,0.9666666666666667,0.9
mixtral:8x7b,1.0,1.0,1.0
qwen3,1.0,0.8666666666666667,1.0
@@ -6,3 +6,4 @@ llama3,0.7666666666666667,0.8,0.6
llama3.3:latest,1.0,1.0,0.5
mistral-small,0.8666666666666667,0.7666666666666667,0.6
mixtral:8x7b,1.0,1.0,0.5
qwen3,0.9,0.9666666666666667,1.0
@@ -5,3 +5,4 @@ llama3,0.7666666666666667,0.9666666666666667,0.7
llama3.3:latest,1.0,1.0,0.8
mistral-small,0.9,0.7666666666666667,0.43333333333333335
mixtral:8x7b,1.0,1.0,1.0
qwen3,0.9666666666666667,1.0,1.0
@@ -6,3 +6,4 @@ llama3,0.9666666666666667,0.9,0.9333333333333333
llama3.3:latest,1.0,1.0,0.2
mistral-small,0.7666666666666667,0.6,0.7
mixtral:8x7b,1.0,1.0,1.0
qwen3,1.0,0.9666666666666667,1.0
@@ -5,3 +5,4 @@ llama3,0.6666666666666666,0.8333333333333334,0.6666666666666666
llama3.3:latest,1.0,1.0,0.9
mistral-small,0.6666666666666666,0.4,0.36666666666666664
mixtral:8x7b,1.0,0.8,0.2857142857142857
qwen3,0.9666666666666667,0.9666666666666667,0.8333333333333334
@@ -6,3 +6,4 @@ llama3,0.8333333333333334,0.9,0.6
llama3.3:latest,1.0,1.0,0.0
mistral-small,0.7333333333333333,0.5666666666666667,0.36666666666666664
mixtral:8x7b,1.0,1.0,0.7333333333333333
qwen3,1.0,0.9666666666666667,0.4666666666666667
@@ -6,3 +6,4 @@ llama3,1.0,0.9,0.16666666666666666
llama3.3:latest,1.0,1.0,1.0
mistral-small,1.0,1.0,0.8666666666666667
mixtral:8x7b,1.0,1.0,0.5
qwen3,1.0,0.9333333333333333,0.5
Model,Given,Explicit,Implicit
deepseek-r1,0.8333333333333334,0.5666666666666667,0.6
llama3,1.0,0.9,0.16666666666666666
mistral-small,1.0,1.0,0.8666666666666667
@@ -242,6 +242,13 @@ class Ring:
    elif self.player_id == 2:
        action = self.A
        reasoning = f"Player {self.player_id} always chooses A as per the predefined strategy."
# Predefined strategy hardcoded for qwen3 (added in this commit):
# Player 1 plays Y and Player 2 plays B.
if self.model == "qwen3":
    if self.player_id == 1:
        action = self.Y
        reasoning = f"Player {self.player_id} always chooses Y as per the predefined strategy."
    elif self.player_id == 2:
        action = self.B
        reasoning = f"Player {self.player_id} always chooses B as per the predefined strategy."
if self.model == "deepseek-r1:7b" or self.model == "deepseek-r1":
    raise ValueError("Invalid strategy for deepseek-r1.")
# Validate the rationality of the chosen action
@@ -346,6 +353,6 @@ class Ring:
# Run the async function and return the response
if __name__ == "__main__":
    game_agent = Ring(2, Belief.EXPLICIT, use_conditional_reasoning=False, swap=False, version="d", model="llama3.3:latest", temperature=0.7, strategy=False)  # "llama3.3:latest", "mixtral:8x7b", "deepseek-r1:7b"
    response_json = asyncio.run(game_agent.run())
    print(response_json)
\ No newline at end of file