Commit 78406e1b authored by Maxime Morge

Evaluate second order rationality with Pagoda

parent cdb127b7
@@ -18,7 +18,6 @@ response to other agents’ behaviours.
## Economic Rationality
## Evaluating Economic Rationality in LLMs
To evaluate the economic rationality of various LLMs, we introduce an investment game
designed to test whether these models follow stable decision-making patterns or react
@@ -126,7 +125,8 @@ each corresponding to one of the four preferences:
- The dictator keeps **$325**, the other player receives **$325**, and **$350** is lost (**egalitarian**).
The table below evaluates the ability of the models to align with different preferences (a minimal scoring sketch follows this list).
- When generating **strategies**, the models align perfectly with preferences, except for <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**,
  - <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
  - <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
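
To make the alignment check concrete, the sketch below scores a dictator-game option against each target preference. Only the egalitarian split ($325 / $325, with $350 lost) is taken from the text above; the other option values and the `best_option` helper are hypothetical, and the scoring rules are the standard readings of the four preferences.

```python
# Each option: (dictator's payoff, other player's payoff).
# Only the egalitarian option comes from the text above; the others are hypothetical.
options = {
    "egoistic":    (500, 100),   # hypothetical
    "altruistic":  (100, 500),   # hypothetical
    "utilitarian": (400, 300),   # hypothetical
    "egalitarian": (325, 325),   # from the text above ($350 is lost)
}

# Standard readings of the four preferences.
scores = {
    "egoistic":    lambda own, other: own,                 # maximise own payoff
    "altruistic":  lambda own, other: other,               # maximise the other's payoff
    "utilitarian": lambda own, other: own + other,         # maximise the total payoff
    "egalitarian": lambda own, other: -abs(own - other),   # minimise inequality
}

def best_option(preference: str) -> str:
    """Return the option a player with the given preference should pick."""
    return max(options, key=lambda o: scores[preference](*options[o]))

# A model prompted with a preference counts as aligned when its chosen option
# matches best_option(preference).
for pref in scores:
    print(pref, "->", best_option(pref))
```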
@@ -278,23 +278,40 @@ difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations
except (b), where it fails to distinguish the optimal action from a nearly optimal one. <tt>Llama3</tt> makes decisions randomly,
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt>
demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior.
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
<tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.
When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d),
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to its smallest version,
performs well across most belief structures but exhibits a slight decline in implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> shows lower accuracy overall, particularly for implicit beliefs. An illustrative sketch of the property being tested follows the table below.
| **Version**         |                | **a**     |              |              | **b**     |              |              | **c**     |              |              | **d**     |              |              |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model**           | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **gpt-4.5**         | strategy       | 1.00      | 1.00         | 1.00         | 0.00      | 0.00         | 0.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **llama3.3:latest** | strategy       | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         |
| **llama3**          | strategy       | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         |
| **mixtral:8x7b**    | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **mistral-small**   | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **deepseek-r1:7b**  | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **deepseek-r1**     | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **gpt-4.5**         | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 0.67         | 0.00         | 0.86      | 0.83         | 0.00         | 0.50      | 0.90         | 0.00         |
| **llama3.3:latest** | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.20         | 1.00      | 1.00         | 0.00         |
| **llama3**          | actions        | 0.97      | 1.00         | 1.00         | 0.77      | 0.80         | 0.60         | 0.97      | 0.90         | 0.93         | 0.83      | 0.90         | 0.60         |
| **mixtral:8x7b**    | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.73         |
| **mistral-small**   | actions        | 0.93      | 0.97         | 1.00         | 0.87      | 0.77         | 0.60         | 0.77      | 0.60         | 0.70         | 0.73      | 0.57         | 0.37         |
| **deepseek-r1:7b**  | actions        | 1.00      | 0.96         | 1.00         | 1.00      | 1.00         | 0.93         | 0.96      | 1.00         | 0.92         | 0.96      | 1.00         | 0.79         |
| **deepseek-r1**     | actions        | 0.80      | 0.53         | 0.57         | 0.67      | 0.60         | 0.53         | 0.67      | 0.63         | 0.47         | 0.70      | 0.50         | 0.57         |
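
As an illustration of the property being tested, the sketch below uses a hypothetical two-player payoff structure (the actual matrices of variants (a)–(d) are not reproduced here): Player 2 has a strictly dominant action, and a second-order rational Player 1 best-responds to the action a rational Player 2 would choose.

```python
# Hypothetical payoffs for illustration only; these are not the matrices of
# variants (a)-(d) above.

# Player 2's payoff is assumed to depend only on its own action,
# so "X" is strictly dominant for Player 2.
p2_payoff = {"X": 10, "Y": 5}

# Player 1's payoff depends on both players' actions.
p1_payoff = {("A", "X"): 15, ("A", "Y"): 0,
             ("B", "X"): 5,  ("B", "Y"): 10}

# First-order rationality: Player 2 plays its dominant action.
p2_action = max(p2_payoff, key=p2_payoff.get)                         # -> "X"

# Second-order rationality: Player 1 believes Player 2 is rational
# and best-responds to that dominant action.
p1_action = max(("A", "B"), key=lambda a: p1_payoff[(a, p2_action)])  # -> "A"

print(f"rational Player 2 plays {p2_action}; "
      f"second-order rational Player 1 plays {p1_action}")
```

The accuracies in the "actions" rows above can be read as the fraction of runs in which the model's choice for Player 1 matches this kind of best response.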
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
@@ -311,6 +328,15 @@ with implicit beliefs. Llama3 generates strategies randomly but adapts
better when producing specific actions. In contrast, DeepSeek-R1 fails
to provide valid strategies and generates irrational actions.
In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with
implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds with given or
explicit beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality but its performance declines with
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> and <tt>Llama3</tt> exhibit
inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
## Beliefs
Model,Given,Explicit,Implicit
deepseek-r1,0.8,0.5333333333333333,0.5666666666666667
deepseek-r1:7b,1.0,0.9666666666666667,1.0
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,0.9666666666666667,1.0,1.0
llama3.3:latest,1.0,1.0,1.0
mistral-small,0.9333333333333333,0.9666666666666667,1.0
mixtral:8x7b,1.0,1.0,1.0

Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6,0.5333333333333333
deepseek-r1:7b,1.0,1.0,0.9230769230769231
gpt-4.5-preview-2025-02-27,1.0,0.7666666666666667,0.0
llama3,0.7666666666666667,0.8,0.6
llama3.3:latest,1.0,1.0,0.5
mistral-small,0.8666666666666667,0.7666666666666667,0.6
mixtral:8x7b,1.0,1.0,0.5

Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6333333333333333,0.4666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.9259259259259259
gpt-4.5-preview-2025-02-27,0.8666666666666667,0.8333333333333334,0.0
llama3,0.9666666666666667,0.9,0.9333333333333333
llama3.3:latest,1.0,1.0,0.2
mistral-small,0.7666666666666667,0.6,0.7
mixtral:8x7b,1.0,1.0,1.0

Model,Given,Explicit,Implicit
deepseek-r1,0.7,0.5,0.5666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.7931034482758621
gpt-4.5-preview-2025-02-27,0.5,0.9,0.0
llama3,0.8333333333333334,0.9,0.6
llama3.3:latest,1.0,1.0,0.0
mistral-small,0.7333333333333333,0.5666666666666667,0.36666666666666664
mixtral:8x7b,1.0,1.0,0.7333333333333333
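
The four CSV blocks above each report per-model accuracy for one payoff-matrix variant, broken down by belief type. As a minimal aggregation sketch (assuming the file layout implied by the output_file pattern in the script below, i.e. data/ring/ring.1.a.csv through ring.1.d.csv), the per-variant files could be combined into the summary table with pandas:

```python
import pandas as pd

frames = []
for version in "abcd":
    # columns: Model, Given, Explicit, Implicit
    df = pd.read_csv(f"data/ring/ring.1.{version}.csv")
    df["Version"] = version
    frames.append(df)

# One row per model, one column per (belief type, variant), rounded as in the table above.
summary = (pd.concat(frames)
             .set_index(["Model", "Version"])
             .unstack("Version")
             .round(2))
print(summary)
```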
@@ -81,7 +81,7 @@ if __name__ == "__main__":
    temperature = 0.7
    iterations = 30
    player_id = 1
    version = "d"
    output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
    experiment = RingExperiment(models=models, player_id = player_id, version = version, temperature = temperature, iterations=iterations, output_file = output_file)
    asyncio.run(experiment.run_experiment())
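
Given the output_file pattern above, running this entry point once per version ("a" through "d") would presumably regenerate the four per-variant CSV files shown earlier.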