Commit 78406e1b authored by Maxime Morge

Evaluate second order rationality with Pagoda

parent cdb127b7
@@ -18,7 +18,6 @@ response to other agents’ behaviours.
## Evaluating Economic Rationality in LLMs
To evaluate the economic rationality of various LLMs, we introduce an investment game
designed to test whether these models follow stable decision-making patterns or react
@@ -126,7 +125,8 @@ each corresponding to one of the four preferences:
- The dictator keeps $325, the other player receives $325, and $350 is lost (**egalitarian**).
The table below evaluates the ability of the models to align with these different preferences; a minimal sketch of the alignment check follows the list below.
- When generating **strategies**, the models align perfectly with preferences, except for <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**,
- <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
- <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
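To make the alignment check concrete, here is a minimal Python sketch of how a chosen allocation can be scored against the four preferences. All option amounts except the egalitarian split ($325/$325) are hypothetical placeholders, not the values used in the experiment, and the function names are illustrative only.

```python
# Minimal sketch of the preference-alignment check (illustrative only).
# Each option is a pair (dictator_payoff, receiver_payoff); all amounts are
# hypothetical placeholders except the egalitarian split taken from the text.
OPTIONS = {
    "egoistic":    (500, 100),   # hypothetical: maximises the dictator's own payoff
    "altruistic":  (100, 500),   # hypothetical: maximises the other player's payoff
    "utilitarian": (400, 300),   # hypothetical: maximises the joint payoff
    "egalitarian": (325, 325),   # from the text: $325 each, $350 lost
}

def preferred_option(preference: str) -> tuple[int, int]:
    """Allocation a perfectly aligned dictator should pick for a given preference."""
    if preference == "egoistic":
        return max(OPTIONS.values(), key=lambda o: o[0])
    if preference == "altruistic":
        return max(OPTIONS.values(), key=lambda o: o[1])
    if preference == "utilitarian":
        return max(OPTIONS.values(), key=lambda o: o[0] + o[1])
    if preference == "egalitarian":
        return min(OPTIONS.values(), key=lambda o: abs(o[0] - o[1]))
    raise ValueError(f"unknown preference: {preference}")

def is_aligned(preference: str, chosen: tuple[int, int]) -> bool:
    """True if the model's chosen allocation matches the preference-optimal one."""
    return chosen == preferred_option(preference)
```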
@@ -278,23 +278,40 @@ difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations
except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly,
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt>
demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior.
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
<tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.
When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d),
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to the smaller <tt>DeepSeek-R1</tt>,
performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> shows lower accuracy overall, particularly for implicit beliefs.
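To make the evaluation criterion concrete, the sketch below illustrates the second-order rationality check on a simplified two-player version of the ring game: a first-order rational Player 2 picks its dominant action, and a second-order rational Player 1 best-responds to that choice. The action labels and payoff numbers are placeholders, not those of the experiment.

```python
# Illustrative second-order rationality check for a simplified two-player ring game.
# Payoffs and action labels are placeholders, not those of the experiment.

# Player 2's payoff depends only on its own action, so it has a dominant action.
payoffs_2 = {"X": 10, "Y": 5}

# Player 1's payoff depends on both players' actions.
payoffs_1 = {
    ("A", "X"): 15, ("A", "Y"): 0,
    ("B", "X"): 5,  ("B", "Y"): 10,
}

# First-order rationality: Player 2 maximises its own payoff.
rational_a2 = max(payoffs_2, key=payoffs_2.get)  # -> "X"

# Second-order rationality: Player 1 best-responds to the belief
# that Player 2 plays rationally.
best_a1 = max(("A", "B"), key=lambda a1: payoffs_1[(a1, rational_a2)])  # -> "A"

def is_second_order_rational(chosen_a1: str) -> bool:
    """True if the model's chosen action for Player 1 matches the best response."""
    return chosen_a1 == best_a1
```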

| **Version**         |                | **a**     |              |              | **b**     |              |              | **c**     |              |              | **d**     |              |              |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model**           | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **gpt-4.5**         | strategy       | 1.00      | 1.00         | 1.00         | 0.00      | 0.00         | 0.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **llama3.3:latest** | strategy       | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         |
| **llama3**          | strategy       | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         |
| **mixtral:8x7b**    | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **mistral-small**   | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **deepseek-r1:7b**  | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **deepseek-r1**     | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **gpt-4.5**         | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 0.67         | 0.00         | 0.86      | 0.83         | 0.00         | 0.50      | 0.90         | 0.00         |
| **llama3.3:latest** | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.20         | 1.00      | 1.00         | 0.00         |
| **llama3**          | actions        | 0.97      | 1.00         | 1.00         | 0.77      | 0.80         | 0.60         | 0.97      | 0.90         | 0.93         | 0.83      | 0.90         | 0.60         |
| **mixtral:8x7b**    | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.73         |
| **mistral-small**   | actions        | 0.93      | 0.97         | 1.00         | 0.87      | 0.77         | 0.60         | 0.77      | 0.60         | 0.70         | 0.73      | 0.57         | 0.37         |
| **deepseek-r1:7b**  | actions        | 1.00      | 0.96         | 1.00         | 1.00      | 1.00         | 0.93         | 0.96      | 1.00         | 0.92         | 0.96      | 1.00         | 0.79         |
| **deepseek-r1**     | actions        | 0.80      | 0.53         | 0.57         | 0.67      | 0.60         | 0.53         | 0.67      | 0.63         | 0.47         | 0.70      | 0.50         | 0.57         |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
@@ -311,6 +328,15 @@ with implicit beliefs. Llama3 generates strategies randomly but adapts
better when producing specific actions. In contrast, DeepSeek-R1 fails
to provide valid strategies and generates irrational actions.
In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with
implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds with given or
explicit beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality, but its performance declines with
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> and <tt>Llama3</tt> exhibit
inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
## Beliefs
Model,Given,Explicit,Implicit
deepseek-r1,0.8,0.5333333333333333,0.5666666666666667
deepseek-r1:7b,1.0,0.9666666666666667,1.0
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,0.9666666666666667,1.0,1.0
llama3.3:latest,1.0,1.0,1.0
mistral-small,0.9333333333333333,0.9666666666666667,1.0
mixtral:8x7b,1.0,1.0,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6,0.5333333333333333
deepseek-r1:7b,1.0,1.0,0.9230769230769231
gpt-4.5-preview-2025-02-27,1.0,0.7666666666666667,0.0
llama3,0.7666666666666667,0.8,0.6
llama3.3:latest,1.0,1.0,0.5
mistral-small,0.8666666666666667,0.7666666666666667,0.6
mixtral:8x7b,1.0,1.0,0.5
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6333333333333333,0.4666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.9259259259259259
gpt-4.5-preview-2025-02-27,0.8666666666666667,0.8333333333333334,0.0
llama3,0.9666666666666667,0.9,0.9333333333333333
llama3.3:latest,1.0,1.0,0.2
mistral-small,0.7666666666666667,0.6,0.7
mixtral:8x7b,1.0,1.0,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.7,0.5,0.5666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.7931034482758621
gpt-4.5-preview-2025-02-27,0.5,0.9,0.0
llama3,0.8333333333333334,0.9,0.6
llama3.3:latest,1.0,1.0,0.0
mistral-small,0.7333333333333333,0.5666666666666667,0.36666666666666664
mixtral:8x7b,1.0,1.0,0.7333333333333333
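The four CSV blocks above appear to correspond, in order, to the action-generation accuracies for variants (a) through (d). Below is a small pandas sketch of how such per-variant files can be merged into a single wide table; the `ring.1.{version}.csv` paths follow the `output_file` pattern in the script below and are an assumption here.

```python
# Sketch: merge per-variant accuracy CSVs (action generation, player 1) into one table.
# The file layout is assumed from the output_file pattern in the experiment script.
import pandas as pd

frames = []
for version in ["a", "b", "c", "d"]:
    df = pd.read_csv(f"data/ring/ring.1.{version}.csv").set_index("Model")
    # Prefix the belief columns (Given/Explicit/Implicit) with the variant label.
    df.columns = [f"{version}_{col}" for col in df.columns]
    frames.append(df)

# One row per model, twelve columns (4 variants x 3 belief types), rounded as in the table.
table = pd.concat(frames, axis=1).round(2)
print(table)
```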
@@ -81,7 +81,7 @@ if __name__ == "__main__":
    # Run the ring experiment for variant (d) of the payoff matrix with player 1.
    temperature = 0.7
    iterations = 30
    player_id = 1
    version = "d"
    output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
    experiment = RingExperiment(models=models, player_id=player_id, version=version, temperature=temperature, iterations=iterations, output_file=output_file)
    asyncio.run(experiment.run_experiment())
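As a usage note, the four per-variant CSV files above can be regenerated in a single pass by looping over the variants instead of editing `version` by hand; a minimal sketch reusing the constructor call and the names defined earlier in the script:

```python
# Sketch: run the ring experiment for all four payoff-matrix variants in turn,
# reusing models, player_id, temperature and iterations as defined above.
for version in ["a", "b", "c", "d"]:
    output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
    experiment = RingExperiment(models=models, player_id=player_id, version=version,
                                temperature=temperature, iterations=iterations,
                                output_file=output_file)
    asyncio.run(experiment.run_experiment())
```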