Commit 78406e1b authored by Maxime Morge

Evaluate second order rationality with Pagoda

parent cdb127b7
@@ -18,7 +18,6 @@ response to other agents’ behaviours.
## Evaluating Economic Rationality in LLMs
To evaluate the economic rationality of various LLMs, we introduce an investment game
designed to test whether these models follow stable decision-making patterns or react
@@ -126,7 +125,8 @@ each corresponding to one of the four preferences:
- The dictator keeps $325, the other player receives $325, and $350 is lost (**egalitarian**).
The table below evaluates the ability of the models to align with these different preferences; a minimal sketch of the alignment check follows the list below.
- When generating **strategies**, the models align perfectly with preferences, except for <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**,
- <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
- <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
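To make the alignment check concrete, here is a minimal Python sketch of how a chosen allocation can be scored against the four preferences. All option amounts except the egalitarian split ($325/$325) are hypothetical placeholders, not the values used in the experiment, and the function names are illustrative only.

```python
# Minimal sketch of the preference-alignment check (illustrative only).
# Each option is a pair (dictator_payoff, receiver_payoff); all amounts are
# hypothetical placeholders except the egalitarian split taken from the text.
OPTIONS = {
    "egoistic":    (500, 100),   # hypothetical: maximises the dictator's own payoff
    "altruistic":  (100, 500),   # hypothetical: maximises the other player's payoff
    "utilitarian": (400, 300),   # hypothetical: maximises the joint payoff
    "egalitarian": (325, 325),   # from the text: $325 each, $350 lost
}

def preferred_option(preference: str) -> tuple[int, int]:
    """Allocation a perfectly aligned dictator should pick for a given preference."""
    if preference == "egoistic":
        return max(OPTIONS.values(), key=lambda o: o[0])
    if preference == "altruistic":
        return max(OPTIONS.values(), key=lambda o: o[1])
    if preference == "utilitarian":
        return max(OPTIONS.values(), key=lambda o: o[0] + o[1])
    if preference == "egalitarian":
        return min(OPTIONS.values(), key=lambda o: abs(o[0] - o[1]))
    raise ValueError(f"unknown preference: {preference}")

def is_aligned(preference: str, chosen: tuple[int, int]) -> bool:
    """True if the model's chosen allocation matches the preference-optimal one."""
    return chosen == preferred_option(preference)
```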
@@ -278,23 +278,40 @@ difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations
except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly,
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt>
demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior.
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
<tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.
When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d),
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to the smaller <tt>DeepSeek-R1</tt>,
performs well across most belief structures but exhibits a slight decline with implicit beliefs, especially in (d).
Meanwhile, <tt>DeepSeek-R1</tt> shows lower accuracy overall, particularly for implicit beliefs.
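To make the evaluation criterion concrete, the sketch below illustrates the second-order rationality check on a simplified two-player version of the ring game: a first-order rational Player 2 picks its dominant action, and a second-order rational Player 1 best-responds to that choice. The action labels and payoff numbers are placeholders, not those of the experiment.

```python
# Illustrative second-order rationality check for a simplified two-player ring game.
# Payoffs and action labels are placeholders, not those of the experiment.

# Player 2's payoff depends only on its own action, so it has a dominant action.
payoffs_2 = {"X": 10, "Y": 5}

# Player 1's payoff depends on both players' actions.
payoffs_1 = {
    ("A", "X"): 15, ("A", "Y"): 0,
    ("B", "X"): 5,  ("B", "Y"): 10,
}

# First-order rationality: Player 2 maximises its own payoff.
rational_a2 = max(payoffs_2, key=payoffs_2.get)  # -> "X"

# Second-order rationality: Player 1 best-responds to the belief
# that Player 2 plays rationally.
best_a1 = max(("A", "B"), key=lambda a1: payoffs_1[(a1, rational_a2)])  # -> "A"

def is_second_order_rational(chosen_a1: str) -> bool:
    """True if the model's chosen action for Player 1 matches the best response."""
    return chosen_a1 == best_a1
```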

| **Version**         |                | **a**     |              |              | **b**     |              |              | **c**     |              |              | **d**     |              |              |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model**           | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **gpt-4.5**         | strategy       | 1.00      | 1.00         | 1.00         | 0.00      | 0.00         | 0.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **llama3.3:latest** | strategy       | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         |
| **llama3**          | strategy       | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         |
| **mixtral:8x7b**    | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **mistral-small**   | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **deepseek-r1:7b**  | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **deepseek-r1**     | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **gpt-4.5**         | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 0.67         | 0.00         | 0.86      | 0.83         | 0.00         | 0.50      | 0.90         | 0.00         |
| **llama3.3:latest** | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.20         | 1.00      | 1.00         | 0.00         |
| **llama3**          | actions        | 0.97      | 1.00         | 1.00         | 0.77      | 0.80         | 0.60         | 0.97      | 0.90         | 0.93         | 0.83      | 0.90         | 0.60         |
| **mixtral:8x7b**    | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.73         |
| **mistral-small**   | actions        | 0.93      | 0.97         | 1.00         | 0.87      | 0.77         | 0.60         | 0.77      | 0.60         | 0.70         | 0.73      | 0.57         | 0.37         |
| **deepseek-r1:7b**  | actions        | 1.00      | 0.96         | 1.00         | 1.00      | 1.00         | 0.93         | 0.96      | 1.00         | 0.92         | 0.96      | 1.00         | 0.79         |
| **deepseek-r1**     | actions        | 0.80      | 0.53         | 0.57         | 0.67      | 0.60         | 0.53         | 0.67      | 0.63         | 0.47         | 0.70      | 0.50         | 0.57         |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
@@ -311,6 +328,15 @@ with implicit beliefs. Llama3 generates strategies randomly but adapts
better when producing specific actions. In contrast, DeepSeek-R1 fails
to provide valid strategies and generates irrational actions.
In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with
implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds with given or
explicit beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality, but its performance declines with
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> and <tt>Llama3</tt> exhibit
inconsistent and often irrational decision-making, failing to generate valid strategies in many cases.
## Beliefs
Model,Given,Explicit,Implicit
deepseek-r1,0.8,0.5333333333333333,0.5666666666666667
deepseek-r1:7b,1.0,0.9666666666666667,1.0
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,0.9666666666666667,1.0,1.0
llama3.3:latest,1.0,1.0,1.0
mistral-small,0.9333333333333333,0.9666666666666667,1.0
mixtral:8x7b,1.0,1.0,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6,0.5333333333333333
deepseek-r1:7b,1.0,1.0,0.9230769230769231
gpt-4.5-preview-2025-02-27,1.0,0.7666666666666667,0.0
llama3,0.7666666666666667,0.8,0.6
llama3.3:latest,1.0,1.0,0.5
mistral-small,0.8666666666666667,0.7666666666666667,0.6
mixtral:8x7b,1.0,1.0,0.5
Model,Given,Explicit,Implicit
deepseek-r1,0.6666666666666666,0.6333333333333333,0.4666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.9259259259259259
gpt-4.5-preview-2025-02-27,0.8666666666666667,0.8333333333333334,0.0
llama3,0.9666666666666667,0.9,0.9333333333333333
llama3.3:latest,1.0,1.0,0.2
mistral-small,0.7666666666666667,0.6,0.7
mixtral:8x7b,1.0,1.0,1.0
Model,Given,Explicit,Implicit
deepseek-r1,0.7,0.5,0.5666666666666667
deepseek-r1:7b,0.9666666666666667,1.0,0.7931034482758621
gpt-4.5-preview-2025-02-27,0.5,0.9,0.0
llama3,0.8333333333333334,0.9,0.6
llama3.3:latest,1.0,1.0,0.0
mistral-small,0.7333333333333333,0.5666666666666667,0.36666666666666664
mixtral:8x7b,1.0,1.0,0.7333333333333333
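The four CSV blocks above appear to correspond, in order, to the action-generation accuracies for variants (a) through (d). Below is a small pandas sketch of how such per-variant files can be merged into a single wide table; the `ring.1.{version}.csv` paths follow the `output_file` pattern in the script below and are an assumption here.

```python
# Sketch: merge per-variant accuracy CSVs (action generation, player 1) into one table.
# The file layout is assumed from the output_file pattern in the experiment script.
import pandas as pd

frames = []
for version in ["a", "b", "c", "d"]:
    df = pd.read_csv(f"data/ring/ring.1.{version}.csv").set_index("Model")
    # Prefix the belief columns (Given/Explicit/Implicit) with the variant label.
    df.columns = [f"{version}_{col}" for col in df.columns]
    frames.append(df)

# One row per model, twelve columns (4 variants x 3 belief types), rounded as in the table.
table = pd.concat(frames, axis=1).round(2)
print(table)
```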
@@ -81,7 +81,7 @@ if __name__ == "__main__":
    # Run the ring experiment for variant (d) of the payoff matrix with player 1.
    temperature = 0.7
    iterations = 30
    player_id = 1
    version = "d"
    output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
    experiment = RingExperiment(models=models, player_id=player_id, version=version, temperature=temperature, iterations=iterations, output_file=output_file)
    asyncio.run(experiment.run_experiment())
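As a usage note, the four per-variant CSV files above can be regenerated in a single pass by looping over the variants instead of editing `version` by hand; a minimal sketch reusing the constructor call and the names defined earlier in the script:

```python
# Sketch: run the ring experiment for all four payoff-matrix variants in turn,
# reusing models, player_id, temperature and iterations as defined above.
for version in ["a", "b", "c", "d"]:
    output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
    experiment = RingExperiment(models=models, player_id=player_id, version=version,
                                temperature=temperature, iterations=iterations,
                                output_file=output_file)
    asyncio.run(experiment.run_experiment())
```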