Commit cdb127b7 authored by Maxime Morge

Evaluate first order rationality with Pagoda

parent 3f4e0626
......@@ -120,19 +120,19 @@ We define four preferences for the dictator, each corresponding to a distinct fo
We consider four allocation options where part of the money is lost in the division process,
each corresponding to one of the four preferences:
- The dictator keeps **$500**, the recipient receives **$100**, and a total of **$400** is lost (**egoistic**).
- The dictator keeps **$100**, the recipient receives **$500**, and **$400** is lost (**altruistic**).
- The dictator keeps **$400**, the recipient receives **$300**, resulting in a loss of **$300** (**utilitarian**).
- The dictator keeps **$325**, the other player receives **$325**, and **$350** is lost (**egalitarian**).
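For concreteness, each preference selects the option that maximises the corresponding criterion (the dictator's own payoff, the recipient's payoff, the total payoff, or the equality of the split). The sketch below, with the allocations above hard-coded, checks this mapping; it is illustrative and not part of the evaluation code.

```python
# Allocation options as (dictator, recipient) pairs, in dollars.
options = {
    "egoistic": (500, 100),
    "altruistic": (100, 500),
    "utilitarian": (400, 300),
    "egalitarian": (325, 325),
}

criteria = {
    "egoistic": lambda d, r: d,               # maximise the dictator's payoff
    "altruistic": lambda d, r: r,             # maximise the recipient's payoff
    "utilitarian": lambda d, r: d + r,        # maximise the total payoff
    "egalitarian": lambda d, r: -abs(d - r),  # maximise equality of the split
}

for preference, score in criteria.items():
    best = max(options, key=lambda name: score(*options[name]))
    assert best == preference, (preference, best)
    print(f"{preference}: option {options[best]}")
```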
The table below evaluates the ability of the models to align with different preferences.
- When generating **strategies**, the models align perfectly with preferences, except for <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code.
- When generating **actions**,
- <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
- <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
- <tt>Mistral-Small</tt> aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
- <tt>DeepSeek-R1</tt> primarily aligns with **utilitarianism** but has low accuracy in other preferences.
While a larger LLM typically aligns better with preferences, a model like <tt>Mixtral-8x7B</tt> may occasionally
underperform compared to its smaller counterpart, <tt>Mistral-Small</tt>, due to its architectural complexity.
Mixture-of-Experts (MoE) models, like Mixtral, dynamically activate only a subset of their parameters.
If the routing mechanism isn’t well-tuned, it might select less optimal experts, leading to degraded performance.
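As an illustration of this routing mechanism (not Mixtral's actual implementation), the toy layer below gates each token to its top-k experts; if the gating scores favour poorly suited experts, tokens are processed by weaker sub-networks, which can degrade output quality. Dimensions and layer shapes are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKMoE(nn.Module):
    """Illustrative top-k Mixture-of-Experts layer (toy sizes, not Mixtral's code)."""
    def __init__(self, d_model=16, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)  # routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                           # x: (tokens, d_model)
        scores = self.gate(x)                       # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # keep only k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e            # tokens routed to expert e in this slot
                w = weights[mask, slot].unsqueeze(1)
                out[mask] += w * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(5, 16)
print(ToyTopKMoE()(tokens).shape)  # torch.Size([5, 16])
```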
......@@ -213,26 +213,35 @@ We first evaluate the rationality of the agents and then their second-order rati
The table below evaluates the models’ ability to generate rational behaviour for Player 2.
| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** |
|-------------------|--------------|-----------|--------------|--------------|
| <tt>gpt-4.5</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>mixtral:8x7b</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>mistral-small</tt> | strategy | 1.00 | 1.00 | 1.00 |
| <tt>llama3.3:latest</tt> | strategy | 1.00 | 1.00 | 0.50 |
| <tt>llama3</tt> | strategy | 0.50 | 0.50 | 0.50 |
| <tt>deepseek-r1:7b</tt> | strategy | - | - | - |
| <tt>deepseek-r1</tt> | strategy | - | - | - |
| **—** | **—** | **—** | **—** | **—** |
| <tt>gpt-4.5</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>mixtral:8x7b</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>mistral-small</tt> | actions | 1.00 | 1.00 | 0.87 |
| <tt>llama3.3:latest</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>llama3</tt> | actions | 1.00 | 0.90 | 0.17 |
| <tt>deepseek-r1:7b</tt> | actions | 1.00 | 1.00 | 1.00 |
| <tt>deepseek-r1</tt> | actions | 0.83 | 0.57 | 0.60 |
When generating strategies, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, and <tt>Mistral-Small</tt>
exhibit rational behaviour, whereas <tt>Llama3</tt> adopts a random strategy.
<tt>Llama3.3:latest</tt> also falls back to a random strategy when beliefs are implicit.
<tt>DeepSeek-R1:7b</tt> and <tt>DeepSeek-R1</tt> fail to generate valid strategies.
When generating actions, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, <tt>DeepSeek-R1:7b</tt>,
and <tt>Llama3.3:latest</tt> demonstrate strong rational decision-making, even with implicit beliefs.
<tt>Mistral-Small</tt> performs well but slightly lags in handling implicit reasoning.
<tt>Llama3</tt> struggles with implicit reasoning, while <tt>DeepSeek-R1</tt>
shows inconsistent performance.
Overall, <tt>GPT-4.5</tt> and <tt>Mixtral-8x7B</tt> are the most reliable models for generating rational behaviour.
### Second-Order Rationality
......@@ -269,17 +278,23 @@ difficulties with implicit beliefs, especially in variant (d).
DeepSeek-R1 does not appear to be a good candidate for simulating
second-order rationality.
| **Version** | | **a** | | | **b** | | | **c** | | | **d** | | |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model** | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **gpt-4.5** | strategy | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **llama3.3:latest** | strategy | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 | 1.00 | 1.00 | 0.50 |
| **llama3** | strategy | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
| **mixtral:8x7b** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **mistral-small** | strategy | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| **deepseek-r1:7b** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **deepseek-r1** | strategy | - | - | - | - | - | - | - | - | - | - | - | - |
| **gpt-4.5** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 0.67 | 0.00 | 0.86 | 0.83 | 0.00 | 0.50 | 0.90 | 0.00 |
| **llama3.3:latest** | actions | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **llama3** | actions | 0.97 | 1.00 | 1.00 | 0.77 | 0.80 | 0.60 | 0.97 | 0.90 | 0.93 | 0.83 | 0.90 | 0.60 |
| **mixtral:8x7b** | actions | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **mistral-small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| **deepseek-r1:7b** | actions | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO | TODO |
| **deepseek-r1** | actions | 0.80 | 0.53 | 0.57 | 0.67 | 0.60 | 0.53 | 0.67 | 0.63 | 0.47 | 0.70 | 0.50 | 0.57 |
Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
......
Model,Given,Explicit,Implicit
deepseek-r1,0.8333333333333334,0.5666666666666667,0.6
deepseek-r1:7b,1.0,1.0,1.0
gpt-4.5-preview-2025-02-27,1.0,1.0,1.0
llama3,1.0,0.9,0.16666666666666666
llama3.3:latest,1.0,1.0,1.0
mistral-small,1.0,1.0,0.8666666666666667
mixtral:8x7b,1.0,1.0,0.5
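The per-model scores in this results file can be tabulated directly; below is a minimal sketch (the file path is illustrative, and `to_markdown` requires the `tabulate` package).

```python
import pandas as pd

# Illustrative path: adjust to wherever this results CSV is stored in the repository.
df = pd.read_csv("../../data/ring/ring.2.csv")

# Round the rationality scores and print a Markdown table matching the
# layout of the tables shown above.
print(df.round(2).to_markdown(index=False))
```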
import os
import asyncio
from typing import Dict, Literal
import json
import random
import re
import logging
import requests
from pydantic import BaseModel
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient
from belief import Belief
logger = logging.getLogger(__name__)
# Load API key from environment variable
# Load API keys from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PAGODA_API_KEY = os.getenv("PAGODA_API_KEY")
if not OPENAI_API_KEY:
raise ValueError("Missing OPENAI_API_KEY. Set it as an environment variable.")
if not PAGODA_API_KEY:
raise ValueError("Missing PAGODA_API_KEY. Set it as an environment variable.")
# Define the expected response format as a Pydantic model
class AgentResponse(BaseModel):
......@@ -42,7 +44,15 @@ class Ring:
self.max_retries = max_retries # Maximum retry attempts in case of hallucinations
is_openai_model = model.startswith("gpt")
base_url = "https://api.openai.com/v1" if is_openai_model else "http://localhost:11434/v1"
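# Models served via the Pagoda-hosted Ollama gateway are referenced with an
# Ollama-style tag (e.g. "mixtral:8x7b"), so a ":" in the name is used as the routing heuristic.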
is_pagoda_model = ":" in model
self.base_url = (
"https://api.openai.com/v1" if is_openai_model else
"https://ollama-ui.pagoda.liris.cnrs.fr/ollama/api/generate" if is_pagoda_model else
"http://localhost:11434/v1"
)
key = OPENAI_API_KEY if is_openai_model else PAGODA_API_KEY
model_info = {
"temperature": self.temperature,
......@@ -55,7 +65,7 @@ class Ring:
self.model_client = OpenAIChatCompletionClient(
model=self.model,
base_url=base_url,
base_url=self.base_url,
api_key=OPENAI_API_KEY,
model_info=model_info,
response_format=AgentResponse
......@@ -116,6 +126,10 @@ class Ring:
if self.debug:
print(instruction)
is_pagoda_model = ":" in self.model
if is_pagoda_model:
return await self.run_pagoda(instruction)
for attempt in range(self.max_retries):
agent = AssistantAgent(
name="Player",
......@@ -155,6 +169,9 @@ class Ring:
def apply_strategy(self) -> Dict[str, str]:
"""Applies a heuristic-based strategy instead of relying on the model if strategy is enabled."""
# Set default values to avoid unbound variable errors
action = "X" # Default action (can be changed based on conditions)
reasoning = "Default reasoning. No specific model-based rule applied."
if self.model == "gpt-4.5-preview-2025-02-27":
if self.strategy:
if self.player_id == 2:
......@@ -163,6 +180,34 @@ class Ring:
else:
action = self.X if self.version in ["a", "c", "d"] else self.Y
reasoning = f"Choosing {action} based on the given game structure and expected rational behavior from Player 2."
if self.model == "llama3.3:latest":
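# Payoffs used by the heuristic, per game variant: X given A, X given B, Y given A, Y given B.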
XknowingA, XknowingB, YknowingA, YknowingB = (
(15, 5, 0, 10) if self.version == "a" else
(8, 7, 7, 8) if self.version == "b" else
(6, 5, 0, 10) if self.version == "c" else
(15, 5, 0, 40)
)
if self.belief == Belief.IMPLICIT:
if self.player_id == 1:
action = self.X if random.random() < 0.5 else self.Y
reasoning = "Choosing randomly between X and Y since it's an implicit game."
elif self.player_id == 2:
action = self.A if random.random() < 0.5 else self.B
reasoning = "Choosing randomly between A and B since it's an implicit game."
elif self.belief == Belief.EXPLICIT:
if self.player_id == 1:
action = self.X if XknowingA > YknowingA else self.Y
reasoning = f"Choosing {action} since it has a higher payoff ({XknowingA} vs {YknowingA})."
elif self.player_id == 2:
action = self.A if XknowingA + YknowingB > XknowingB + YknowingA else self.B
reasoning = f"Choosing {action} since it has a higher total payoff ({XknowingA + YknowingB} vs {XknowingB + YknowingA})."
if self.belief == Belief.GIVEN:
if self.player_id == 1:
action = self.X
reasoning = "Choosing X since Player 2 must choose A if she is rational."
elif self.player_id == 2:
action = self.A
reasoning = "Choosing A since I am rational and it's the dominant strategy."
if self.model == "llama3":
if self.player_id == 1:
action = self.X if random.random() < 0.5 else self.Y
......@@ -170,14 +215,16 @@ class Ring:
elif self.player_id == 2:
action = self.B if random.random() < 0.5 else self.A
reasoning = "The reasoning behind this choice is..."
if self.model == "mistral-small":
if self.model == "mistral-small" or self.model == "mixtral:8x7b":
#Always choose 'A' or 'X' based on player_id
if self.player_id == 1:
action = "X"
action = self.X
reasoning = f"Player {self.player_id} always chooses X as per the predefined strategy."
elif self.player_id == 2:
action = "B"
action = self.A
reasoning = f"Player {self.player_id} always chooses B as per the predefined strategy."
if self.model == "deepseek-r1:7b" or self.model == "deepseek-r1":
raise ValueError("Invalid strategy for deepseek-r1.")
# Validate the rationality of the chosen action
rational = 1.0 if self.check_rationality(AgentResponse(action=action, reasoning=reasoning)) else 0.0
return {
......@@ -186,9 +233,100 @@ class Ring:
"reasoning": reasoning
}
async def run_pagoda(self, instruction) -> Dict:
url = self.base_url
headers = {"Authorization": f"Bearer {PAGODA_API_KEY}", "Content-Type": "application/json"}
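# Non-streaming call to the Ollama-style generate endpoint; the model's raw text
# is returned in the "response" field of the JSON body.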
payload = {
"model": self.model,
"temperature": self.temperature,
"prompt": instruction,
"stream": False
}
for attempt in range(self.max_retries):
try:
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
response_data = response.json()
if self.debug:
print(f"Raw response (Attempt {attempt + 1}): {response_data}")
# Extract JSON response field
response_json = response_data.get('response', '')
parsed_response = self.extract_json_from_response(response_json)
if not parsed_response:
print(f"Failed to extract JSON from response (Attempt {attempt + 1}): {response_json}")
continue
# Validate extracted response
required_keys = {'action', 'reasoning'}
if not required_keys.issubset(parsed_response.keys()):
print(f"Missing required keys in response (Attempt {attempt + 1}): {parsed_response}")
continue
action, reasoning = (
parsed_response["action"],
parsed_response["reasoning"]
)
rational = 1.0 if self.check_rationality(AgentResponse(action=action, reasoning=reasoning)) else 0.0
return {
"action": action,
"rationality": rational,
"reasoning": reasoning
}
except requests.RequestException as e:
print(f"Request error (Attempt {attempt + 1}): {e}")
except json.JSONDecodeError as e:
print(f"JSON decoding error (Attempt {attempt + 1}): {e}")
except Exception as e:
print(f"Unexpected error (Attempt {attempt + 1}): {e}")
raise ValueError("Pagoda model failed to provide a valid response after multiple attempts.")
def extract_json_from_response(self, response_text: str) -> dict:
"""Extracts and parses JSON from a model response, handling escaping issues."""
try:
# Normalize escaped underscores
cleaned_text = response_text.strip().replace('\\_', '_')
# Direct JSON parsing if response is already valid JSON
if cleaned_text.startswith("{") and cleaned_text.endswith("}"):
return json.loads(cleaned_text)
# Try extracting JSON from Markdown-style code blocks
json_match = re.search(r"```json\s*([\s\S]*?)\s*```", cleaned_text)
if json_match:
json_str = json_match.group(1).strip()
else:
# Try extracting any JSON-like substring
json_match = re.search(r"\{[\s\S]*?\}", cleaned_text)
if json_match:
json_str = json_match.group(0).strip()
else:
logger.warning("No JSON found in response: %s", response_text)
return {}
# Parse the extracted JSON
parsed_json = json.loads(json_str)
# Validate expected keys
expected_keys = {"action", "reasoning"}
if not expected_keys.issubset(parsed_json.keys()):
logger.warning("Missing required keys in parsed JSON: %s", parsed_json)
return {}
return parsed_json
except json.JSONDecodeError as e:
logger.error("Failed to parse extracted JSON: %s | Error: %s", response_text, e)
return {}
# Run the async function and return the response
if __name__ == "__main__":
game_agent = Ring(1, Belief.IMPLICIT, swap = True, version="b", model="mistral-small", temperature=0.7, strategy = True)
game_agent = Ring(1, Belief.EXPLICIT, swap = False, version="d", model="llama3.3:latest", temperature=0.7, strategy = True)  # "llama3.3:latest", "mixtral:8x7b", "deepseek-r1:7b"
response_json = asyncio.run(game_agent.run())
print(response_json)
\ No newline at end of file
......@@ -77,11 +77,11 @@ class RingExperiment:
# Running the experiment
if __name__ == "__main__":
models = ["llama3", "mistral-small", "deepseek-r1"] # gpt-4.5-preview-2025-02-27 can be added to the list
models = ["llama3.3:latest", "deepseek-r1:7b", "mixtral:8x7b"] # "gpt-4.5-preview-2025-02-27", "llama3", "mistral-small", "deepseek-r1"
temperature = 0.7
iterations = 30
player_id = 1
version = "d"
version = "a"
output_file = f"../../data/ring/ring.{player_id}.{version}.csv"
experiment = RingExperiment(models=models, player_id = player_id, version = version, temperature = temperature, iterations=iterations, output_file = output_file)
asyncio.run(experiment.run_experiment())
......