Commit 398a2554 authored by Maxime Morge

PyGAAMAS: Evaluation of the responder in the ultimatum game

parent e5b0d13a
endowment (e.g., a sum of money) between themselves and a second player, the responder. However,
unlike in the dictator game, the responder plays an active role: they can either accept or reject
the proposed allocation. If the offer is rejected, both players receive nothing.

First, we evaluate the choices made by LLMs when playing the role of the proposer, interpreting these
decisions as a reflection of their implicit social norms or strategic preferences, especially when
anticipating potential rejection by the responder. Oosterbeek et al. (2004) find that, on average,
the proposer offers 40% of the pie to the responder.

Oosterbeek, H., Sloof, R., & Van De Kuilen, G. (2004). *Cultural differences in ultimatum game
experiments: Evidence from a meta-analysis*. Experimental Economics, 7, 171–188.
[https://doi.org/10.1023/B:EXEC.0000026978.14316.74](https://doi.org/10.1023/B:EXEC.0000026978.14316.74)

The figure below presents a violin plot illustrating the share of the total amount (\$100)
that the proposer allocates to themselves for each model. The share selected by the strategies
generated by <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and <tt>Qwen3</tt> aligns with the median
share chosen by the actions generated by <tt>Mistral-Small</tt>, <tt>Mixtral:8x7B</tt>, and
<tt>DeepSeek-R1:7B</tt>, around \$50, likely reflecting corpus-based biases such as term frequency.
The share selected by the strategies generated by <tt>Llama3.3</tt> and <tt>DeepSeek-R1:7B</tt>
resembles the median share of the actions generated by <tt>GPT-4.5</tt> and <tt>Llama3</tt>,
around \$60, which is consistent with what human participants typically choose under similar conditions.
While the shares selected by the strategies from <tt>GPT-4.5</tt> and <tt>Mixtral:8x7B</tt> are
respectively overestimated and underestimated, the actions generated by <tt>DeepSeek-R1:7B</tt>
and <tt>Qwen3</tt> can be considered irrational.

![Violin Plot of My Share for Each Model](figures/ultimatum/proposer_violin.svg)
Second, we analyze the behavior of LLMs when assuming the role of the responder,
focusing on whether their acceptance or rejection of offers reveals a human-like sensitivity to unfairness.
The meta-analysis by Oosterbeek et al. (2004) reports that human participants reject on average
16% of offers, even though offers average 40% of the total stake. This finding suggests that factors
beyond purely economic self-interest, such as fairness concerns or the desire to punish perceived
injustice, significantly influence decision-making.

The figure below presents a violin plot illustrating the acceptance rate of the responder for each
model when offered \$40 out of \$100. While the median acceptance rate of the responses generated by
<tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Llama3.3</tt>, <tt>Mixtral:8x7B</tt>, <tt>DeepSeek-R1:7B</tt>,
and <tt>Qwen3</tt> is 1.0, the median acceptance rate for <tt>Mistral-Small</tt> and <tt>DeepSeek-R1</tt> is 0.0.
It is worth noting that these results are not necessarily consistent with the strategies generated
by the same models. For instance, the strategy of <tt>GPT-4.5</tt> accepts offers as low as 20%,
interpreting them as minimally fair, while <tt>Mistral-Small</tt> employs a tiered strategy that
consistently accepts only offers of 50% or more and randomly accepts those between 25% and 49%.
Models like <tt>Llama3</tt>, <tt>DeepSeek-R1</tt>, and <tt>Qwen3</tt> exhibit rigid fairness
thresholds, rejecting any offer below 50%. <tt>Llama3.3</tt> uses a slightly more permissive
threshold of 30%, leading to greater acceptance of lower offers. These results suggest that most
LLMs do not capture the influence of perceived injustice that shapes human decision-making in the
ultimatum game. A condensed sketch of these rule-based strategies follows the figure below.
![Violin Plot of Acceptance Rate for Each Model](figures/ultimatum/responder_violin.svg)
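
The per-model acceptance rules described above can be condensed into a few lines of Python. The sketch below is only an illustrative summary of those rules, not the verbatim strategies the models generated; the helper `accept_offer` and its signature are ours.

```python
import random

def accept_offer(model: str, offer: float, amount: float) -> bool:
    """Illustrative acceptance rules mirroring the strategies described above."""
    ratio = offer / amount
    if model == "gpt-4.5":
        return ratio >= 0.20                   # accepts offers as low as 20%
    if model == "mistral-small":
        if ratio >= 0.50:                      # consistently accepts fair or generous offers
            return True
        if ratio < 0.25:                       # rejects very low offers
            return False
        return random.random() < 0.5           # random between 25% and 49%
    if model in ("llama3", "deepseek-r1", "qwen3"):
        return ratio >= 0.50                   # rigid fairness threshold
    if model == "llama3.3":
        return ratio >= 0.30                   # more permissive threshold
    raise ValueError(f"No rule defined for {model}")
```

For the \$40 offer used in the experiment (a ratio of 0.4), these rules accept for <tt>GPT-4.5</tt> and <tt>Llama3.3</tt>, reject for <tt>Llama3</tt>, <tt>DeepSeek-R1</tt>, and <tt>Qwen3</tt>, and flip a coin for <tt>Mistral-Small</tt>, which illustrates the gap between the generated strategies and the one-shot responses plotted above.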
## Strategic Rationality
if __name__ == "__main__":
    models = ["qwen3"]
    # Other options: "gpt-4.5-preview-2025-02-27", "llama3", "mistral-small", "deepseek-r1", "qwen3", "mixtral:8x7b", "llama3.3:latest", "deepseek-r1:7b"
    temperature = 0.7
    amount = 100
    iterations = 30
import os
import asyncio
import json
import random
import re
import requests
from typing import Dict, Literal
from pydantic import BaseModel
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import TextMessage
from autogen_core import CancellationToken
from autogen_ext.models.openai import OpenAIChatCompletionClient
# Load API keys from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
raise ValueError("Missing OPENAI_API_KEY. Set it as an environment variable.")
PAGODA_API_KEY = os.getenv("PAGODA_API_KEY")
if not PAGODA_API_KEY:
raise ValueError("Missing PAGODA_API_KEY. Set it as an environment variable.")
# Define the expected response format as a Pydantic model
class AgentResponse(BaseModel):
response: Literal["Accept", "Reject"]
reasoning: str
# The responder agent in the ultimatum game simulation
class Responder:
def __init__(self, amount: int, offer: int, model: str, temperature: float, strategy=False, max_retries: int = 3):
self.debug = False
self.amount = amount
self.offer = offer
self.model = model
self.temperature = temperature
self.strategy = strategy
self.max_retries = max_retries
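        # Route requests by model name: "gpt*" models go to the OpenAI API, models with a
        # ":" tag (e.g., "mixtral:8x7b") go to the Pagoda Ollama gateway, and anything else
        # falls back to a local Ollama server.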
is_openai_model = model.startswith("gpt")
is_pagoda_model = ":" in model
base_url = (
"https://api.openai.com/v1" if is_openai_model else
"https://ollama-ui.pagoda.liris.cnrs.fr/ollama/api/generate" if is_pagoda_model else
"http://localhost:11434/v1"
)
self.base_url = base_url
if self.debug:
print(f"Using model: {model}")
print(f"Using base URL: {base_url}")
key = OPENAI_API_KEY if is_openai_model else PAGODA_API_KEY
model_info = {
"temperature": self.temperature,
"function_calling": True,
"parallel_tool_calls": False,
"family": "unknown",
"json_output": True,
"vision": False
}
self.model_client = OpenAIChatCompletionClient(
model=self.model,
base_url=base_url,
api_key=key,
model_info=model_info,
response_format=AgentResponse
)
async def run(self) -> Dict:
"""Runs the model if strategy is False, otherwise uses a classical method."""
if self.strategy:
return self.apply_strategy()
instruction = f"""
You participate in an economics experiment. In this experiment each of you will
be paired with a different person. You will not be told who these people are
either during or after the experiment, and they will not be told who you are
either during or after the experiment.
The experiment is conducted as follows: A sum of ${self.amount} has been provisionally
allocated to each pair and the person A can propose how much of this each person
is to receive. A proposal form will then be sent to the person B.
The person B will then be given a chance to accept or reject the proposal. If
the person B accepts the proposal, then the amount of money will be divided as
specified in the proposal. If the person B rejects the proposal, then both
people in the pair receive zero. A response form will be sent to the person A.
You are person B. Person A's offer is ${self.offer}.
Return your response in JSON format with two keys:
- `response`: Your response (`"Accept"` or `"Reject"`).
- `reasoning`: A brief explanation of how you made your prediction.
"""
if self.debug:
print(instruction)
is_pagoda_model = ":" in self.model
if is_pagoda_model:
return await self.run_pagoda(instruction)
for attempt in range(self.max_retries):
agent = AssistantAgent(
name="Proposer",
model_client=self.model_client,
system_message="You are a helpful assistant."
)
response = await agent.on_messages(
[TextMessage(content=instruction, source="user")],
cancellation_token=CancellationToken(),
)
            try:
                # Get the content from the chat message
                raw_text = response.chat_message.content
                if self.debug:
                    print(f"Raw content (Attempt {attempt + 1}): {raw_text}")
                # Try to load JSON directly
                try:
                    response_json = json.loads(raw_text)
                except json.JSONDecodeError:
                    # If it's wrapped in ```json ... ```, extract it
                    match = re.search(r'```json\s*(.*?)\s*```', raw_text, re.DOTALL)
                    if match:
                        response_json = json.loads(match.group(1))
                    else:
                        print(f"Could not parse JSON from response (Attempt {attempt + 1})")
                        continue
                agent_response = AgentResponse(**response_json)
                # Return a dictionary, consistent with run_pagoda and apply_strategy
                return agent_response.model_dump()
except Exception as e:
print(f"Error in OpenAI response handling (Attempt {attempt + 1}): {e}")
raise ValueError("Model failed to provide a valid response after multiple attempts.")
async def run_pagoda(self, instruction) -> Dict:
"""Runs the Pagoda model using a direct request."""
url = self.base_url
headers = {
"Authorization": f"Bearer {PAGODA_API_KEY}",
"Content-Type": "application/json"
}
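        # Ollama-style /api/generate payload: the response_format field embeds a JSON schema
        # so the gateway constrains the reply to the AgentResponse structure.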
payload = {
"model": self.model,
"temperature": self.temperature,
"prompt": instruction,
"stream": False,
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "AgentResponse",
"strict": True,
"schema": {
"title": "AgentResponse",
"type": "object",
"properties": {
"response": {
"title": "My Response",
"type": "string",
},
"reasoning": {
"title": "Reasoning",
"type": "string"
}
},
"required": ["response" "reasoning"],
"additionalProperties": False
}
}
}
}
for attempt in range(self.max_retries):
try:
response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
# Get the JSON response
response_data = response.json()
# Debug: print the raw response to check if fields are missing or named differently
if self.debug:
print(f"Raw response (Attempt {attempt+1}): {response_data}")
# The response field should be parsed correctly if it's already valid JSON
response_json = response_data.get('response', '')
# If the response is a string containing JSON, we need to extract and parse it
if isinstance(response_json, str):
# Try to parse the response as JSON
try:
response_dict = json.loads(response_json)
except json.JSONDecodeError:
# If the response is not valid JSON, apply regex to extract the JSON portion
match = re.search(r"```json(.*?)```", response_json, re.DOTALL)
if match:
response_dict = json.loads(match.group(1))
else:
print(f"Invalid response format detected (Attempt {attempt + 1}): {response_json}")
continue
elif isinstance(response_json, dict):
# If response_json is already a dictionary, just use it
response_dict = response_json
else:
print(f"Unexpected format in 'response' field (Attempt {attempt + 1}): {response_json}")
continue
# Validate the response structure
agent_response = AgentResponse(**response_dict)
return agent_response.model_dump()
except Exception as e:
print(f"Error in Pagoda request (Attempt {attempt + 1}): {e}")
raise ValueError("Pagoda model failed to provide a valid response after multiple attempts.")
def apply_strategy(self) -> Dict:
"""Generates a response based on predefined strategies."""
if self.model == "gpt-4.5-preview-2025-02-27":
# Threshold below which the offer is considered insulting or unfair
fairness_threshold = 0.2 # 20% of the total amount
offer_ratio = self.offer / self.amount
if offer_ratio >= fairness_threshold:
response = "Accept"
reasoning = (
f"The offer of ${self.offer} is {offer_ratio:.0%} of the total amount (${self.amount}), "
f"which is deemed fair or acceptable. Accepting ensures a positive payout."
)
else:
response = "Reject"
reasoning = (
f"The offer of ${self.offer} is only {offer_ratio:.0%} of the total amount (${self.amount}), "
f"which is considered unfair. Rejecting signals a preference for fairness over gain."
)
return {
"response": response,
"reasoning": reasoning
}
if self.model in ["mistral_small"]:
offer_percentage = self.offer / self.amount * 100
if offer_percentage >= 50:
# Accept the offer because it's fair or generous.
response = "Accept"
reasoning = f"The offer is {offer_percentage}% of the total amount, which is considered reasonable."
elif offer_percentage < 25:
# Reject the offer because it's too low.
response = "Reject"
reasoning = f"The offer is only {offer_percentage}% of the total amount, which is not acceptable."
else:
# Randomly decide for offers between 25% and 49%
if random.random() < 0.5: # You can adjust this probability as needed.
response = "Accept"
reasoning = f"The offer is {offer_percentage}% of the total amount, which could be acceptable."
else:
response = "Reject"
reasoning = f"The offer is {offer_percentage}% of the total amount, which might not be enough."
return {"response": response, "reasoning": reasoning}
if self.model in ["llama3", "deepseek-r1", "qwen3", "mixtral:8x7b"]:
if self.offer >= (self.amount / 2):
# Accept offer
return {"response": "Accept",
"reasoning": f"The offered amount is more than half of the total, so I accept."}
else:
# Reject offer
return {"response": "Reject",
"reasoning": f"The offered amount is less than half of the total, so I reject."}
if self.model in ["llama3.3:latest"]:
if self.offer / self.amount >= 0.3:
response = "Accept"
reasoning = f"The offer of ${self.offer} is greater than or equal to 30% of the total amount (${self.amount}), so I accept."
else:
response = "Reject"
reasoning = f"The offer of ${self.offer} is less than 30% of the total amount (${self.amount}), so I reject."
return {"response": response, "reasoning": reasoning}
return None
# Run the async function and return the response
if __name__ == "__main__":
agent = Responder(amount=100, offer=40, model="qwen3", temperature=0.7, strategy=False)
# "gpt-4.5-preview-2025-02-27" "llama3", "mistral-small", "deepseek-r1", "qwen3", "mixtral:8x7b", "llama3.3:latest", "deepseek-r1:7b"
response_json = asyncio.run(agent.run())
print(response_json)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Definition of the color palette
color_palette = {
'random': '#333333', # Black
'gpt-4.5-preview-2025-02-27': '#7abaff', # BlueEscape
'llama3': '#32a68c', # GreenFuture
'llama3.3:latest': '#4b9f7d', # GreenLlama3.3
'mistral-small': '#ff6941', # WarmOrange
'mixtral:8x7b': '#f1a61a', # YellowMixtral
'deepseek-r1': '#5862ed', # InclusiveIndigo
'deepseek-r1:7b': '#9a7bff', # PurpleDeepseek-r1:7b
'qwen3': '#c02942'
}
# Load the data
data = pd.read_csv("../../data/ultimatum/responder.csv") # Replace with the correct path to your CSV file
# Specify the order of models for the x-axis
model_order = [
'gpt-4.5-preview-2025-02-27',
'llama3', 'llama3.3:latest', # Place llama3 and llama3.3:latest together
'mistral-small', 'mixtral:8x7b', # Bring mistral-small and mixtral:8x7b closer
'deepseek-r1', 'deepseek-r1:7b',
'qwen3'
]
# Create the violin plot
plt.figure(figsize=(12, 6))
sns.violinplot(
data=data,
x="model",
y="accept",
hue="model", # Use hue to manage the colors
palette=color_palette,
inner="quartile", # Displays quartiles inside the violin
density_norm="width", # Normalizes the width of the violins for comparison
order=model_order # Explicitly set the order of the models on the x-axis
)
# Add the median values as annotations on the plot
for model in model_order:
    median = data[data['model'] == model]['accept'].median()
    plt.text(model_order.index(model), median, f'{median:.1f}',
             horizontalalignment='center', verticalalignment='bottom')
# Set the y-axis limits between 0 and 1
plt.ylim(0.0, 1.0)
# Labels and title
plt.xlabel("Model")
plt.ylabel("Acceptance rate")
plt.title("Distribution of acceptance rate by model in the ultimatum game")
plt.legend("")
# Save and display the plot
plt.savefig("../../figures/ultimatum/responder_violin.svg", format="svg")
import asyncio
from responder import Responder
class UltimatumExperiment:
debug = True
def __init__(self, models: list[str], temperature: float, amount: int, offer: int, iterations: int, output_file: str):
self.models = models
self.temperature = temperature
self.amount = amount
self.offer = offer
self.iterations = iterations
self.output_file = output_file
with open(self.output_file, 'w', encoding='utf-8') as f:
f.write("iteration,model,temperature,amount,offer,accept,reasoning\n")
async def run_experiment(self):
for model in self.models:
if self.debug:
print(f"Running experiment for model: {model}")
for iteration in range(1, self.iterations + 1):
game_agent = Responder(amount=self.amount, offer=self.offer, model=model, temperature=self.temperature)
feedback = await game_agent.run()
if self.debug:
print(feedback)
                # Extract the decision and encode it as 1.0 (Accept) or 0.0 (Reject)
                answer = feedback['response']
                answer_value = 1.0 if answer == "Accept" else 0.0
                reasoning = feedback['reasoning'].replace('"', '""')
                with open(self.output_file, 'a', encoding='utf-8') as f:
                    f.write(f'{iteration},{model},{self.temperature},{self.amount},{self.offer},{answer_value},"{reasoning}"\n')
if __name__ == "__main__":
models = ["mixtral:8x7b", "llama3.3:latest", "deepseek-r1:7b"]
# "gpt-4.5-preview-2025-02-27" "llama3", "mistral-small", "deepseek-r1", "qwen3", "mixtral:8x7b", "llama3.3:latest", "deepseek-r1:7b"
temperature = 0.7
amount = 100
offer = 40
iterations = 30
output_file = '../../data/ultimatum/responder.csv'
experiment = UltimatumExperiment(models=models, temperature=temperature, amount=amount, offer=offer, iterations=iterations, output_file=output_file)
asyncio.run(experiment.run_experiment())
print(f"Experiment results saved to {output_file}")