Commit c1f55303 authored by Maxime Morge

Evaluate the assessment of beliefs with Pagoda

parent 78406e1b
@@ -344,6 +344,8 @@ Beliefs, whether implicit, explicit, or
 given, are crucial for an autonomous agent's decision-making process. They
 allow for anticipating the actions of other agents.
 
+### Refine beliefs
+
 To assess the agents' ability to refine their beliefs in predicting their
 interlocutor's next action, we consider a simplified version of the
 Rock-Paper-Scissors (RPS) game where:
@@ -359,24 +361,37 @@ For our experiments, we consider three simple models for the opponent where:
 We evaluate the models' ability to identify these behavioural patterns by
 calculating the average number of points earned per round.
 
-Figures presents the average points earned per round and the
-95\% confidence interval for each LLM against the three opponent behaviour
-models in the simplified version of the RPS game, whether the LLM generates a
-strategy or one-shot actions. We observe that the performance of LLMs in action
-generation, except for Mistral-Small when facing a constant strategy,
-is barely better than a random strategy. The strategies generated by the
-GPT-4.5 and Mistral-Small models predict the opponent's next
-move based on previous rounds by identifying the most frequently played move.
-While these strategies are effective against an opponent with a constant
-behavior, they fail to predict the opponent's next move when the latter adopts a
-more complex model. Neither Llama3 nor DeepSeek-R1 were able
-to generate a valid strategy.
+The figures present the average points earned per round and the
+95% confidence interval for each LLM against the three opponent behaviour
+models in a simplified version of the Rock-Paper-Scissors (RPS) game,
+whether the LLM generates a strategy or one-shot actions.
+
+Neither <tt>Llama3</tt> nor <tt>DeepSeek-R1</tt> was able to generate a valid strategy.
+<tt>DeepSeek-R1:7b</tt> was unable to generate either a valid strategy
+or consistently valid actions.
+
+We observe that the performance of most LLMs in action generation,
+except for <tt>Llama3.3:latest</tt>, <tt>Mixtral:8x7b</tt>, and <tt>Mistral-Small</tt>
+when facing a constant strategy, is barely better than a <tt>random</tt> strategy.
+The strategies generated by the <tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> models
+attempt to predict the opponent's next move based on previous rounds
+by identifying the most frequently played move.
+While these strategies are effective against an opponent with a constant behaviour,
+they fail to predict the opponent's next move when the latter adopts a more complex model.
-![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant.svg)
+![Average Points Earned per Round by Strategies Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_strategies.svg)
+![Average Points Earned per Round by Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_models.svg)
 
-![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop.svg)
+![Average Points Earned per Round by Strategies Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_strategies.svg)
+![Average Points Earned per Round by Actions Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_models.svg)
 
-![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop.svg)
+![Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_strategies.svg)
+![Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_models.svg)
+
+### Assess Beliefs
 
 To assess the agents' ability to factor the prediction of their opponent's next
 move into their decision-making, we analyse the performance of each generative
......
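The README's observation that a most-frequent-move predictor does well against a constant opponent but poorly against looping opponents can be reproduced with a small offline simulation. The sketch below is only an illustration, not the repository's code: the names `most_frequent_prediction` and `average_points` are hypothetical, one point per correct prediction is assumed, and ties (including the empty first-round history) are broken in favour of the first move in `MOVES`.

```python
from collections import Counter
from typing import Callable, List

MOVES = ["Rock", "Paper", "Scissors"]


def most_frequent_prediction(opponent_history: List[str]) -> str:
    """Predict the move the opponent has played most often so far.

    Ties (and an empty history) resolve to the earliest entry in MOVES,
    because max() keeps the first maximal element it encounters.
    """
    counts = Counter(opponent_history)
    return max(MOVES, key=lambda move: counts[move])


def average_points(opponent: Callable[[int], str], rounds: int = 10) -> float:
    """Average points per round, scoring 1 for each correct prediction."""
    history: List[str] = []
    points = 0
    for round_index in range(rounds):
        prediction = most_frequent_prediction(history)
        actual = opponent(round_index)
        points += int(prediction == actual)
        history.append(actual)
    return points / rounds


def constant_paper(round_index: int) -> str:
    return "Paper"


def two_loop(round_index: int) -> str:
    return ["Rock", "Paper"][round_index % 2]


def three_loop(round_index: int) -> str:
    return MOVES[round_index % 3]


if __name__ == "__main__":
    for name, opponent in [("constant", constant_paper),
                           ("2-loop", two_loop),
                           ("3-loop", three_loop)]:
        print(f"{name}: {average_points(opponent):.1f}")
```

Under these assumptions the predictor only misses the first round against the constant opponent, while against the loops its score depends largely on how ties happen to be broken rather than on any recognition of the cycle.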
@@ -2,24 +2,30 @@ import os
 import asyncio
 import csv
 import random
+import json
+import re
+import requests
 from typing import Dict, Literal, List, Callable
 from pydantic import BaseModel, ValidationError
 from autogen_agentchat.agents import AssistantAgent
 from autogen_agentchat.messages import TextMessage
 from autogen_core import CancellationToken
 from autogen_ext.models.openai import OpenAIChatCompletionClient
-import json
 
-# Load API key from environment variable
+# Load API keys from environment variables
 OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
+PAGODA_API_KEY = os.getenv("PAGODA_API_KEY")
 if not OPENAI_API_KEY:
     raise ValueError("Missing OPENAI_API_KEY. Set it as an environment variable.")
+if not PAGODA_API_KEY:
+    raise ValueError("Missing PAGODA_API_KEY. Set it as an environment variable.")
 
-CSV_FILE_PATH = "../../data/rps/rps.csv"
+CSV_FILE_PATH = "../../data/guess/guess.csv"
 
 # Define the expected response format as a Pydantic model
 class AgentResponse(BaseModel):
-    prediction: Literal["Rock", "Paper", "Scissor"]
+    prediction: Literal["Rock", "Paper", "Scissors"]
     reasoning: str
 
 # Define Guess simulation class
@@ -34,26 +40,32 @@ class Guess:
         self.opponent_strategy_fn = opponent_strategy_fn
         self.strategy = strategy  # Determines whether to use a model or a rule-based method
 
-        if not strategy:  # Use model-based prediction
-            is_openai_model = model.startswith("gpt")
-            base_url = "https://api.openai.com/v1" if is_openai_model else "http://localhost:11434/v1"
-            model_info = {
-                "temperature": self.temperature,
-                "function_calling": True,
-                "parallel_tool_calls": True,
-                "family": "unknown",
-                "json_output": True,
-                "vision": False
-            }
-            self.model_client = OpenAIChatCompletionClient(
-                model=self.model,
-                base_url=base_url,
-                api_key=OPENAI_API_KEY,
-                model_info=model_info,
-                response_format=AgentResponse
-            )
-        else:
-            self.model_client = None  # No model needed for rule-based strategy
+        is_openai_model = model.startswith("gpt")
+        is_pagoda_model = ":" in model
+
+        self.base_url = (
+            "https://api.openai.com/v1" if is_openai_model else
+            "https://ollama-ui.pagoda.liris.cnrs.fr/ollama/api/generate" if is_pagoda_model else
+            "http://localhost:11434/v1"
+        )
+
+        model_info = {
+            "temperature": self.temperature,
+            "function_calling": True,
+            "parallel_tool_calls": True,
+            "family": "unknown",
+            "json_output": True,
+            "vision": False
+        }
+
+        self.model_client = OpenAIChatCompletionClient(
+            model=self.model,
+            base_url=self.base_url,
+            api_key=OPENAI_API_KEY,
+            model_info=model_info,
+            response_format=AgentResponse
+        )
 
     async def play_round(self, round_id: int) -> Dict:
         """Plays a single round of Guess The Next Move."""
@@ -95,9 +107,15 @@ class Guess:
### **Your Task:** ### **Your Task:**
Based on the game history, predict the opponent's next move. Based on the game history, predict the opponent's next move.
Return your response in JSON format with two keys: Return your response in JSON format with two keys:
- `"prediction"`: Your predicted move (`"Rock"`, `"Paper"`, or `"Scissor"`). - `"prediction"`: Your predicted move (`"Rock"`, `"Paper"`, or `"Scissors"`).
- `"reasoning"`: A brief explanation of how you made your prediction. - `"reasoning"`: A brief explanation of how you made your prediction.
""" """
is_pagoda_model = ":" in self.model
if is_pagoda_model:
return await self.run_pagoda(instruction)
for attempt in range(1, self.max_retries + 1): for attempt in range(1, self.max_retries + 1):
agent = AssistantAgent( agent = AssistantAgent(
name="Player", name="Player",
@@ -113,47 +131,124 @@ class Guess:
agent_response = AgentResponse.model_validate_json(response_data) agent_response = AgentResponse.model_validate_json(response_data)
move, reasoning = agent_response.prediction, agent_response.reasoning move, reasoning = agent_response.prediction, agent_response.reasoning
if move in ["Rock", "Paper", "Scissor"]: if move in ["Rock", "Paper", "Scissors"]:
return move, reasoning return move, reasoning
except (ValidationError, json.JSONDecodeError) as e: except (ValidationError, json.JSONDecodeError) as e:
print(f"Error parsing response (Attempt {attempt}): {e}") print(f"Error parsing response (Attempt {attempt}): {e}")
raise ValueError("Model failed to provide a valid response after multiple attempts.") raise ValueError("Model failed to provide a valid response after multiple attempts.")
# Inside the Guess class
async def run_pagoda(self, instruction: str):
headers = {
"Authorization": f"Bearer {os.getenv('PAGODA_API_KEY')}",
"Content-Type": "application/json"
}
payload = {
"model": self.model,
"temperature": self.temperature,
"prompt": instruction,
"stream": False
}
for attempt in range(self.max_retries):
try:
response = requests.post(self.base_url, headers=headers, json=payload)
response.raise_for_status()
response_data = response.json()
raw_response = response_data.get("response", "")
parsed_json = self.extract_json_from_response(raw_response)
if not parsed_json:
print(f"Failed to parse JSON (Attempt {attempt + 1}): {raw_response}")
continue
agent_response = AgentResponse(**parsed_json)
if agent_response.prediction in ["Rock", "Paper", "Scissors"]:
return agent_response.prediction, agent_response.reasoning
except Exception as e:
print(f"Error in run_pagoda (Attempt {attempt + 1}): {e}")
raise ValueError("run_pagoda failed to get a valid response.")
def extract_json_from_response(self, text: str) -> dict:
"""Extract JSON object from raw model output."""
try:
json_str = re.search(r"\{.*\}", text, re.DOTALL)
if json_str:
return json.loads(json_str.group())
except Exception as e:
print(f"Error extracting JSON: {e}")
return {}
def apply_strategy(self): def apply_strategy(self):
"""Predicts the next move using a heuristic.""" """Predicts the next move using a heuristic."""
if self.model == "gpt-4.5-preview-2025-02-27": if self.model == "gpt-4.5-preview-2025-02-27":
if not self.history: if not self.history:
return random.choice(["Rock", "Paper", "Scissor"]), "No history available. Choosing randomly." return random.choice(["Rock", "Paper", "Scissors"]), "No history available. Choosing randomly."
# Count occurrences of each move move_counts = {"Rock": 0, "Paper": 0, "Scissors": 0}
move_counts = {"Rock": 0, "Paper": 0, "Scissor": 0}
for round_data in self.history: for round_data in self.history:
move_counts[round_data["Opponent Move"]] += 1 move_counts[round_data["Opponent Move"]] += 1
# Find the most common move
most_common_move = max(move_counts, key=move_counts.get) most_common_move = max(move_counts, key=move_counts.get)
predicted_move = most_common_move
reasoning = f"Based on history, the opponent most frequently played {most_common_move}." reasoning = f"Based on history, the opponent most frequently played {most_common_move}."
return predicted_move, reasoning return most_common_move, reasoning
if self.model == "llama3": elif self.model == "mistral-small":
return ["None", "error"]
if self.model == "mistral-small":
if not self.history: if not self.history:
# If there is no history, we can't make an educated guess. return "Scissors", "No game history available."
return ["Scissor", "No game history available."]
opponent_moves = [move['Opponent Move'] for move in self.history] opponent_moves = [move['Opponent Move'] for move in self.history]
move_count = { move_count = {
'Rock': opponent_moves.count('Rock'), 'Rock': opponent_moves.count('Rock'),
'Paper': opponent_moves.count('Paper'), 'Paper': opponent_moves.count('Paper'),
'Scissors': opponent_moves.count('Scissor') 'Scissors': opponent_moves.count('Scissors')
} }
# Determine the most frequent move
max_move = max(move_count, key=move_count.get) max_move = max(move_count, key=move_count.get)
if move_count[max_move] > 0: reasoning = f"Predicted {max_move} because it has been played {move_count[max_move]} times."
reasoning = f"Predicted {max_move} because it has been played {move_count[max_move]} times."
else:
reasoning = "Unable to determine a pattern; defaulting to Scissors."
return max_move, reasoning return max_move, reasoning
if self.model == "deepseek-r1": elif self.model in ["llama3", "deepseek-r1"]:
return ["None", "error"] return "Rock", f"Fallback strategy used for model: {self.model}."
elif self.model == "llama3.3:latest":
if not self.history:
# First round, make an arbitrary choice
return "Rock", "First round guess."
rock_count = sum(1 for r in self.history if r['Opponent Move'] == 'Rock')
paper_count = sum(1 for r in self.history if r['Opponent Move'] == 'Paper')
scissors_count = sum(1 for r in self.history if r['Opponent Move'] == 'Scissors')
# Predict the next move based on the most common opponent move
max_count = max(rock_count, paper_count, scissors_count)
if max_count == rock_count:
strategy_move = "Paper" # Paper beats Rock
elif max_count == paper_count:
strategy_move = "Scissors" # Scissors beats Paper
else:
strategy_move = "Rock" # Rock beats Scissors
return strategy_move, f"Strategy chose {strategy_move} based on opponent's move history."
elif self.model == "mixtral:8x7b":
recent_moves = self.history
if len(self.history) >= 3:
recent_moves = self.history[-3:]
return ["Rock", "Paper", "Scissors"][self.history.index(recent_moves[-1]) % 3][-1], "Recent move"
else:
# Otherwise, use a simple strategy based on the last move
opponent_last_move = recent_moves[-1] if recent_moves else None
if not opponent_last_move:
return "Rock", "Recent move"
else:
# Winning combinations
if opponent_last_move == "Scissors":
return "Paper", "Recent move"
elif opponent_last_move == "Paper":
return "Rock", "Recent move"
elif opponent_last_move == "Rock":
return "Scissors", "Recent move"
else:
return "Rock", "Recent move"
elif self.model == "deepseek-r1:7b":
moves = ["Rock", "Paper", "Scissors"]
return moves[len(self.history) % 3], "making decisions in a cyclic manner"
else:
return "Scissors", f"Unknown model '{self.model}'. Defaulting to Scissors."
@staticmethod @staticmethod
def determine_accuracy(player_move: str, opponent_move: str) -> int: def determine_accuracy(player_move: str, opponent_move: str) -> int:
@@ -176,14 +271,16 @@ class Guess:
summary += f"\nCurrent Score - You: {self.player_score_game}\n" summary += f"\nCurrent Score - You: {self.player_score_game}\n"
return summary return summary
def simple_opponent_strategy(history):
"""A simple opponent strategy that cycles through Rock, Paper, Scissor.""" def simple_opponent_strategy(history):
moves = ["Rock", "Paper", "Scissor"] """A simple opponent strategy that cycles through Rock, Paper, Scissors."""
return moves[len(history) % 3] moves = ["Rock", "Paper", "Scissors"]
return moves[len(history) % 3]
async def main(): async def main():
# Play with strategy-based approach # Play with strategy-based approach
game = Guess(model="mistral-small", temperature=0.7, game_id=1, opponent_strategy_fn=lambda history: "Rock", strategy=True) game = Guess(model="deepseek-r1:7b", temperature=0.7, game_id=1, opponent_strategy_fn=lambda history: "Rock", strategy=True)# "llama3.3:latest", "mixtral:8x7b", "deepseek-r1:7b"
num_rounds = 10 num_rounds = 10
for round_id in range(1, num_rounds + 1): for round_id in range(1, num_rounds + 1):
result = await game.play_round(round_id) result = await game.play_round(round_id)
......
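The `run_pagoda` path added in this commit depends on pulling a JSON object out of free-form model output before validating it. The following is a minimal standard-library sketch of that parse-then-validate step, written under stated assumptions: `parse_prediction` is a hypothetical helper (not part of the commit), and a plain membership check stands in for the Pydantic `AgentResponse` model.

```python
import json
import re
from typing import Optional, Tuple

VALID_MOVES = {"Rock", "Paper", "Scissors"}


def parse_prediction(raw: str) -> Optional[Tuple[str, str]]:
    """Return (prediction, reasoning) from raw model text, or None if invalid.

    Uses the same greedy pattern as the commit: everything from the first
    '{' to the last '}' is treated as one JSON document.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    prediction = data.get("prediction")
    reasoning = data.get("reasoning")
    # Check the types first so an unhashable value never reaches the set lookup.
    if not isinstance(prediction, str) or not isinstance(reasoning, str):
        return None
    if prediction not in VALID_MOVES:
        return None
    return prediction, reasoning


if __name__ == "__main__":
    print(parse_prediction('Sure! {"prediction": "Rock", "reasoning": "constant"}'))
    print(parse_prediction("I think the opponent will play Rock."))
```

One consequence of the greedy pattern is that a reply containing two separate JSON objects is captured as a single span and rejected by `json.loads`; in that case the caller simply retries, which matches the retry loop in `run_pagoda`.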
import pandas as pd import pandas as pd
import numpy as np
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
import numpy as np
# Path to the CSV file # Path to the CSV file
CSV_FILE_PATH = "../../data/guess/guess.csv" CSV_FILE_PATH = "../../data/guess/guess.csv"
@@ -15,59 +15,81 @@ df["outcomeRound"] = df["outcomeRound"].astype(float)
# List of opponent strategies to consider # List of opponent strategies to consider
opponent_strategies = ["R-P", "P-S", "S-R"] opponent_strategies = ["R-P", "P-S", "S-R"]
# **Fix Warning**: Ensure we work with a full copy # Filter and copy the relevant subset
df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy() df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy()
# Custom color palette for models # Color palette
color_palette = { color_palette = {
'gpt-4.5-preview-2025-02-27': '#7abaff', # BlueEscape 'gpt-4.5-preview-2025-02-27': '#7abaff',
'gpt-4.5-preview-2025-02-27 strategy': '#000037', # BlueHorizon 'gpt-4.5-preview-2025-02-27 strategy': '#7abaff',
'llama3': '#32a68c', # vertAvenir 'llama3': '#32a68c',
'mistral-small': '#ff6941', # orangeChaleureux 'llama3 strategy': '#32a68c',
'mistral-small strategy': '#ffd24b', # yellow determined 'llama3.3:latest': '#4b9f7d',
'deepseek-r1': '#5862ed' # indigoInclusif 'llama3.3:latest strategy': '#4b9f7d',
'mistral-small': '#ff6941',
'mistral-small strategy': '#ff6941',
'mixtral:8x7b': '#f1a61a',
'mixtral:8x7b strategy': '#f1a61a',
'deepseek-r1': '#5862ed',
'deepseek-r1 strategy': '#5862ed',
'deepseek-r1:7b': '#9a7bff',
'deepseek-r1:7b strategy': '#9a7bff',
'random': '#000000',
} }
# Group by model and round number, compute mean and standard deviation # Aggregate data
summary = df_filtered.groupby(["model", "idRound"]).agg( agg_data = df_filtered.groupby(["model", "idRound"]).agg(
mean_outcome=("outcomeRound", "mean"), mean_outcome=("outcomeRound", "mean"),
std_outcome=("outcomeRound", "std"), sem_outcome=("outcomeRound", lambda x: np.std(x, ddof=1) / np.sqrt(len(x)))
count=("outcomeRound", "count")
).reset_index() ).reset_index()
# Compute standard error (SEM) agg_data["ci95"] = 1.96 * agg_data["sem_outcome"]
summary["sem"] = summary["std_outcome"] / np.sqrt(summary["count"])
# Compute 95% confidence intervals ### --- First Figure: Models (no 'strategy' in name) ---
summary["ci_upper"] = summary["mean_outcome"] + (1.96 * summary["sem"])
summary["ci_lower"] = summary["mean_outcome"] - (1.96 * summary["sem"])
# Set the figure size
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
model_only = agg_data[~agg_data["model"].str.contains("strategy")]
# Loop through each model and plot its performance with confidence interval for model in model_only["model"].unique():
for model in summary["model"].unique(): df_model = model_only[model_only["model"] == model]
df_model = summary[summary["model"] == model] color = color_palette.get(model, '#63656a')
# Plot mean outcome plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
plt.plot(df_model["idRound"], df_model["mean_outcome"],
label=model,
color = color_palette.get(model, '#63656a')) # Default to light gray if model not in palette
# Plot confidence interval as a shaded region
plt.fill_between(df_model["idRound"], plt.fill_between(df_model["idRound"],
df_model["ci_lower"], df_model["ci_upper"], df_model["mean_outcome"] - df_model["ci95"],
color=color_palette.get(model, '#333333'), df_model["mean_outcome"] + df_model["ci95"],
alpha=0.2) # Transparency for better visibility color=color, alpha=0.2)
# Add legends and labels
plt.xlim(1, 10) plt.xlim(1, 10)
plt.xlabel("Round Number") plt.xlabel("Round Number")
plt.ylabel("Average Points Earned") plt.ylabel("Average Points Earned")
plt.title("Average Points Earned per Round Against 2-Loop Behaviour (95% CI)") plt.title("Model Performance Against 2-Loop Behaviour")
plt.legend() plt.legend()
plt.grid(True) plt.grid(True)
plt.ylim(0, 1) # Points are between 0 and 2 plt.ylim(0, 1)
plt.savefig('../../figures/guess/guess_2loop_models.svg', format='svg')
### --- Second Figure: Strategies (models with 'strategy' in name) ---
plt.figure(figsize=(10, 6))
strategy_only = agg_data[agg_data["model"].str.contains("strategy")]
for model in strategy_only["model"].unique():
df_model = strategy_only[strategy_only["model"] == model]
color = color_palette.get(model, '#63656a')
# Save the figure as an SVG file plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
plt.savefig('../../figures/guess/guess_2loop.svg', format='svg') plt.fill_between(df_model["idRound"],
df_model["mean_outcome"] - df_model["ci95"],
df_model["mean_outcome"] + df_model["ci95"],
color=color, alpha=0.2)
plt.xlim(1, 10)
plt.xlabel("Round Number")
plt.ylabel("Average Points Earned")
plt.title("Model Strategies vs 2-Loop Behaviour")
plt.legend()
plt.grid(True)
plt.ylim(0, 1)
plt.savefig('../../figures/guess/guess_2loop_strategies.svg', format='svg')
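The plotting scripts shade a band of 1.96 times the standard error of the mean around each per-round average. The sketch below restates that computation with the standard library only, so the arithmetic can be checked by hand; `mean_ci95` is a hypothetical helper name, and returning NaN for a single observation is an assumption made here for the undefined case.

```python
import math
import statistics
from typing import List, Tuple


def mean_ci95(outcomes: List[float]) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation 95% interval.

    half_width = 1.96 * s / sqrt(n), where s is the sample standard
    deviation (n - 1 in the denominator, i.e. ddof=1).
    """
    mean = statistics.mean(outcomes)
    if len(outcomes) < 2:
        # The sample standard deviation is undefined for one observation.
        return mean, float("nan")
    sem = statistics.stdev(outcomes) / math.sqrt(len(outcomes))
    return mean, 1.96 * sem


if __name__ == "__main__":
    # Four games in a given round: three correct predictions, one miss.
    mean, half_width = mean_ci95([1.0, 0.0, 1.0, 1.0])
    print(f"{mean:.2f} +/- {half_width:.2f}")
```

With only about ten games per configuration, the 1.96 factor is a normal approximation; a Student t quantile would give a somewhat wider band, and because outcomes are bounded in [0, 1] the shaded region can extend past the plotted y-limits.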
@@ -15,59 +15,98 @@ df["outcomeRound"] = df["outcomeRound"].astype(float)
# List of opponent strategies to consider # List of opponent strategies to consider
opponent_strategies = ["R-P-S"] opponent_strategies = ["R-P-S"]
# **Fix Warning**: Ensure we work with a full copy import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Path to the CSV file
CSV_FILE_PATH = "../../data/guess/guess.csv"
# Load the data
df = pd.read_csv(CSV_FILE_PATH)
# Convert necessary columns to appropriate types
df["idRound"] = df["idRound"].astype(int)
df["outcomeRound"] = df["outcomeRound"].astype(float)
# List of opponent strategies to consider
opponent_strategies = ["R-P-S"]
# Filter and copy the relevant subset
df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy() df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy()
# Custom color palette for models # Color palette
color_palette = { color_palette = {
'gpt-4.5-preview-2025-02-27': '#7abaff', # BlueEscape 'gpt-4.5-preview-2025-02-27': '#7abaff',
'gpt-4.5-preview-2025-02-27 strategy': '#000037', # BlueHorizon 'gpt-4.5-preview-2025-02-27 strategy': '#7abaff',
'llama3': '#32a68c', # vertAvenir 'llama3': '#32a68c',
'mistral-small': '#ff6941', # orangeChaleureux 'llama3 strategy': '#32a68c',
'mistral-small strategy': '#ffd24b', # yellow determined 'llama3.3:latest': '#4b9f7d',
'deepseek-r1': '#5862ed' # indigoInclusif 'llama3.3:latest strategy': '#4b9f7d',
'mistral-small': '#ff6941',
'mistral-small strategy': '#ff6941',
'mixtral:8x7b': '#f1a61a',
'mixtral:8x7b strategy': '#f1a61a',
'deepseek-r1': '#5862ed',
'deepseek-r1 strategy': '#5862ed',
'deepseek-r1:7b': '#9a7bff',
'deepseek-r1:7b strategy': '#9a7bff',
'random': '#000000',
} }
# Group by model and round number, compute mean and standard deviation # Aggregate data
summary = df_filtered.groupby(["model", "idRound"]).agg( agg_data = df_filtered.groupby(["model", "idRound"]).agg(
mean_outcome=("outcomeRound", "mean"), mean_outcome=("outcomeRound", "mean"),
std_outcome=("outcomeRound", "std"), sem_outcome=("outcomeRound", lambda x: np.std(x, ddof=1) / np.sqrt(len(x)))
count=("outcomeRound", "count")
).reset_index() ).reset_index()
# Compute standard error (SEM) agg_data["ci95"] = 1.96 * agg_data["sem_outcome"]
summary["sem"] = summary["std_outcome"] / np.sqrt(summary["count"])
# Compute 95% confidence intervals ### --- First Figure: Models (no 'strategy' in name) ---
summary["ci_upper"] = summary["mean_outcome"] + (1.96 * summary["sem"])
summary["ci_lower"] = summary["mean_outcome"] - (1.96 * summary["sem"])
# Set the figure size
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
model_only = agg_data[~agg_data["model"].str.contains("strategy")]
# Loop through each model and plot its performance with confidence interval for model in model_only["model"].unique():
for model in summary["model"].unique(): df_model = model_only[model_only["model"] == model]
df_model = summary[summary["model"] == model] color = color_palette.get(model, '#63656a')
# Plot mean outcome
plt.plot(df_model["idRound"], df_model["mean_outcome"],
label=model,
color = color_palette.get(model, '#63656a')) # Default to light gray if model not in palette
# Plot confidence interval as a shaded region plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
plt.fill_between(df_model["idRound"], plt.fill_between(df_model["idRound"],
df_model["ci_lower"], df_model["ci_upper"], df_model["mean_outcome"] - df_model["ci95"],
color=color_palette.get(model, '#333333'), df_model["mean_outcome"] + df_model["ci95"],
alpha=0.2) # Transparency for better visibility color=color, alpha=0.2)
# Add legends and labels plt.xlim(1, 10)
plt.xlabel("Round Number") plt.xlabel("Round Number")
plt.ylabel("Average Points Earned") plt.ylabel("Average Points Earned")
plt.title("Average Points Earned per Round Against 3-Loop Behaviour (95% CI)") plt.title("Model Performance Against 3-Loop Behaviour")
plt.legend() plt.legend()
plt.grid(True) plt.grid(True)
plt.xlim(1, 10) plt.ylim(0, 1)
plt.ylim(0, 1) # Points are between 0 and 2 plt.savefig('../../figures/guess/guess_3loop_models.svg', format='svg')
### --- Second Figure: Strategies (models with 'strategy' in name) ---
plt.figure(figsize=(10, 6))
strategy_only = agg_data[agg_data["model"].str.contains("strategy")]
# Save the figure as an SVG file for model in strategy_only["model"].unique():
plt.savefig('../../figures/guess/guess_3loop.svg', format='svg') df_model = strategy_only[strategy_only["model"] == model]
color = color_palette.get(model, '#63656a')
plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
plt.fill_between(df_model["idRound"],
df_model["mean_outcome"] - df_model["ci95"],
df_model["mean_outcome"] + df_model["ci95"],
color=color, alpha=0.2)
plt.xlim(1, 10)
plt.xlabel("Round Number")
plt.ylabel("Average Points Earned")
plt.title("Model Strategies vs 3-Loop Behaviour")
plt.legend()
plt.grid(True)
plt.ylim(0, 1)
plt.savefig('../../figures/guess/guess_3loop_strategies.svg', format='svg')
@@ -15,53 +15,81 @@ df["outcomeRound"] = df["outcomeRound"].astype(float)
# List of opponent strategies to consider # List of opponent strategies to consider
opponent_strategies = ["always_rock", "always_paper", "always_scissor"] opponent_strategies = ["always_rock", "always_paper", "always_scissor"]
# **Fix Warning**: Ensure we work with a full copy # Filter and copy the relevant subset
df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy() df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy()
# Custom color palette for models # Color palette
color_palette = { color_palette = {
'gpt-4.5-preview-2025-02-27': '#7abaff', # BlueEscape 'gpt-4.5-preview-2025-02-27': '#7abaff',
'gpt-4.5-preview-2025-02-27 strategy': '#000037', # BlueHorizon 'gpt-4.5-preview-2025-02-27 strategy': '#7abaff',
'llama3': '#32a68c', # vertAvenir 'llama3': '#32a68c',
'mistral-small': '#ff6941', # orangeChaleureux 'llama3 strategy': '#32a68c',
'mistral-small strategy': '#ffd24b', # yellow determined 'llama3.3:latest': '#4b9f7d',
'deepseek-r1': '#5862ed' # indigoInclusif 'llama3.3:latest strategy': '#4b9f7d',
'mistral-small': '#ff6941',
'mistral-small strategy': '#ff6941',
'mixtral:8x7b': '#f1a61a',
'mixtral:8x7b strategy': '#f1a61a',
'deepseek-r1': '#5862ed',
'deepseek-r1 strategy': '#5862ed',
'deepseek-r1:7b': '#9a7bff',
'deepseek-r1:7b strategy': '#9a7bff',
'random': '#000000',
} }
# Compute mean, standard error (SEM), and 95% confidence interval by model and round # Aggregate data
agg_data = df_filtered.groupby(["model", "idRound"]).agg( agg_data = df_filtered.groupby(["model", "idRound"]).agg(
mean_outcome=("outcomeRound", "mean"), mean_outcome=("outcomeRound", "mean"),
sem_outcome=("outcomeRound", lambda x: np.std(x, ddof=1) / np.sqrt(len(x))) # Standard error sem_outcome=("outcomeRound", lambda x: np.std(x, ddof=1) / np.sqrt(len(x)))
).reset_index() ).reset_index()
# Compute 95% Confidence Interval (CI) agg_data["ci95"] = 1.96 * agg_data["sem_outcome"]
agg_data["ci95"] = 1.96 * agg_data["sem_outcome"] # 95% confidence interval
### --- First Figure: Models (no 'strategy' in name) ---
# Set the figure size
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
model_only = agg_data[~agg_data["model"].str.contains("strategy")]
# Loop through each model and plot its aggregated performance across rounds for model in model_only["model"].unique():
for model in agg_data["model"].unique(): df_model = model_only[model_only["model"] == model]
df_model = agg_data[agg_data["model"] == model] color = color_palette.get(model, '#63656a')
color = color_palette.get(model, '#63656a') # Default to light gray if model not in palette
# Plot mean values
plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color) plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
# Add 95% confidence interval (shaded region)
plt.fill_between(df_model["idRound"], plt.fill_between(df_model["idRound"],
df_model["mean_outcome"] - df_model["ci95"], # Lower bound (95% CI) df_model["mean_outcome"] - df_model["ci95"],
df_model["mean_outcome"] + df_model["ci95"], # Upper bound (95% CI) df_model["mean_outcome"] + df_model["ci95"],
color=color, alpha=0.2) # Transparency for shading color=color, alpha=0.2)
# Add legends and labels
plt.xlim(1, 10) plt.xlim(1, 10)
plt.xlabel("Round Number") plt.xlabel("Round Number")
plt.ylabel("Average Points Earned") plt.ylabel("Average Points Earned")
plt.title("Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)") plt.title("Model Performance Against Constant Strategies")
plt.legend() plt.legend()
plt.grid(True) plt.grid(True)
plt.ylim(0, 1) # Points are between 0 and 1 plt.ylim(0, 1)
plt.savefig('../../figures/guess/guess_constant_models.svg', format='svg')
### --- Second Figure: Strategies (models with 'strategy' in name) ---
# Save the figure as an SVG file plt.figure(figsize=(10, 6))
plt.savefig('../../figures/guess/guess_constant.svg', format='svg') strategy_only = agg_data[agg_data["model"].str.contains("strategy")]
for model in strategy_only["model"].unique():
df_model = strategy_only[strategy_only["model"] == model]
color = color_palette.get(model, '#63656a')
plt.plot(df_model["idRound"], df_model["mean_outcome"], label=model, color=color)
plt.fill_between(df_model["idRound"],
df_model["mean_outcome"] - df_model["ci95"],
df_model["mean_outcome"] + df_model["ci95"],
color=color, alpha=0.2)
plt.xlim(1, 10)
plt.xlabel("Round Number")
plt.ylabel("Average Points Earned")
plt.title("Model Strategies vs Constant Behaviour")
plt.legend()
plt.grid(True)
plt.ylim(0, 1)
plt.savefig('../../figures/guess/guess_constant_strategies.svg', format='svg')
\ No newline at end of file
@@ -8,11 +8,11 @@ CSV_FILE_PATH = "../../data/guess/guess.csv"
# Define RPS Constant Experiment class # Define RPS Constant Experiment class
class GuessExperiment: class GuessExperiment:
def __init__(self): def __init__(self):
self.models = ["mistral-small"] # You can also add "llama3" "deepseek-r1" "gpt-4.5-preview-2025-02-27", self.models = ["llama3", "deepseek-r1",] # You can also add "llama3" "deepseek-r1" "gpt-4.5-preview-2025-02-27", "mistral-small", "llama3.3:latest", "mixtral:8x7b", "deepseek-r1:7b"
self.opponent_strategies = { self.opponent_strategies = {
"always_rock": lambda history: "Rock", "always_rock": lambda history: "Rock",
"always_paper": lambda history: "Paper", "always_paper": lambda history: "Paper",
"always_scissor": lambda history: "Scissor", "always_scissors": lambda history: "Scissors",
"R-P": self.loop_R_P, "R-P": self.loop_R_P,
"P-S": self.loop_P_S, "P-S": self.loop_P_S,
"S-R": self.loop_S_R, "S-R": self.loop_S_R,
@@ -20,11 +20,10 @@ class GuessExperiment:
} }
self.temperature = 0.7 self.temperature = 0.7
self.rounds = 10 self.rounds = 10
self.num_games_per_config = 10#30 self.num_games_per_config = 10# 30
self.strategy = False self.strategy = True
self.initialize_csv() self.initialize_csv()
@staticmethod @staticmethod
def loop_R_P(history): def loop_R_P(history):
"""Alternates between Rock and Paper (R-P)""" """Alternates between Rock and Paper (R-P)"""
@@ -40,20 +39,20 @@ class GuessExperiment:
if len(history) % 2 == 0: if len(history) % 2 == 0:
return "Paper" return "Paper"
else: else:
return "Scissor" return "Scissors"
@staticmethod @staticmethod
def loop_S_R(history): def loop_S_R(history):
"""Alternates between Scissors and Rock (S-R)""" """Alternates between Scissors and Rock (S-R)"""
if len(history) % 2 == 0: if len(history) % 2 == 0:
return "Scissor" return "Scissors"
else: else:
return "Rock" return "Rock"
@staticmethod @staticmethod
def loop_R_P_S(history): def loop_R_P_S(history):
"""Alternates between Rock, Paper, and Scissors (R-P-S)""" """Alternates between Rock, Paper, and Scissors (R-P-S)"""
strategies = ["Rock", "Paper", "Scissor"] strategies = ["Rock", "Paper", "Scissors"]
return strategies[len(history) % 3] return strategies[len(history) % 3]
def initialize_csv(self): def initialize_csv(self):
......
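The constant, 2-loop and 3-loop opponents in `GuessExperiment` all follow one rule: index a fixed cycle of moves by the number of rounds already played. As a hedged illustration of that shared pattern, the sketch below builds such opponents from a single factory; `make_loop_strategy` is a hypothetical helper and is not part of this commit.

```python
from typing import Callable, List, Sequence


def make_loop_strategy(cycle: Sequence[str]) -> Callable[[List[dict]], str]:
    """Build an opponent that repeats `cycle` forever.

    The returned function takes the game history (one entry per round
    already played) and returns the move for the next round.
    """
    moves = list(cycle)
    if not moves:
        raise ValueError("cycle must contain at least one move")

    def strategy(history: List[dict]) -> str:
        return moves[len(history) % len(moves)]

    return strategy


# Opponent models equivalent to those named in the experiment.
always_scissors = make_loop_strategy(["Scissors"])
loop_r_p = make_loop_strategy(["Rock", "Paper"])
loop_r_p_s = make_loop_strategy(["Rock", "Paper", "Scissors"])

if __name__ == "__main__":
    fake_history: List[dict] = []
    for _ in range(4):
        print(loop_r_p(fake_history))
        fake_history.append({})  # stand-in for a recorded round
```

Defining every opponent through one helper would keep move names in a single place, which is where the "Scissor"/"Scissors" inconsistencies fixed in this commit tended to arise.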