Commit 98e2aa16 authored by Maxime Morge

The assessment of beliefs with Pagoda

parent 9d155feb
@@ -368,19 +368,15 @@ whether the LLM generates a strategy or one-shot actions.
Neither <tt>Llama3</tt> nor <tt>DeepSeek-R1</tt> was able to generate a valid strategy.
<tt>DeepSeek-R1:7b</tt> was unable to generate either a valid strategy
or consistently valid actions. The strategies generated by the <tt>GPT-4.5</tt>
and <tt>Mistral-Small</tt> models attempt to predict the opponent’s next move based
on previous rounds by identifying the most frequently played move.
While these strategies are effective against an opponent with a constant behavior,
they fail to predict the opponent's next move when the latter adopts a more complex model.
We observe that the performance of most LLMs in action generation is barely
better than a <tt>random</tt> strategy, except for <tt>Llama3.3:latest</tt>,
<tt>Mixtral:8x7b</tt>, and <tt>Mistral-Small</tt> when facing a constant strategy.
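A minimal sketch of the frequency-based prediction these strategies implement (hypothetical code, not the models’ actual output; move names are assumed):

```python
from collections import Counter

def predict_next_move(opponent_history: list[str]) -> str:
    """Predict that the opponent will replay their most frequent past move."""
    if not opponent_history:
        return "Rock"  # no information yet: arbitrary opening guess
    return Counter(opponent_history).most_common(1)[0][0]
```

Against a constant opponent this prediction is correct from the second round onwards, but against a 2-loop or 3-loop opponent the most frequent past move is generally not the next one, which explains the failures reported above.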
![Average Points Earned per Round By Strategies Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_strategies.svg)
![Average Points Earned per Round By Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_models.svg)
@@ -391,44 +387,48 @@ they fail to predict the opponent's next move when the latter adopts a more complex model.
![Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_strategies.svg)
![Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_models.svg)
### Assess Beliefs
To assess the agents’ ability to factor the prediction of their opponent’s next
move into their decision-making, we analyse the performance of each generative
agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
and a loss 0 points.
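As a point of reference, this payoff scheme can be written as a small helper (a hypothetical sketch, not part of the evaluation code; move names are assumed to be "Rock", "Paper", "Scissors"):

```python
BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}

def score(my_move: str, opponent_move: str) -> int:
    """Return the points earned in one RPS round: win = 2, draw = 1, loss = 0."""
    if my_move == opponent_move:
        return 1
    return 2 if my_move == BEATS[opponent_move] else 0
```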
The figures below illustrate the average points earned per round, along with
the 95% confidence interval, for each LLM when facing constant, 2-loop, and
3-loop strategies, whether the model generates a full strategy or one-shot
actions. The results show that the LLMs’ performance in action generation
against a constant strategy is only marginally better than a random strategy.
![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)
![Average Points Earned per Round Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_2loop.svg)
![Average Points Earned per Round Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/rps/rps_3loop.svg)
Even if <tt>Mixtral:8x7b</tt> and <tt>Mistral-Small</tt> accurately predict their
opponent’s move, they fail to integrate this belief into their decision-making
process. Only <tt>Llama3.3:latest</tt> is capable of inferring the opponent’s
behaviour to choose the winning move, as sketched below. In summary, generative
autonomous agents struggle to anticipate or effectively incorporate other
agents’ actions into their decision-making.
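A minimal sketch of what integrating such a belief into the decision would look like (hypothetical code under the same assumptions as above, not what the models generate): predict the opponent’s most frequent past move, then play the move that beats it.

```python
from collections import Counter

BEATS = {"Rock": "Paper", "Paper": "Scissors", "Scissors": "Rock"}

def play_round(opponent_history: list[str]) -> str:
    """Best-respond to the belief that the opponent will replay
    their most frequent past move."""
    if not opponent_history:
        return "Rock"  # no belief yet: arbitrary opening move
    predicted = Counter(opponent_history).most_common(1)[0][0]
    return BEATS[predicted]
```

Against a constant opponent, this procedure wins every round from the second one onwards; the results above suggest that only <tt>Llama3.3:latest</tt> reliably reproduces this behaviour.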
## Synthesis
Our findings reveal notable differences in the cognitive capabilities of LLMs
across multiple dimensions of decision-making.
<tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making,
with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.
<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences,
particularly when generating algorithmic strategies rather than isolated one-shot actions.
These models tend to struggle more with one-shot decision-making, where responses are less structured and
more prone to inconsistency. In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and
performs poorly in aligning actions with specified preferences.
<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behavior at both first- and second-order levels.
<tt>Llama3</tt>, although prone to random behavior when generating strategies, adapts more effectively in one-shot
decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly in both strategic and one-shot formats, rarely
exhibiting coherent rationality.
All models, regardless of size or architecture, struggle to anticipate or incorporate the behaviour of other agents
into their own decisions. Despite some being able to identify patterns,
most fail to translate these beliefs into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to
infer and act on opponents’ simple behaviour.
## Authors
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Path to the CSV file
CSV_FILE_PATH = "../../data/rps/rps.csv"
# Load the data
df = pd.read_csv(CSV_FILE_PATH)
# Convert necessary columns to appropriate types
df["idRound"] = df["idRound"].astype(int)
df["outcomeRound"] = df["outcomeRound"].astype(float)
# Opponent strategies with 2-loop behaviour (each alternates between two moves)
opponent_strategies = ["R-P", "P-S", "S-R"]
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning
df_filtered = df[df["opponentStrategy"].isin(opponent_strategies)].copy()
# Custom color palette for models
color_palette = {
'gpt-4.5-preview-2025-02-27': '#7abaff', # BlueEscape
'gpt-4.5-preview-2025-02-27 strategy': '#000037', # BlueHorizon
'llama3': '#32a68c', # vertAvenir
'mistral-small': '#ff6941', # orangeChaleureux
'mistral-small strategy': '#ffd24b', # yellow determined
'deepseek-r1': '#5862ed' # indigoInclusif
}
# Group by model and round number, compute mean and standard deviation
summary = df_filtered.groupby(["model", "idRound"]).agg(
mean_outcome=("outcomeRound", "mean"),
std_outcome=("outcomeRound", "std"),
count=("outcomeRound", "count")
).reset_index()
# Compute standard error (SEM)
summary["sem"] = summary["std_outcome"] / np.sqrt(summary["count"])
# Compute 95% confidence intervals
summary["ci_upper"] = summary["mean_outcome"] + (1.96 * summary["sem"])
summary["ci_lower"] = summary["mean_outcome"] - (1.96 * summary["sem"])
# Set the figure size
plt.figure(figsize=(10, 6))
# Loop through each model and plot its performance with confidence interval
for model in summary["model"].unique():
    df_model = summary[summary["model"] == model]
    # Plot the mean outcome per round for this model
    plt.plot(df_model["idRound"], df_model["mean_outcome"],
             label=model,
             color=color_palette.get(model, '#63656a'))  # fall back to gray for models missing from the palette
    # Shade the 95% confidence interval around the mean
    plt.fill_between(df_model["idRound"],
                     df_model["ci_lower"], df_model["ci_upper"],
                     color=color_palette.get(model, '#63656a'),
                     alpha=0.2)  # transparency keeps overlapping bands readable
# Add legends and labels
plt.xlim(1, 10)
plt.xlabel("Round Number")
plt.ylabel("Average Points Earned")
plt.title("Average Points Earned per Round Against 2-Loop Behaviour (95% CI)")
plt.legend()
plt.grid(True)
plt.ylim(0, 2) # Points are between 0 and 2
# Save the figure as an SVG file
plt.savefig('../../figures/rps/rps_2loop.svg', format='svg')