Commit 9c17189d authored by Maxime Morge

PyGAAMAS: Add Anonymous repository

parent baa10a40
......@@ -57,9 +57,8 @@ analyzing the frequency of the opponent’s past moves and selecting the less
common one. \texttt{Qwen3}, by contrast, relies on randomness, choosing moves
unpredictably while presuming the opponent will mirror its choice.
\texttt{Llama3} does not implement a functioning strategy. Overall, these
model-generated strategies are simplistic and heuristic-based, often lacking the
credibility and adaptability needed for effective play in adversarial settings
like MP.
model-generated strategies are simplistic and heuristic-driven, often lacking
the credibility and adaptability required to simulate human behavior.
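For illustration, a minimal Python sketch of the kind of heuristics these
models generate is given below; the move labels and function names are ours,
not the models' verbatim output.
\begin{verbatim}
import random
from collections import Counter

MOVES = ["Head", "Tail"]

def frequency_counter_move(opponent_history):
    # Frequency-based heuristic described above: play the move
    # the opponent has used less often so far.
    if not opponent_history:
        return random.choice(MOVES)
    counts = Counter(opponent_history)
    return min(MOVES, key=lambda m: counts.get(m, 0))

def random_move(opponent_history):
    # Randomness-based heuristic (Qwen3): ignore the history
    # and pick uniformly at random.
    return random.choice(MOVES)
\end{verbatim}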
Fig.~\ref{fig:mp_prediction_constant} (resp. Fig.~\ref{fig:mp_payoff_constant})
illustrates the average prediction accuracy (resp. the number of points earned)
......@@ -73,14 +72,14 @@ information into their action selection to choose the winning move.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_constant.pdf}
\caption{Prediction accuracy per round against a constant opponent strategy.}
\caption{Prediction accuracy per round against a constant strategy.}
\label{fig:mp_prediction_constant}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_constant.pdf}
\caption{Average points per round against a constant opponent strategy.}
\caption{Average points per round against a constant strategy.}
\label{fig:mp_payoff_constant}
\end{figure}
......@@ -93,13 +92,13 @@ alternating strategy, is barely better than a random strategy.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_altern.pdf}
\caption{Prediction accuracy per round against an alternating opponent strategy.}
\caption{Prediction accuracy per round against an alternating strategy.}
\label{fig:mp_prediction_altern}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_altern.pdf}
\caption{Average points per round against an alternating opponent strategy.}
\caption{Average points per round against an alternating strategy.}
\label{fig:mp_payoff_altern}
\end{figure}
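For reference, the two simulated opponents used in these experiments can be
summarised by the minimal Python sketch below; the move labels are
placeholders rather than the exact prompt wording.
\begin{verbatim}
MOVES = ["Head", "Tail"]

def constant_opponent(round_index, move="Head"):
    # Constant strategy: plays the same move in every round.
    return move

def alternating_opponent(round_index, first_move="Head"):
    # Alternating strategy: switches move every round.
    start = MOVES.index(first_move)
    return MOVES[(start + round_index) % 2]
\end{verbatim}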
\section{Conclusion}
\label{sec:conclusion}
In this paper, we evaluate the ability of generative agents to exhibit credible
behavior in social situations, adapt to their interlocutor, and coordinate with
them. \texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness,
but only \texttt{Mistral-Small} consistently exhibits sensitivity across
incentive environments. In contrast, \texttt{Llama3} and \texttt{Qwen3} display
rigid rationality, unconditional cooperation, or inconsistent and unstable
behavior. Unlike humans, all models, regardless of size or architecture, struggle
to exploit perceived regularities in the interlocutor’s behavior. Although some
models are able to detect patterns, most fail to translate these beliefs into
their own decisions. When it comes to coordination, most generative agents
struggle to align their actions in games with multiple equilibria. This failure
stems from a limited ability to model the opponent’s behavior accurately and
incorporate these beliefs into their practical reasoning. Although communication
is expected to improve coordination, it often introduces ambiguity instead:
models generate cooperative messages that are not followed by consistent
actions, leading to misaligned expectations and degraded coordination. Only
\texttt{Qwen3} shows reliable coordination behavior, swiftly incorporating
beliefs about the opponent’s strategy even without communication.
In this paper, we evaluated whether GAs can act in socially plausible ways,
align their strategies with others, and adapt dynamically to their environment.
\texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness, but only
\texttt{Mistral-Small} consistently exhibits sensitivity across incentive
environments. In contrast, \texttt{Llama3} and \texttt{Qwen3} display rigid
rationality, unconditional cooperation, or inconsistent and unstable behavior.
Unlike humans, all GAs, regardless of size or architecture, struggle to exploit
perceived regularities in the interlocutor’s behavior. Although some models are
able to detect patterns, most fail to translate these beliefs into their own
decisions. When it comes to coordination, most generative agents struggle to
align their actions in games with multiple equilibria. This failure stems from a
limited ability to model the opponent’s behavior accurately and incorporate
these beliefs into their practical reasoning. Although communication is expected
to improve coordination, it often introduces ambiguity instead: models generate
cooperative messages that are not followed by consistent actions, leading to
misaligned expectations and degraded coordination. Only \texttt{Qwen3} shows
reliable coordination behavior, swiftly incorporating beliefs about the
opponent’s strategy even without communication.
The key challenge for GAs lies in refining their beliefs and integrating them
effectively into decision-making to better adapt to their environment and
......
......@@ -46,12 +46,12 @@ outcome. This reflects bounded rationality, focal point reasoning, and a natural
bias toward coordination, even in the absence of explicit
signaling~\cite{cooper89rje}.
%To evaluate whether generative agents can coordinate effectively,
We use a repeated version of the BoS game. Each experiment consists of 10
rounds. In each round, the agent is required to predict the opponent’s next
move, earning $1$ point for a correct prediction and $0$ otherwise. This
prediction is then integrated into the agent’s decision-making process. To avoid
gender bias, we replace descriptive player labels and action labels with
letters. No model successfully produced a valid strategy.
We also use a repeated version of the BoS game. Each experiment consists of 10
rounds. In each round, the GA must predict the opponent’s next move, earning $1$
point for a correct prediction and $0$ otherwise. This prediction is then
integrated into the agent’s decision-making process. To avoid gender bias, we
replace descriptive player labels and action labels with letters. No model
successfully produced a valid strategy.
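A minimal Python sketch of this protocol is shown below; the
\texttt{predict}/\texttt{act} agent interface is a hypothetical placeholder
rather than the actual PyGAAMAS API.
\begin{verbatim}
def run_bos_episode(agent, opponent_strategy, n_rounds=10):
    # Repeated BoS protocol: each round, the agent first predicts the
    # opponent's next move (1 point if correct, 0 otherwise), and this
    # prediction is then passed to its own decision step.
    prediction_points = 0
    history = []  # list of (agent_action, opponent_action) pairs
    for t in range(n_rounds):
        opponent_action = opponent_strategy(t, history)
        prediction = agent.predict(history)
        prediction_points += int(prediction == opponent_action)
        agent_action = agent.act(history, prediction)
        history.append((agent_action, opponent_action))
    return prediction_points, history
\end{verbatim}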
% We consider two coordination contexts: in the first, agents interact
% with a simulated human strategy that follows a fixed behavioral pattern; in the
......@@ -66,9 +66,9 @@ hidden alternating strategy.
% simplified strategy: alternating between the two options.
Fig.~\ref{fig:bos_prediction} (resp. Fig.~\ref{fig:bos_payoff}) illustrates the
average prediction accuracy (resp. number of points earned) per round for each
model. Models consistently failed to predict the opponent’s next move and
model. GAs consistently failed to predict the opponent’s next move and
coordinate effectively. This is mainly because they did not recognize the
opponent’s looping behavior. Instead, they assume the opponent is reactive,
opponent’s alternating behavior. Instead, they assumed the opponent was reactive,
random, or goal-directed, overcomplicating a simple repeating strategy. As a
result, they tried to predict rational behavior instead of adapting to the
actual pattern.
......@@ -76,14 +76,14 @@ actual pattern.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/bos/bos_prediction.pdf}
\caption{Prediction accuracy with a fixed strategy.}
\caption{Prediction accuracy against a fixed strategy.}
\label{fig:bos_prediction}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/bos/bos_payoff.pdf}
\caption{Average points in the coordination task with a fixed strategy.}
\caption{Average points per round against a fixed strategy.}
\label{fig:bos_payoff}
\end{figure}
......@@ -116,14 +116,13 @@ making it harder to coordinate.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_prediction.pdf}
\caption{Prediction accuracy in the Agent-Agent
coordination task.}
\caption{Prediction accuracy against a GA.}
\label{fig:nbos_prediction}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_payoff.pdf}
\caption{Average points in the Agent-Agent coordination task.}
\caption{Average points per round against a GA.}
\label{fig:nbos_payoff}
\end{figure}
......@@ -82,7 +82,7 @@ action generation by various models\footnote{N/A indicates that the model failed
\textbf{\texttt{D}} & (5, 0) & (1, 1) & (10, 1) & (2, 2) & (3, 1) & (2, 2) & (8, -3) & (2, 2) \\
\bottomrule
\end{tabular}
\caption{Payoff matrices for different versions of the PD.}
\caption{Payoff matrices for different variants of the PD.}
\label{tab:pd_payoffs}
\end{table*}
......@@ -114,7 +114,7 @@ action generation by various models\footnote{N/A indicates that the model failed
& $\top$ & 0.10 & 0.13 & 0.10 & 0.00 & 0.03 & 0.10 & 0.03 & 0.11 & 0.10 & 0.00 & 0.07 & 0.03 \\
\bottomrule
\end{tabular}
\caption{Cooperation rates across different settings and versions of the PD.}
\caption{Cooperation rates across different settings and variants of the PD.}
\label{tab:model_pd_behavior}
\end{table*}
......@@ -132,15 +132,15 @@ show the payoff sensitivity expected from human-like reasoning.
it defects under the Rational prompt in high-risk or high-reward variants and
cooperates more under the Human prompt, it also modulates cooperation rates in
response to payoffs, especially under the Human role. For example, cooperation
decreases slightly in the Cooperation Loss condition, suggesting some
recognition of the increased risk of being exploited. Additionally, it is mostly
robust to anonymization.
slightly decreases under the Cooperation Loss scenario, suggesting some
recognition of the increased risk of being exploited. Additionally, it is
largely unaffected by anonymization.
In contrast, \texttt{Llama3} cooperates across all conditions and prompts,
indicating a failure to internalize role differences or payoff structures. This
model appears biased toward cooperation, likely due to training data priors,
rather than engaging in context-sensitive reasoning. Conversely, \texttt{Qwen3}
exhibits the opposite failure mode: it is overly rigid, rarely cooperating even
under Human prompts, and shows erratic drops in cooperation under anonymization,
suggesting semantic overreliance and poor role alignment.
indicating a failure to internalize role differences or payoff structures. The
model exhibits a strong predisposition to cooperate, regardless of context,
likely due to training data priors. Conversely, \texttt{Qwen3} exhibits the
opposite failure mode: it is overly rigid, rarely cooperating even under Human
prompts, and shows erratic drops in cooperation under anonymization, suggesting
semantic overreliance and poor role alignment.
......@@ -30,7 +30,7 @@ this study assesses the capabilities of models such as
We focus on their ability to
make credible one-shot decisions, generate human-like strategies, adapt to their
environment, and coordinate in social interactions\footnote{All code, prompts,
and data traces will be available in a public repository.}.
and data traces are available in a public repository~\cite{pygaamas}.}.
%All code, prompts,
% and data traces are available in a public repository~\cite{pygaamas}.
%These capabilities are evaluated through a series of
......@@ -55,7 +55,7 @@ credible behavior simulating human-like decision-making in Sec.~\ref{sec:human}.
Sec.~\ref{sec:belief} examines the ability of GAs to refine their beliefs about
an opponent's next move and to integrate these predictions into their
decision-making, while Sec.~\ref{sec:coordination} investigates how they
coordinate with other agents. The paper concludes in Sec.~\ref{sec:conclusion}.
coordinate. The paper concludes in Sec.~\ref{sec:conclusion}.
%where we summarize the main contributions and
%propose directions for future research.
@Misc{pygaamas,
author = {St\'ephane Bonnevay and Maxime Morge},
author = {Anonymous},
title = {Python Generative Autonomous Agents and Multi-Agent Systems},
howpublished = {https://gitlab.liris.cnrs.fr/mmorge/pygaamas},
howpublished = {https://zenodo.org/records/15608944},
year = {2025}
}
......@@ -302,7 +302,7 @@ doi = {10.1177/1043463195007001004}
}
@misc{hua24arxiv,
title={Game-theoretic LLM: Agent Workflow for Negotiation Games},
title={{Game-theoretic LLM: Agent Workflow for Negotiation Games}},
author={Wenyue Hua and Ollie Liu and Lingyao Li and Alfonso Amayuelas and
Julie Chen and Lucas Jiang and Mingyu Jin and Lizhou Fan and
Fei Sun and William Wang and Xintong Wang and Yongfeng Zhang},
......
......@@ -37,11 +37,11 @@ Mail}
% in the abstract
\begin{abstract}
Recent advances in Large Language Models (LLMs) have enabled the creation of
Generative Agents (GAs) capable of autonomous decision-making in interactive
settings. This paper investigates whether GAs can exhibit socially credible
Generative Agents (GAs) capable of autonomous decision-making in interaction.
This paper investigates whether GAs can exhibit socially credible
behavior. %, with a particular focus on their ability to coordinate.
Drawing from behavioral game theory, we evaluate five state-of-the-art models
across three canonical game-theoretic environments. Our results show that
across three canonical game-theoretic environments. Our results show that,
while some GAs can accurately predict their opponent’s behavior, few are able
to incorporate those predictions into decision-making. These behavioral flaws
help explain why coordination remains especially challenging: most models
......
......@@ -52,8 +52,12 @@ lacking humans’ sensitivity to incentives.
% behavior, thereby lacking the sensitivity to incentives that is characteristic
% of human-like reasoning.
While Morge~\cite{morge25paams} evaluates GAs on economic rationality and
strategic reasoning, we focus on their ability to make credible one-shot
decisions, generate human-like strategies, adapt to their environment, and
coordinate in social interactions.
Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether agents understand
Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether GAs understand
game rules and history ex post, but not whether this informs their decisions. We
instead evaluate whether agents explicitly incorporate beliefs and opponent modeling
into their strategies.
......@@ -71,7 +75,7 @@ failing, for instance, to adopt basic conventions such as alternation in the
Battle of the Sexes game. To address this, they propose prompting agents to
imagine possible actions and their consequences before deciding. However, this
conditional reasoning proves effective mainly for smaller models and may degrade
performance in larger ones due to added complexity. %~\cite{pygaamas}
performance in larger ones due to added complexity~\cite{pygaamas}.
While Akata \textit{et al.} attribute these failures to limited predictive
ability and a tendency to rigidly favor preferred options, we argue that the
most fundamental cause is GAs' inability to incorporate their beliefs into the
......@@ -112,10 +116,6 @@ models that can run on standard hardware. %~\cite{pygaamas}
% prompting LLMs to generate algorithmic strategies, as in~\cite{willis25arxiv},
% rather than issuing multiple one-shot queries.
While Morge~\cite{morge25paams} evaluates GAs
on economic rationality and strategic reasoning, we focus on their ability to
make credible one-shot decisions, generate human-like strategies, adapt to their
environment, and coordinate in social interactions.
Hua \textit{et al.}~\cite{hua24arxiv} show that GAs deviate from rationality as
game complexity increases, and highlight the role of communication in fostering
......