diff --git a/doc/paper/ICTAI25/belief.tex b/doc/paper/ICTAI25/belief.tex
index a27a3cde212a1b74f02d7169e3d7867657801bc8..6f8263cf8c6fdb0826ffbd7ef4d4a7f68cdabab3 100644
--- a/doc/paper/ICTAI25/belief.tex
+++ b/doc/paper/ICTAI25/belief.tex
@@ -57,9 +57,8 @@ analyzing the frequency of the opponent’s past moves and selecting the less
 common one. \texttt{Qwen3}, by contrast, relies on randomness, choosing moves
 unpredictably while presuming the opponent will mirror its choice.
 \texttt{LLama3} does not implement a functioning strategy. Overall, these
-model-generated strategies are simplistic and heuristic-based, often lacking the
-credibility and adaptability needed for effective play in adversarial settings
-like MP.
+model-generated strategies are simplistic and heuristic-driven, often lacking
+the credibility and adaptability required to simulate human behavior.
 
 Fig.~\ref{fig:mp_prediction_constant} (resp. Fig.~\ref{fig:mp_payoff_constant})
 illustrates the average prediction accuracy (resp. the number of points earned)
@@ -73,14 +72,14 @@ information into their action selection to choose the winning move.
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_constant.pdf}
-  \caption{Prediction accuracy per round against a constant opponent strategy.}
+  \caption{Prediction accuracy per round against a constant strategy.}
   \label{fig:mp_prediction_constant}
 \end{figure}
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_constant.pdf}
-  \caption{Average points per round against a constant opponent strategy.}
+  \caption{Average points per round against a constant strategy.}
   \label{fig:mp_payoff_constant}
 \end{figure}
@@ -93,13 +92,13 @@ alternating strategy, is barely better than a random strategy.
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_altern.pdf}
-  \caption{Prediction accuracy per round against an alternating opponent strategy.}
+  \caption{Prediction accuracy per round against an alternating strategy.}
   \label{fig:mp_prediction_altern}
 \end{figure}
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_altern.pdf}
-  \caption{Average points per round against an alternating opponent strategy.}
+  \caption{Average points per round against an alternating strategy.}
   \label{fig:mp_payoff_altern}
 \end{figure}
diff --git a/doc/paper/ICTAI25/conclusion.tex b/doc/paper/ICTAI25/conclusion.tex
index 1668100dfe5e9c2cadb3a900fc1c64626e4b4cc6..fd8002769cc7aff43fa8fec1307ab0edf658ff92 100644
--- a/doc/paper/ICTAI25/conclusion.tex
+++ b/doc/paper/ICTAI25/conclusion.tex
@@ -1,24 +1,24 @@
 \section{Conclusion}
 \label{sec:conclusion}
 
-In this paper, we evaluate the ability of generative agents to exhibit credible
-behavior in social situations, adapt to their interlocutor, and coordinate with
-them. \texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness,
-but only \texttt{Mistral-Small} consistently exhibits sensitivity across
-incentive environments. In contrast, \texttt{LLama3} and \texttt{Qwen3} display
-rigid rationality, unconditional cooperation, or inconsistent and unstable
-behavior. Unlike human, all models, regardless of size or architecture, struggle
-to exploit perceived regularities in the interlocutor’s behavior. Although some
-models are able to detect patterns, most fail to translate these beliefs into
-their own decisions. When it comes to coordination, most generative agents
-struggle to align their actions in games with multiple equilibria. This failure
-stems from a limited ability to model the opponent’s behavior accurately and
-incorporate theses beliefs in their practical reasoning. Although communication
-is expected to improve coordination, it often introduces ambiguity instead:
-models generate cooperative messages that are not followed by consistent
-actions, leading to misaligned expectations and degraded coordination. Only
-\texttt{Qwen3} shows reliable coordination behavior, swiftly incorporating
-beliefs about the opponent’s strategy even without communication.
+In this paper, we evaluated whether GAs can act in socially plausible ways,
+align their strategies with others, and adapt dynamically to their environment.
+\texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness, but only
+\texttt{Mistral-Small} consistently exhibits sensitivity across incentive
+environments. In contrast, \texttt{LLama3} and \texttt{Qwen3} display rigid
+rationality, unconditional cooperation, or inconsistent and unstable behavior.
+Unlike humans, all GAs, regardless of size or architecture, struggle to exploit
+perceived regularities in the interlocutor’s behavior. Although some models are
+able to detect patterns, most fail to translate these beliefs into their own
+decisions. When it comes to coordination, most generative agents struggle to
+align their actions in games with multiple equilibria. This failure stems from a
+limited ability to model the opponent’s behavior accurately and incorporate
+these beliefs in their practical reasoning. Although communication is expected
+to improve coordination, it often introduces ambiguity instead: models generate
+cooperative messages that are not followed by consistent actions, leading to
+misaligned expectations and degraded coordination. Only \texttt{Qwen3} shows
+reliable coordination behavior, swiftly incorporating beliefs about the
+opponent’s strategy even without communication.
 
 The key challenge for GAs lies in refining their beliefs and integrating them
 effectively into decision-making to better adapt to their environment and
diff --git a/doc/paper/ICTAI25/coordination.tex b/doc/paper/ICTAI25/coordination.tex
index 58945cd902e0530a1e43ddc37c9bd43ade247b25..4d85bed3d73c1cce453d6a06ffab2957d1703671 100644
--- a/doc/paper/ICTAI25/coordination.tex
+++ b/doc/paper/ICTAI25/coordination.tex
@@ -46,12 +46,12 @@ outcome. This reflects bounded rationality, focal point reasoning, and a natural
 bias toward coordination, even in the absence of explicit
 signaling~\cite{cooper89rje}.
 %To evaluate whether generative agents can coordinate effectively,
-We use a repeated version of the BoS game. Each experiment consists of 10
-rounds. In each round, the agent is required to predict the opponent’s next
-move, earning $1$ point for a correct prediction and $0$ otherwise. This
-prediction is then integrated into the agent’s decision-making process. To avoid
-gender biais, we replace descriptive player labels and action labels with
-letters. No model successfully produced a valid strategy.
+We also use a repeated version of the BoS game. Each experiment consists of 10
+rounds. In each round, the GA must predict the opponent’s next move, earning $1$
+point for a correct prediction and $0$ otherwise. This prediction is then
+integrated into the agent’s decision-making process. To avoid gender bias, we
+replace descriptive player labels and action labels with letters. No model
+successfully produced a valid strategy.
 
 % We consider two coordination contexts: in the first, agents interact
 % with a simulated human strategy that follows a fixed behavioral pattern; in the
@@ -66,9 +66,9 @@ hidden alternating strategy.
 % simplified strategy: alternating between the two options.
 Fig.~\ref{fig:bos_prediction} (resp. Fig.~\ref{fig:bos_payoff}) illustrates the
 average prediction accuracy (resp. number of points earned) per round for each
-model. Models consistently failed to predict the opponent’s next move and
+model. GAs consistently failed to predict the opponent’s next move and
 coordinate effectively. This is mainly because they did not recognize the
-opponent’s looping behavior. Instead, they assume the opponent is reactive,
+opponent’s alternating behavior. Instead, they assume the opponent is reactive,
 random, or goal-directed, overcomplicating a simple repeating strategy. As a
 result, they tried to predict rational behavior instead of adapting to the
 actual pattern.
@@ -76,14 +76,14 @@ actual pattern.
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/bos/bos_prediction.pdf}
-  \caption{Prediction accuracy with a fixed strategy.}
+  \caption{Prediction accuracy against a fixed strategy.}
   \label{fig:bos_prediction}
 \end{figure}
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/bos/bos_payoff.pdf}
-  \caption{Average points in the coordination task with a fixed strategy.}
+  \caption{Average points per round against a fixed strategy.}
   \label{fig:bos_payoff}
 \end{figure}
@@ -116,14 +116,13 @@ making it harder to coordinate.
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_prediction.pdf}
-  \caption{Prediction accuracy in the Agent-Agent
-    coordination task.}
+  \caption{Prediction accuracy against a GA.}
   \label{fig:nbos_prediction}
 \end{figure}
 \begin{figure}[htbp]
   \centering
   \includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_payoff.pdf}
-  \caption{Average points in the Agent-Agent coordination task.}
+  \caption{Average points per round against a GA.}
   \label{fig:nbos_payoff}
 \end{figure}
diff --git a/doc/paper/ICTAI25/human.tex b/doc/paper/ICTAI25/human.tex
index 832346cb43505a6b3782d8920a24b6f65e729382..8b571ccf0e1c00a55e5e3a950c5e7ba2907a5835 100644
--- a/doc/paper/ICTAI25/human.tex
+++ b/doc/paper/ICTAI25/human.tex
@@ -82,7 +82,7 @@ action generation by various models\footnote{N/A indicates that the model failed
 \textbf{\texttt{D}} & (5, 0) & (1, 1) & (10, 1) & (2, 2) & (3, 1) & (2, 2) & (8, -3) & (2, 2) \\
 \bottomrule
 \end{tabular}
-\caption{Payoff matrices for different versions of the PD.}
+\caption{Payoff matrices for different variants of the PD.}
 \label{tab:pd_payoffs}
 \end{table*}
 
@@ -114,7 +114,7 @@ action generation by various models\footnote{N/A indicates that the model failed
 & $\top$ & 0.10 & 0.13 & 0.10 & 0.00 & 0.03 & 0.10 & 0.03 & 0.11 & 0.10 & 0.00 & 0.07 & 0.03 \\
 \bottomrule
 \end{tabular}
-\caption{Cooperation rates across different settings and versions of the PD.}
+\caption{Cooperation rates across different settings and variants of the PD.}
 \label{tab:model_pd_behavior}
 \end{table*}
 
@@ -132,15 +132,15 @@ show the payoff sensitivity expected from human-like reasoning.
 it defects under the Rational prompt in high-risk or high-reward variants and
 cooperates more under the Human prompt, it also modulates cooperation rates in
 response to payoffs, especially under the Human role. For example, cooperation
-decreases slightly in the Cooperation Loss condition, suggesting some
-recognition of the increased risk of being exploited. Additionally, it is mostly
-robust to anonymization.
+slightly decreases under the Cooperation Loss scenario, suggesting some
+recognition of the increased risk of being exploited. Additionally, it is
+largely unaffected by anonymization.
 
 In contrast, \texttt{Llama3} cooperates across all conditions and prompts,
-indicating a failure to internalize role differences or payoff structures. This
-model appears biased toward cooperation, likely due to training data priors,
-rather than engaging in context-sensitive reasoning. Conversely, \texttt{Qwen3}
-exhibits the opposite failure mode: it is overly rigid, rarely cooperating even
-under Human prompts, and shows erratic drops in cooperation under anonymization,
-suggesting semantic overreliance and poor role alignment.
+indicating a failure to internalize role differences or payoff structures. The
+model exhibits a strong predisposition to cooperate, regardless of context,
+likely due to training data priors. Conversely, \texttt{Qwen3} exhibits the
+opposite failure mode: it is overly rigid, rarely cooperating even under Human
+prompts, and shows erratic drops in cooperation under anonymization, suggesting
+semantic overreliance and poor role alignment.
 
diff --git a/doc/paper/ICTAI25/introduction.tex b/doc/paper/ICTAI25/introduction.tex
index 4df9819c38ea367c38cf60e4a71c4b918985aa7e..53f4df4641c50e4bc7485e8a653b4520d47d2457 100644
--- a/doc/paper/ICTAI25/introduction.tex
+++ b/doc/paper/ICTAI25/introduction.tex
@@ -30,7 +30,7 @@ this study assesses the capabilities of models such as
 We focus on their ability to make credible one-shot decisions, generate
 human-like strategies, adapt to their environment, and coordinate in social
 interactions\footnote{All code, prompts,
-  and data traces will be available in a public repository.}.
+and data traces are available in a public repository~\cite{pygaamas}.}.
 %All code, prompts,
 % and data traces are available in a public repository~\cite{pygaamas}.
 %These capabilities are evaluated through a series of
@@ -55,7 +55,7 @@ credible behavior simulating human-like decision-making in
 Sec.~\ref{sec:human}. Sec.~\ref{sec:belief} examine the ability of GAs to refine
 their beliefs about an opponent's next move and to integrate these predictions
 into their decision-making, while Sec.~\ref{sec:coordination} investigates how they
-coordinate with other agents. The paper concludes in Sec.~\ref{sec:conclusion}.
+coordinate. The paper concludes in Sec.~\ref{sec:conclusion}.
 %where we summarize the main contributions and
 %propose directions for future research.
diff --git a/doc/paper/ICTAI25/morge25ictai.bib b/doc/paper/ICTAI25/morge25ictai.bib
index 1ef20cf7db57a69739c34f2f529e6c9997d01700..f7ac7ab867253b4d70ffa9243c0a5f3087ca3e1c 100644
--- a/doc/paper/ICTAI25/morge25ictai.bib
+++ b/doc/paper/ICTAI25/morge25ictai.bib
@@ -1,7 +1,7 @@
 @Misc{pygaamas,
-  author = {St\'ephane Bonnevay and Maxime Morge},
+  author = {Anonymous},
   title = {Python Generative Autonomous Agents and Multi-Agent Systems},
-  howpublished = {https://gitlab.liris.cnrs.fr/mmorge/pygaamas},
+  howpublished = {https://zenodo.org/records/15608944},
   year = {2025}
 }
 
@@ -302,7 +302,7 @@ doi = {10.1177/1043463195007001004}
 }
 @misc{hua24arxiv,
-      title={Game-theoretic LLM: Agent Workflow for Negotiation Games},
+      title={{Game-theoretic LLM: Agent Workflow for Negotiation Games}},
       author={Wenyue Hua and Ollie Liu and Lingyao Li and Alfonso Amayuelas and Julie Chen and Lucas Jiang and Mingyu Jin and Lizhou Fan and Fei Sun and William Wang and Xintong Wang and Yongfeng Zhang},
diff --git a/doc/paper/ICTAI25/morge25ictai.pdf b/doc/paper/ICTAI25/morge25ictai.pdf
index 72881f457581c628c99ce3481df2b0f41d34a4cc..246d446c6eb7a08a89800830e90f8b4bc1f23ce0 100644
Binary files a/doc/paper/ICTAI25/morge25ictai.pdf and b/doc/paper/ICTAI25/morge25ictai.pdf differ
diff --git a/doc/paper/ICTAI25/morge25ictai.tex b/doc/paper/ICTAI25/morge25ictai.tex
index 88f87c05c5e5ecc7d4854b1594fa6f9f9e8a5cab..186c314449e445150a300e0c3c11d788bcda23a2 100644
--- a/doc/paper/ICTAI25/morge25ictai.tex
+++ b/doc/paper/ICTAI25/morge25ictai.tex
@@ -37,11 +37,11 @@ Mail}
 % in the abstract
 \begin{abstract}
   Recent advances in Large Language Models (LLMs) have enabled the creation of
-  Generative Agents (GAs) capable of autonomous decision-making in interactive
-  settings. This paper investigates whether GAs can exhibit socially credible
+  Generative Agents (GAs) capable of autonomous decision-making in interaction.
+  This paper investigates whether GAs can exhibit socially credible
   behavior. %, with a particular focus on their ability to coordinate.
   Drawing from behavioral game theory, we evaluate five state-of-the-art models
-  across three canonical game-theoretic environments. Our results show that
+  across three canonical game-theoretic environments. Our results show that,
   while some GAs can accurately predict their opponent’s behavior, few are able
   to incorporate those predictions into decision-making. These behavioral flaws
   help explain why coordination remains especially challenging: most models
diff --git a/doc/paper/ICTAI25/related.tex b/doc/paper/ICTAI25/related.tex
index bc3c972afad9b176ddf0cb0a32dd81d047b28716..756120041159565c8f061a21675c54016b96700a 100644
--- a/doc/paper/ICTAI25/related.tex
+++ b/doc/paper/ICTAI25/related.tex
@@ -52,8 +52,12 @@ lacking humans’ sensitivity to incentives.
 % behavior, thereby lacking the sensitivity to incentives that is characteristic
 % of human-like reasoning.
 
+While Morge~\cite{morge25paams} evaluates GAs on economic rationality and
+strategic reasoning, we focus on their ability to make credible one-shot
+decisions, generate human-like strategies, adapt to their environment, and
+coordinate in social interactions.
 
-Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether agents understand
+Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether GAs understand
 game rules and history ex post, but not whether this informs their decisions. We
 instead evaluate if agents explicitly incorporate beliefs and opponent modeling
 into their strategies.
@@ -71,7 +75,7 @@ failing, for instance, to adopt basic conventions such as alternation in the Battle of the Sexes game. To address this, they propose prompting agents to imagine possible actions and their consequences before deciding. However, this conditional reasoning proves effective mainly for smaller models and may degrade -performance in larger ones due to added complexity. %~\cite{pygaamas} +performance in larger ones due to added complexity~\cite{pygaamas}. While Akata \textit{et al.} attribute these failures to limited predictive ability and a tendency to rigidly favor preferred options, we argue that the most fundamental cause is GAs' inability to incorporate their beliefs into the @@ -112,10 +116,6 @@ models that can run on standard hardware. %~\cite{pygaamas} % prompting LLMs to generate algorithmic strategies, as in~\cite{willis25arxiv}, % rather than issuing multiple one-shot queries. -While Morge~\cite{morge25paams} evaluates GAs -on economic rationality and strategic reasoning, we focus on their ability to -make credible one-shot decisions, generate human-like strategies, adapt to their -environment, and coordinate in social interactions. Hua \textit{et al.}~\cite{hua24arxiv} show that GAs deviate from rationality as game complexity increases, and highlight the role of communication in fostering