Commit 9c17189d authored by Maxime Morge

PyGAAMAS: Add Anonymous repository

parent baa10a40
......@@ -57,9 +57,8 @@ analyzing the frequency of the opponent’s past moves and selecting the less
common one. \texttt{Qwen3}, by contrast, relies on randomness, choosing moves
unpredictably while presuming the opponent will mirror its choice.
\texttt{Llama3} does not implement a functioning strategy. Overall, these
model-generated strategies are simplistic and heuristic-based, often lacking the
credibility and adaptability needed for effective play in adversarial settings
like MP.
model-generated strategies are simplistic and heuristic-driven, often lacking
the credibility and adaptability required to simulate human behavior.
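For illustration, a minimal Python sketch of the kind of heuristics these
models generate is given below; the move labels and function names are ours,
not the models' verbatim output.
\begin{verbatim}
import random
from collections import Counter

MOVES = ["Head", "Tail"]

def frequency_counter_move(opponent_history):
    # Frequency-based heuristic described above: play the move
    # the opponent has used less often so far.
    if not opponent_history:
        return random.choice(MOVES)
    counts = Counter(opponent_history)
    return min(MOVES, key=lambda m: counts.get(m, 0))

def random_move(opponent_history):
    # Randomness-based heuristic (Qwen3): ignore the history
    # and pick uniformly at random.
    return random.choice(MOVES)
\end{verbatim}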
Fig.~\ref{fig:mp_prediction_constant} (resp. Fig.~\ref{fig:mp_payoff_constant})
illustrates the average prediction accuracy (resp. the number of points earned)
......@@ -73,14 +72,14 @@ information into their action selection to choose the winning move.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_constant.pdf}
\caption{Prediction accuracy per round against a constant opponent strategy.}
\caption{Prediction accuracy per round against a constant strategy.}
\label{fig:mp_prediction_constant}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_constant.pdf}
\caption{Average points per round against a constant opponent strategy.}
\caption{Average points per round against a constant strategy.}
\label{fig:mp_payoff_constant}
\end{figure}
......@@ -93,13 +92,13 @@ alternating strategy, is barely better than a random strategy.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_prediction_altern.pdf}
\caption{Prediction accuracy per round against an alternating opponent strategy.}
\caption{Prediction accuracy per round against an alternating strategy.}
\label{fig:mp_prediction_altern}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/mp/mp_payoff_altern.pdf}
\caption{Average points per round against an alternating opponent strategy.}
\caption{Average points per round against an alternating strategy.}
\label{fig:mp_payoff_altern}
\end{figure}
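For reference, the two simulated opponents used in these experiments can be
summarised by the minimal Python sketch below; the move labels are
placeholders rather than the exact prompt wording.
\begin{verbatim}
MOVES = ["Head", "Tail"]

def constant_opponent(round_index, move="Head"):
    # Constant strategy: plays the same move in every round.
    return move

def alternating_opponent(round_index, first_move="Head"):
    # Alternating strategy: switches move every round.
    start = MOVES.index(first_move)
    return MOVES[(start + round_index) % 2]
\end{verbatim}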
\section{Conclusion}
\label{sec:conclusion}
In this paper, we evaluate the ability of generative agents to exhibit credible
behavior in social situations, adapt to their interlocutor, and coordinate with
them. \texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness,
but only \texttt{Mistral-Small} consistently exhibits sensitivity across
incentive environments. In contrast, \texttt{Llama3} and \texttt{Qwen3} display
rigid rationality, unconditional cooperation, or inconsistent and unstable
behavior. Unlike humans, all models, regardless of size or architecture, struggle
to exploit perceived regularities in the interlocutor’s behavior. Although some
models are able to detect patterns, most fail to translate these beliefs into
their own decisions. When it comes to coordination, most generative agents
struggle to align their actions in games with multiple equilibria. This failure
stems from a limited ability to model the opponent’s behavior accurately and
incorporate these beliefs into their practical reasoning. Although communication
is expected to improve coordination, it often introduces ambiguity instead:
models generate cooperative messages that are not followed by consistent
actions, leading to misaligned expectations and degraded coordination. Only
\texttt{Qwen3} shows reliable coordination behavior, swiftly incorporating
beliefs about the opponent’s strategy even without communication.
In this paper, we evaluated whether GAs can act in socially plausible ways,
align their strategies with others, and adapt dynamically to their environment.
\texttt{GPT-4.5} and \texttt{Mistral-Small} demonstrate human-likeness, but only
\texttt{Mistral-Small} consistently exhibits sensitivity across incentive
environments. In contrast, \texttt{Llama3} and \texttt{Qwen3} display rigid
rationality, unconditional cooperation, or inconsistent and unstable behavior.
Unlike humans, all GAs, regardless of size or architecture, struggle to exploit
perceived regularities in the interlocutor’s behavior. Although some models are
able to detect patterns, most fail to translate these beliefs into their own
decisions. When it comes to coordination, most generative agents struggle to
align their actions in games with multiple equilibria. This failure stems from a
limited ability to model the opponent’s behavior accurately and incorporate
these beliefs into their practical reasoning. Although communication is expected
to improve coordination, it often introduces ambiguity instead: models generate
cooperative messages that are not followed by consistent actions, leading to
misaligned expectations and degraded coordination. Only \texttt{Qwen3} shows
reliable coordination behavior, swiftly incorporating beliefs about the
opponent’s strategy even without communication.
The key challenge for GAs lies in refining their beliefs and integrating them
effectively into decision-making to better adapt to their environment and
......
......@@ -46,12 +46,12 @@ outcome. This reflects bounded rationality, focal point reasoning, and a natural
bias toward coordination, even in the absence of explicit
signaling~\cite{cooper89rje}.
%To evaluate whether generative agents can coordinate effectively,
We use a repeated version of the BoS game. Each experiment consists of 10
rounds. In each round, the agent is required to predict the opponent’s next
move, earning $1$ point for a correct prediction and $0$ otherwise. This
prediction is then integrated into the agent’s decision-making process. To avoid
gender bias, we replace descriptive player labels and action labels with
letters. No model successfully produced a valid strategy.
We also use a repeated version of the BoS game. Each experiment consists of 10
rounds. In each round, the GA must predict the opponent’s next move, earning $1$
point for a correct prediction and $0$ otherwise. This prediction is then
integrated into the agent’s decision-making process. To avoid gender bias, we
replace descriptive player labels and action labels with letters. No model
successfully produced a valid strategy.
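A minimal Python sketch of this protocol is shown below; the
\texttt{predict}/\texttt{act} agent interface is a hypothetical placeholder
rather than the actual PyGAAMAS API.
\begin{verbatim}
def run_bos_episode(agent, opponent_strategy, n_rounds=10):
    # Repeated BoS protocol: each round, the agent first predicts the
    # opponent's next move (1 point if correct, 0 otherwise), and this
    # prediction is then passed to its own decision step.
    prediction_points = 0
    history = []  # list of (agent_action, opponent_action) pairs
    for t in range(n_rounds):
        opponent_action = opponent_strategy(t, history)
        prediction = agent.predict(history)
        prediction_points += int(prediction == opponent_action)
        agent_action = agent.act(history, prediction)
        history.append((agent_action, opponent_action))
    return prediction_points, history
\end{verbatim}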
% We consider two coordination contexts: in the first, agents interact
% with a simulated human strategy that follows a fixed behavioral pattern; in the
......@@ -66,9 +66,9 @@ hidden alternating strategy.
% simplified strategy: alternating between the two options.
Fig.~\ref{fig:bos_prediction} (resp. Fig.~\ref{fig:bos_payoff}) illustrates the
average prediction accuracy (resp. number of points earned) per round for each
model. Models consistently failed to predict the opponent’s next move and
model. GAs consistently failed to predict the opponent’s next move and
coordinate effectively. This is mainly because they did not recognize the
opponent’s looping behavior. Instead, they assume the opponent is reactive,
opponent’s alternating behavior. Instead, they assumed the opponent was reactive,
random, or goal-directed, overcomplicating a simple repeating strategy. As a
result, they tried to predict rational behavior instead of adapting to the
actual pattern.
......@@ -76,14 +76,14 @@ actual pattern.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/bos/bos_prediction.pdf}
\caption{Prediction accuracy with a fixed strategy.}
\caption{Prediction accuracy against a fixed strategy.}
\label{fig:bos_prediction}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/bos/bos_payoff.pdf}
\caption{Average points in the coordination task with a fixed strategy.}
\caption{Average points per round against a fixed strategy.}
\label{fig:bos_payoff}
\end{figure}
......@@ -116,14 +116,13 @@ making it harder to coordinate.
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_prediction.pdf}
\caption{Prediction accuracy in the Agent-Agent
coordination task.}
\caption{Prediction accuracy against a GA.}
\label{fig:nbos_prediction}
\end{figure}
\begin{figure}[htbp]
\centering
\includegraphics[width=\columnwidth]{figures/figures/nbos/nbos_payoff.pdf}
\caption{Average points in the Agent-Agent coordination task.}
\caption{Average points per round against a GA.}
\label{fig:nbos_payoff}
\end{figure}
......@@ -82,7 +82,7 @@ action generation by various models\footnote{N/A indicates that the model failed
\textbf{\texttt{D}} & (5, 0) & (1, 1) & (10, 1) & (2, 2) & (3, 1) & (2, 2) & (8, -3) & (2, 2) \\
\bottomrule
\end{tabular}
\caption{Payoff matrices for different versions of the PD.}
\caption{Payoff matrices for different variants of the PD.}
\label{tab:pd_payoffs}
\end{table*}
......@@ -114,7 +114,7 @@ action generation by various models\footnote{N/A indicates that the model failed
& $\top$ & 0.10 & 0.13 & 0.10 & 0.00 & 0.03 & 0.10 & 0.03 & 0.11 & 0.10 & 0.00 & 0.07 & 0.03 \\
\bottomrule
\end{tabular}
\caption{Cooperation rates across different settings and versions of the PD.}
\caption{Cooperation rates across different settings and variants of the PD.}
\label{tab:model_pd_behavior}
\end{table*}
......@@ -132,15 +132,15 @@ show the payoff sensitivity expected from human-like reasoning.
it defects under the Rational prompt in high-risk or high-reward variants and
cooperates more under the Human prompt, it also modulates cooperation rates in
response to payoffs, especially under the Human role. For example, cooperation
decreases slightly in the Cooperation Loss condition, suggesting some
recognition of the increased risk of being exploited. Additionally, it is mostly
robust to anonymization.
slightly decreases under the Cooperation Loss scenario, suggesting some
recognition of the increased risk of being exploited. Additionally, it is
largely unaffected by anonymization.
In contrast, \texttt{Llama3} cooperates across all conditions and prompts,
indicating a failure to internalize role differences or payoff structures. This
model appears biased toward cooperation, likely due to training data priors,
rather than engaging in context-sensitive reasoning. Conversely, \texttt{Qwen3}
exhibits the opposite failure mode: it is overly rigid, rarely cooperating even
under Human prompts, and shows erratic drops in cooperation under anonymization,
suggesting semantic overreliance and poor role alignment.
indicating a failure to internalize role differences or payoff structures. The
model exhibits a strong predisposition to cooperate, regardless of context,
likely due to training data priors. Conversely, \texttt{Qwen3} exhibits the
opposite failure mode: it is overly rigid, rarely cooperating even under Human
prompts, and shows erratic drops in cooperation under anonymization, suggesting
semantic overreliance and poor role alignment.
......@@ -30,7 +30,7 @@ this study assesses the capabilities of models such as
We focus on their ability to
make credible one-shot decisions, generate human-like strategies, adapt to their
environment, and coordinate in social interactions\footnote{All code, prompts,
and data traces will be available in a public repository.}.
and data traces are available in a public repository~\cite{pygaamas}.}.
%All code, prompts,
% and data traces are available in a public repository~\cite{pygaamas}.
%These capabilities are evaluated through a series of
......@@ -55,7 +55,7 @@ credible behavior simulating human-like decision-making in Sec.~\ref{sec:human}.
Sec.~\ref{sec:belief} examines the ability of GAs to refine their beliefs about
an opponent's next move and to integrate these predictions into their
decision-making, while Sec.~\ref{sec:coordination} investigates how they
coordinate with other agents. The paper concludes in Sec.~\ref{sec:conclusion}.
coordinate. The paper concludes in Sec.~\ref{sec:conclusion}.
%where we summarize the main contributions and
%propose directions for future research.
@Misc{pygaamas,
author = {St\'ephane Bonnevay and Maxime Morge},
author = {Anonymous},
title = {Python Generative Autonomous Agents and Multi-Agent Systems},
howpublished = {https://gitlab.liris.cnrs.fr/mmorge/pygaamas},
howpublished = {https://zenodo.org/records/15608944},
year = {2025}
}
......@@ -302,7 +302,7 @@ doi = {10.1177/1043463195007001004}
}
@misc{hua24arxiv,
title={Game-theoretic LLM: Agent Workflow for Negotiation Games},
title={{Game-theoretic LLM: Agent Workflow for Negotiation Games}},
author={Wenyue Hua and Ollie Liu and Lingyao Li and Alfonso Amayuelas and
Julie Chen and Lucas Jiang and Mingyu Jin and Lizhou Fan and
Fei Sun and William Wang and Xintong Wang and Yongfeng Zhang},
......
......@@ -37,11 +37,11 @@ Mail}
% in the abstract
\begin{abstract}
Recent advances in Large Language Models (LLMs) have enabled the creation of
Generative Agents (GAs) capable of autonomous decision-making in interactive
settings. This paper investigates whether GAs can exhibit socially credible
Generative Agents (GAs) capable of autonomous decision-making in interaction.
This paper investigates whether GAs can exhibit socially credible
behavior. %, with a particular focus on their ability to coordinate.
Drawing from behavioral game theory, we evaluate five state-of-the-art models
across three canonical game-theoretic environments. Our results show that
across three canonical game-theoretic environments. Our results show that,
while some GAs can accurately predict their opponent’s behavior, few are able
to incorporate those predictions into decision-making. These behavioral flaws
help explain why coordination remains especially challenging: most models
......
......@@ -52,8 +52,12 @@ lacking humans’ sensitivity to incentives.
% behavior, thereby lacking the sensitivity to incentives that is characteristic
% of human-like reasoning.
While Morge~\cite{morge25paams} evaluates GAs on economic rationality and
strategic reasoning, we focus on their ability to make credible one-shot
decisions, generate human-like strategies, adapt to their environment, and
coordinate in social interactions.
Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether agents understand
Fontana \textit{et al.}~\cite{fontana24arxiv} assess whether GAs understand
game rules and history ex post, but not whether this informs their decisions. We
instead evaluate whether agents explicitly incorporate beliefs and opponent modeling
into their strategies.
......@@ -71,7 +75,7 @@ failing, for instance, to adopt basic conventions such as alternation in the
Battle of the Sexes game. To address this, they propose prompting agents to
imagine possible actions and their consequences before deciding. However, this
conditional reasoning proves effective mainly for smaller models and may degrade
performance in larger ones due to added complexity. %~\cite{pygaamas}
performance in larger ones due to added complexity~\cite{pygaamas}.
While Akata \textit{et al.} attribute these failures to limited predictive
ability and a tendency to rigidly favor preferred options, we argue that the
most fundamental cause is GAs' inability to incorporate their beliefs into the
......@@ -112,10 +116,6 @@ models that can run on standard hardware. %~\cite{pygaamas}
% prompting LLMs to generate algorithmic strategies, as in~\cite{willis25arxiv},
% rather than issuing multiple one-shot queries.
While Morge~\cite{morge25paams} evaluates GAs
on economic rationality and strategic reasoning, we focus on their ability to
make credible one-shot decisions, generate human-like strategies, adapt to their
environment, and coordinate in social interactions.
Hua \textit{et al.}~\cite{hua24arxiv} show that GAs deviate from rationality as
game complexity increases, and highlight the role of communication in fostering
......