Commit 35babfbe authored by Maxime Morge
PyGAAMAS: XP conclusions about role-playing, payoff sensitivity, and semantic robustness

parent 53fb93dd
@@ -559,11 +559,15 @@ rather than context-sensitive reasoning. <tt>Qwen3</tt> exhibits the opposite failure,
rarely cooperating even under <tt>Human</tt> prompts, and shows erratic drops in cooperation under anonymization,
indicating semantic overreliance and poor role alignment.
It is worth noting that most LLMs are unable to generate strategies for this game, and the strategies they do generate
are insensitive to the role being played.
Overall, few models achieve the desired trifecta of role fidelity (behaving distinctly across prompts),
payoff awareness (adjusting behavior with incentives), and semantic robustness
(insensitivity to superficial label changes).
Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.

| **Version**         |                | **Classic**  |             |           | **High**     |             |           | **Mild**     |             |           | **Coop. Loss**  |             |           |
|---------------------|----------------|--------------|-------------|-----------|--------------|-------------|-----------|--------------|-------------|-----------|-----------------|-------------|-----------|
| **Model**           | **Generation** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational**    | **Neutral** | **Human** |
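The table reports, for each payoff version (<tt>Classic</tt>, <tt>High</tt>, <tt>Mild</tt>) and prompt (<tt>Rational</tt>, <tt>Neutral</tt>, <tt>Human</tt>), the observed cooperation rate, together with the cooperation loss induced by anonymizing the action labels. Below is a minimal sketch of how such cells could be computed; the record fields (`version`, `prompt`, `anonymized`, `action`) are illustrative assumptions, not the repository's actual schema.

```python
from collections import defaultdict

def cooperation_rates(records):
    """Aggregate cooperation rates keyed by (version, prompt, anonymized),
    plus the cooperation loss when action labels are anonymized.
    Record fields are illustrative: version, prompt, anonymized, action."""
    counts = defaultdict(lambda: [0, 0])  # key -> [cooperations, total]
    for r in records:
        key = (r["version"], r["prompt"], r["anonymized"])
        counts[key][1] += 1
        if r["action"] == "Cooperate":
            counts[key][0] += 1
    rates = {k: coop / total for k, (coop, total) in counts.items()}
    # Coop. Loss: drop in cooperation once labels are anonymized.
    loss = {(v, p): rates.get((v, p, False), 0.0) - rates.get((v, p, True), 0.0)
            for (v, p, _) in rates}
    return rates, loss
```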
@@ -588,7 +592,7 @@ Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.
Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of
decision-making. <tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making,
with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.
<tt>Qwen3</tt> performs moderately well, showing rational behavior but struggling with implicit reasoning.
<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences,
@@ -607,9 +611,16 @@ but struggles with deeper inferential reasoning.
All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents
into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs
into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to infer and act on
opponents’ simple behavior.
Whether generating actions or strategies, most LLMs tend to exhibit either rigid rationality,
indiscriminate cooperation, or unstable and incoherent behavior.
Except for <tt>Mistral-Small</tt>, the models do not achieve the desired combination of three criteria:
the ability to adopt a role (behaving differently based on instructions),
payoff sensitivity (adjusting behavior according to incentives),
and semantic robustness (remaining unaffected by superficial label changes).
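These three criteria can be operationalized over cooperation rates such as those in the results table. A rough sketch follows, assuming rates keyed by (version, prompt, anonymized) as in the earlier sketch, with all keys present; the 0.1 threshold is an illustrative choice, not a definition used in the experiments.

```python
def meets_criteria(rates, eps=0.1):
    """Check the three criteria from cooperation rates keyed by
    (version, prompt, anonymized); the eps threshold is illustrative."""
    # Role adoption: Rational vs. Human prompts should yield distinct behavior.
    role = abs(rates[("Classic", "Rational", False)]
               - rates[("Classic", "Human", False)]) > eps
    # Payoff sensitivity: milder payoffs should raise cooperation.
    payoff = rates[("Mild", "Neutral", False)] > rates[("High", "Neutral", False)] + eps
    # Semantic robustness: anonymized labels should barely change behavior.
    robust = abs(rates[("Classic", "Neutral", True)]
                 - rates[("Classic", "Neutral", False)]) < eps
    return {"role_adoption": role,
            "payoff_sensitivity": payoff,
            "semantic_robustness": robust}
```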
## Authors

Maxime MORGE
@@ -156,12 +156,34 @@ class PD:
    def apply_strategy(self) -> Dict:
        """Hard-coded strategy: replay the strategy generated by each model (None when no strategy is produced)."""
        if self.model == "gpt-4.5-preview-2025-02-27":
            # Payoff values T, R, P, S are assumed to be in scope; their
            # definitions fall outside this hunk.
            if (R - P) > 1 and (S >= 0):  # Favoring cooperation in milder scenarios
                action = self.Cooperate
                rationality = False
                reasoning = (
                    f"I chose {self.Cooperate} because the reward for mutual cooperation (R={R}) "
                    f"is significantly better than mutual defection (P={P}), "
                    f"and the risk of being betrayed (S={S}) is tolerable."
                )
            else:
                action = self.Defect
                rationality = True
                reasoning = (
                    f"I chose {self.Defect} because the temptation payoff (T={T}) and punishment (P={P}) "
                    f"make it more advantageous or safer than risking betrayal (S={S})."
                )
            return {
                "action": action,
                "rationality": rationality,
                "reasoning": reasoning
            }
        # Membership test: a chain of `or`-ed string literals would always be truthy.
        if self.model in ("mistral-small", "qwen3", "llama3.3:latest", "mixtral:8x7b"):
            return None  # these models fail to generate a strategy for this game
        if self.model in ("llama3", "deepseek-r1"):
            # "qwen3" is not repeated here: the branch above already returns None for it.
            return {
                "action": self.Cooperate,
                "rationality": False,
                "reasoning": "I'm playing fairly"
            }
        return None
    async def run_pagoda(self, instruction) -> Dict:
        url = self.base_url
@@ -272,6 +294,6 @@ if __name__ == "__main__":
        anonymized=True,
        strategy=False
    )
    # "gpt-4.5-preview-2025-02-27", "llama3", "mistral-small", "deepseek-r1", "qwen3", "llama3.3:latest", "deepseek-r1:7b", "mixtral:8x7b"
    result = asyncio.run(pd.run())
    print(result)
\ No newline at end of file
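For reference, the hard-coded <tt>gpt-4.5</tt> branch above deviates from defection only when (R - P) > 1 and S >= 0, even though defection strictly dominates in any true one-shot Prisoner's Dilemma. A minimal dominance check makes that baseline explicit; the payoff values in the example are the canonical textbook ones, not necessarily those of the <tt>Classic</tt>, <tt>High</tt>, or <tt>Mild</tt> variants.

```python
def dominant_action(T: float, R: float, P: float, S: float):
    """Return the strictly dominant action in a symmetric 2x2 game,
    or None if neither action dominates.

    Defection dominates when it pays more against a cooperator (T > R)
    and against a defector (P > S): the defining PD inequalities."""
    if T > R and P > S:
        return "D"
    if R > T and S > P:
        return "C"
    return None

# Canonical textbook payoffs T=5, R=3, P=1, S=0 satisfy T > R > P > S.
assert dominant_action(5, 3, 1, 0) == "D"
```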