Commit 35babfbe authored by Maxime Morge
PyGAAMAS: XP conclusions about role-playing, payoff sensitivity, and semantic robustness

parent 53fb93dd
@@ -559,11 +559,15 @@ rather than context-sensitive reasoning. <tt>Qwen3</tt> exhibits the opposite failure,
rarely cooperating even under <tt>Human</tt> prompts, and shows erratic drops in cooperation under anonymization,
indicating semantic overreliance and poor role alignment.
It is worth noting that most LLMs are unable to generate strategies for this game, and the strategies they do generate
are insensitive to the role being played.
Overall, few models achieve the desired trifecta of role fidelity (behaving distinctly across prompts),
payoff awareness (adjusting behavior with incentives), and semantic robustness
(insensitivity to superficial label changes).
Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.

| **Version**         |                | **Classic**  |             |           | **High**     |             |           | **Mild**     |             |           | **Coop. Loss**  |             |           |
|---------------------|----------------|--------------|-------------|-----------|--------------|-------------|-----------|--------------|-------------|-----------|-----------------|-------------|-----------|
| **Model**           | **Generation** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational**    | **Neutral** | **Human** |
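The table reports, for each payoff version (<tt>Classic</tt>, <tt>High</tt>, <tt>Mild</tt>) and prompt (<tt>Rational</tt>, <tt>Neutral</tt>, <tt>Human</tt>), the observed cooperation rate, together with the cooperation loss induced by anonymizing the action labels. Below is a minimal sketch of how such cells could be computed; the record fields (`version`, `prompt`, `anonymized`, `action`) are illustrative assumptions, not the repository's actual schema.

```python
from collections import defaultdict

def cooperation_rates(records):
    """Aggregate cooperation rates keyed by (version, prompt, anonymized),
    plus the cooperation loss when action labels are anonymized.
    Record fields are illustrative: version, prompt, anonymized, action."""
    counts = defaultdict(lambda: [0, 0])  # key -> [cooperations, total]
    for r in records:
        key = (r["version"], r["prompt"], r["anonymized"])
        counts[key][1] += 1
        if r["action"] == "Cooperate":
            counts[key][0] += 1
    rates = {k: coop / total for k, (coop, total) in counts.items()}
    # Coop. Loss: drop in cooperation once labels are anonymized.
    loss = {(v, p): rates.get((v, p, False), 0.0) - rates.get((v, p, True), 0.0)
            for (v, p, _) in rates}
    return rates, loss
```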
@@ -588,7 +592,7 @@ Most lean toward either rigid rationality, indiscriminate cooperation, or unstable, incoherent behavior.
Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of
decision-making. <tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making,
with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.
<tt>Qwen3</tt> performs moderately well, showing rational behavior but struggling with implicit reasoning.
<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences,
@@ -607,9 +611,16 @@ but struggles with deeper inferential reasoning.
All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents
into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs
into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to infer and act on
opponents’ simple behavior.
Whether generating actions or strategies, most LLMs tend to exhibit either rigid rationality,
indiscriminate cooperation, or unstable and incoherent behavior.
Except for <tt>Mistral-Small</tt>, the models do not achieve the desired combination of three criteria:
the ability to adopt a role (behaving differently based on instructions),
payoff sensitivity (adjusting behavior according to incentives),
and semantic robustness (remaining unaffected by superficial label changes).
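These three criteria can be operationalized over cooperation rates such as those in the results table. A rough sketch follows, assuming rates keyed by (version, prompt, anonymized) as in the earlier sketch, with all keys present; the 0.1 threshold is an illustrative choice, not a definition used in the experiments.

```python
def meets_criteria(rates, eps=0.1):
    """Check the three criteria from cooperation rates keyed by
    (version, prompt, anonymized); the eps threshold is illustrative."""
    # Role adoption: Rational vs. Human prompts should yield distinct behavior.
    role = abs(rates[("Classic", "Rational", False)]
               - rates[("Classic", "Human", False)]) > eps
    # Payoff sensitivity: milder payoffs should raise cooperation.
    payoff = rates[("Mild", "Neutral", False)] > rates[("High", "Neutral", False)] + eps
    # Semantic robustness: anonymized labels should barely change behavior.
    robust = abs(rates[("Classic", "Neutral", True)]
                 - rates[("Classic", "Neutral", False)]) < eps
    return {"role_adoption": role,
            "payoff_sensitivity": payoff,
            "semantic_robustness": robust}
```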
## Authors

Maxime MORGE
@@ -156,12 +156,34 @@ class PD:
    def apply_strategy(self) -> Dict:
        """Hard-coded strategy: replay the strategy generated by each model (None when no strategy is produced)."""
        if self.model == "gpt-4.5-preview-2025-02-27":
            # Payoff values T, R, P, S are assumed to be in scope; their
            # definitions fall outside this hunk.
            if (R - P) > 1 and (S >= 0):  # Favoring cooperation in milder scenarios
                action = self.Cooperate
                rationality = False
                reasoning = (
                    f"I chose {self.Cooperate} because the reward for mutual cooperation (R={R}) "
                    f"is significantly better than mutual defection (P={P}), "
                    f"and the risk of being betrayed (S={S}) is tolerable."
                )
            else:
                action = self.Defect
                rationality = True
                reasoning = (
                    f"I chose {self.Defect} because the temptation payoff (T={T}) and punishment (P={P}) "
                    f"make it more advantageous or safer than risking betrayal (S={S})."
                )
            return {
                "action": action,
                "rationality": rationality,
                "reasoning": reasoning
            }
        # Membership test: a chain of `or`-ed string literals would always be truthy.
        if self.model in ("mistral-small", "qwen3", "llama3.3:latest", "mixtral:8x7b"):
            return None  # these models fail to generate a strategy for this game
        if self.model in ("llama3", "deepseek-r1"):
            # "qwen3" is not repeated here: the branch above already returns None for it.
            return {
                "action": self.Cooperate,
                "rationality": False,
                "reasoning": "I'm playing fairly"
            }
        return None
    async def run_pagoda(self, instruction) -> Dict:
        url = self.base_url
@@ -272,6 +294,6 @@ if __name__ == "__main__":
        anonymized=True,
        strategy=False
    )
    # "gpt-4.5-preview-2025-02-27", "llama3", "mistral-small", "deepseek-r1", "qwen3", "llama3.3:latest", "deepseek-r1:7b", "mixtral:8x7b"
    result = asyncio.run(pd.run())
    print(result)
\ No newline at end of file
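For reference, the hard-coded <tt>gpt-4.5</tt> branch above deviates from defection only when (R - P) > 1 and S >= 0, even though defection strictly dominates in any true one-shot Prisoner's Dilemma. A minimal dominance check makes that baseline explicit; the payoff values in the example are the canonical textbook ones, not necessarily those of the <tt>Classic</tt>, <tt>High</tt>, or <tt>Mild</tt> variants.

```python
def dominant_action(T: float, R: float, P: float, S: float):
    """Return the strictly dominant action in a symmetric 2x2 game,
    or None if neither action dominates.

    Defection dominates when it pays more against a cooperator (T > R)
    and against a defector (P > S): the defining PD inequalities."""
    if T > R and P > S:
        return "D"
    if R > T and S > P:
        return "C"
    return None

# Canonical textbook payoffs T=5, R=3, P=1, S=0 satisfy T > R > P > S.
assert dominant_action(5, 3, 1, 0) == "D"
```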