From 6013a082918b618f38b959a016efd8ddbae38206 Mon Sep 17 00:00:00 2001
From: mmorge <maxime.morge@univ-lyon1.fr>
Date: Sat, 3 May 2025 17:15:14 +0200
Subject: [PATCH] PyGAAMAS: Update synthesis with Qwen3 outcome

---
 README.md | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/README.md b/README.md
index 5c47ce9..2db9bd7 100644
--- a/README.md
+++ b/README.md
@@ -438,26 +438,29 @@ incorporate other agents’ actions into their decision-making.
 
 ## Synthesis
 
-Our findings reveal notable differences in the cognitive capabilities of LLMs
-across multiple dimensions of decision-making.
-<tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making,
-with <tt>Llama3</tt> showing moderate adherence and </tt>DeepSeek-R1</tt> displaying considerable inconsistency.
+Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of
+decision-making. <tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making,
+with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency.
+<tt>Qwen3</tt> performs moderately well, showing rational behavior but struggling with implicit reasoning.
 <tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences,
-particularly when generating algorithmic strategies rather than isolated one-shot actions.
-These models tend to struggle more with one-shot decision-making, where responses are less structured and
-more prone to inconsistency. In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and
-performs poorly in aligning actions with specified preferences.
-<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behavior at both first- and second-order levels.
-<tt>Llama3</tt>, although prone to random behavior when generating strategies, adapts more effectively in one-shot
-decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly in both strategic and one-shot formats, rarely
-exhibiting coherent rationality.
+particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to
+struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency.
+In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and performs poorly in aligning actions with
+specified preferences. <tt>Qwen3</tt> aligns well with utilitarian preferences and moderately with altruistic
+ones but struggles with egoistic and egalitarian preferences.
+
+<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behavior at both
+first- and second-order levels. <tt>Llama3</tt>, although prone to random behavior when generating strategies,
+adapts more effectively in one-shot decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly
+in both strategic and one-shot formats, rarely exhibiting coherent rationality. <tt>Qwen3</tt> shows strong
+first-order rationality when producing actions, especially under explicit or guided conditions,
+but struggles with deeper inferential reasoning.
 
 All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents
-into their own decisions. Despite some being able to identify patterns,
-most fail to translate these beliefs into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to
-infer and act on opponents’ simple behaviour
-
+into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs
+into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to infer and act on
+opponents’ simple behavior.
 
 ## Authors
 
 Maxime MORGE
-- 
GitLab