diff --git a/README.md b/README.md
index 2db9bd717c85b87815ec3d7fc805abf0b38c9d15..8df040cbc95b9a9cc422928e23a26974c674dc29 100644
--- a/README.md
+++ b/README.md
@@ -26,8 +26,8 @@ erratically to changes in the game’s parameters.
In this game, an investor allocates a basket $x_t=(x^A_t, x^B_t)$ of $100$ points between two assets: Asset A and Asset B. The value of these points depends on random prices $p_t=(p_{t}^A, p_t^B)$, which determine the monetary return per allocated point. For example, if $p_t^A = 0.8$ and $p_t^B = 0.5$,
-each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$. T
-he game is played $25$ times to assess the consistency of the investor’s decisions.
+each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$.
+The game is played $25$ times to assess the consistency of the investor’s decisions.

To evaluate the rationality of the decisions, we use Afriat's critical cost efficiency index (CCEI), a widely used measure in
@@ -274,22 +274,6 @@ informed decision-making.
The table below evaluates the models' ability to generate second-order rational behaviour for player 1. The configurations where CR improves second-order rationality are in bold, and those where CR degrades this rationality are in italics.
-When the models generate strategies, <tt>GPT-4.5</tt> exhibits second-order
-rational behaviour in configurations (a), (c), and (d), but fails in
-configuration (b) to distinguish the optimal action from a nearly optimal one.
-Llama3 makes its decision randomly. Mistral-Small shows strong
-capabilities in generating second-order rational behaviour. DeepSeek-R1
-does not produce valid responses.
-
-When generating actions, <tt>Llama3</tt> adapts to different types of beliefs
-and adjustments in the payoff matrix.
<tt>GPT-4.5</tt> performs well in the
-initial configuration (a), but encounters significant difficulties when the
-payoff structure changes (b, c, d), particularly with implicit beliefs. Although
-Mistral-Small works well with given or explicit beliefs, it faces
-difficulties with implicit beliefs, especially in variant (d).
-<tt>DeepSeek-R1</tt> does not appear to be a good candidate for simulating
-second-order rationality.
-
When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations except (b), where it fails to distinguish the optimal action from a nearly optimal one. Llama3 makes decisions randomly, showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt>
@@ -297,7 +281,7 @@ demonstrate strong capabilities across all conditions, consistently generating
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs. <tt>Qwen3</tt> generates irrational strategies. <tt>DeepSeek-R1</tt> does not produce valid responses in strategy generation.
-When generating actions, Llama3.3:latest adapts well to different types of beliefs and adjustments in the payoff matrix
+When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix
but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs
@@ -336,7 +320,7 @@ particularly in less confident or under-specified contexts.
| | actions + CR | *0.90* | *0.90* | *0.86* | *0.50* | *0.50* | *0.50* | *0.76* | 0.96 | *0.70* | *0.67* | *0.83* | 0.67 |
| **Mixtral:8x7b** | actions | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 0.50 | 1.0 | 1.0 | 1.0 | 1.00 | 1.00 | 0.73 |
| | actions + CR | 1.00 | *0.96* | 1.00 | 1.00 | 1.00 | **1.0** | 1.0 | 1.0 | 1.0 | 1.00 | 1.00 | *0.28* |
-| **Listral-Small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
+| **Mistral-Small** | actions | 0.93 | 0.97 | 1.00 | 0.87 | 0.77 | 0.60 | 0.77 | 0.60 | 0.70 | 0.73 | 0.57 | 0.37 |
| | actions + CR | **1.00** | *0.93* | 1.00 | **0.95** | **0.96** | **0.90** | **0.90** | **0.76** | *0.43* | *0.67* | *0.40* | 0.37 |
| **Deepseek-R1:7b** | actions | 1.00 | 0.96 | 1.00 | 1.00 | 1.00 | 0.93 | 0.96 | 1.00 | 0.92 | 0.96 | 1.00 | 0.79 |
| | actions + CR | 1.00 | **1.00** | 1.00 | 1.00 | 1.00 | **1.00** | *0.90* | 1.00 | **1.00** | **1.00** | 1.00 | **1.00** |
@@ -422,11 +406,11 @@ move into their decision-making, we analyse
the performance of each generative agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point, and a loss 0 points.
-Figures below illustrates the average points earned per round along with
+The figure below illustrates the average points earned per round along with
the 95 % confidence interval for each LLM when facing constant strategies, when the model generates one-shot actions.
-Even if <tt>Mixtral:8x7b</tt>, <tt>Mistral-Small</tt>, and <tt><Qwen3/tt> accurately predict its
-opponent’s move, they fails to integrate this belief into
+Even though <tt>Mixtral:8x7b</tt>, <tt>Mistral-Small</tt>, and <tt>Qwen3</tt> accurately predict their
+opponent’s move, they fail to integrate this belief into
their decision-making process. Only <tt>Llama3.3:latest</tt> is capable of inferring the opponent’s behavior to choose the winning move.
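Since the CCEI underlies the rationality scores discussed above, the sketch below illustrates one standard way to compute Afriat's index: binary-search for the largest efficiency level $e$ at which the observed (price, bundle) pairs satisfy GARP. This is a minimal reconstruction for illustration, not the implementation used in the experiments; the names `ccei`, `prices`, `bundles`, and the tolerance are our own choices.

```python
def ccei(prices, bundles, tol=1e-4):
    """Afriat's critical cost efficiency index: the largest e in [0, 1]
    such that the data satisfy GARP when every budget is scaled by e."""
    dot = lambda p, x: sum(pi * xi for pi, xi in zip(p, x))

    def satisfies_garp(e):
        n = len(bundles)
        # x_t is directly revealed preferred to x_s if e * p_t.x_t >= p_t.x_s
        R = [[e * dot(prices[t], bundles[t]) >= dot(prices[t], bundles[s])
              for s in range(n)] for t in range(n)]
        # transitive closure of the revealed-preference relation (Warshall)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    R[i][j] = R[i][j] or (R[i][k] and R[k][j])
        # GARP violation: x_t revealed preferred to x_s, yet x_s is
        # strictly directly revealed preferred to x_t
        return not any(
            R[t][s] and e * dot(prices[s], bundles[s]) > dot(prices[s], bundles[t])
            for t in range(n) for s in range(n))

    # largest e passing the GARP test, found by bisection on [0, 1]
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For perfectly consistent allocations the index is numerically 1, while a dataset with a direct preference cycle (e.g. allocating everything to Asset B when B is expensive and everything to Asset A when A is expensive) scores around 0.5.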