diff --git a/README.md b/README.md
index 2ff4373573a95fbd2a8100959aee42965dab2fcc..14ea787609fb9b90c0c8ed58e9a0717bbc06f51c 100644
--- a/README.md
+++ b/README.md
@@ -3,65 +3,62 @@
 Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate the social behaviors of LLM-based agents.
-This prototype allows to analyse the potential of Large Language Models (LLMs) for
-social simulation by assessing their ability to: (a) make decisions aligned
-with explicit preferences; (b) adhere to principles of rationality; and (c)
-refine their beliefs to anticipate the actions of other agents. Through
-game-theoretic experiments, we show that certain models, such as
-\texttt{GPT-4.5} and \texttt{Mistral-Small}, exhibit consistent behaviours in
-simple contexts but struggle with more complex scenarios requiring
-anticipation of other agents' behaviour. Our study outlines research
-directions to overcome the current limitations of LLMs.
-
-## Consistency
-
-To evaluate the decision-making consistency of various LLMs, we introduce an investment
-game designed to test whether these models follow stable decision-making patterns or
-react erratically to changes in the game’s parameters.
-
-In the game, an investor allocates a basket \((p_t^A, p_t^B)\) of 100 points between two assets:
-Asset A and Asset B. The value of these points depends on two random parameters \((a_t, b_t)\),
-which determine the monetary return per allocated point.
-
-For example, if \(a_t = 0.8\) and \(b_t = 0.5\), each point assigned to Asset A is worth $0.8,
-while each point allocated to Asset B yields $0.5. The game is played 25 times to assess
-the consistency of the investor’s decisions.
-
-To evaluate the rationality of the decisions, we use the **Critical Cost Efficiency Index (CCEI)**,
-a widely used measure in experimental economics and behavioral sciences.
-The CCEI assesses
-whether choices adhere to the **Generalized Axiom of Revealed Preference (GARP)**,
-a fundamental principle of rational decision-making.
-
-If an individual violates rational choice consistency,
-the CCEI determines the minimal budget adjustment required to make their
-decisions align with rationality. Mathematically, the budget for each basket is calculated as:
-
-\[
-I_t = p_t^A \times a_t + p_t^B \times b_t
-\]
-
-The CCEI is derived from observed decisions by solving a linear optimization
-problem that finds the largest \(\lambda\) (where \(0 \leq \lambda \leq 1\))
-such that for every observation, the adjusted decisions satisfy the rationality constraint:
-
-\[
-p^_t \cdot x_s \leq \lambda I_t
-\]
-
-This means that if we slightly reduce the budget (multiplying it by \(\lambda\)),
-the choices will become consistent with rational decision-making.
-A CCEI close to 1 indicates high rationality and consistency with economic theory.
-A low CCEEI** suggests irrational or inconsistent decision-making.
-
-To ensure response consistency, each model undergoes 30 iterations of the game
-with a fixed temperature of 0.0.
-
-The results indicate significant differences in decision-making consistency among the evaluated models.
-Mistral-Small demonstrates the highest level of rationality, with CCEI values consistently above 0.75.
-Llama 3 performs moderately well, with CCEI values ranging between 0.2 and 0.74.
-DeepSeek R1 exhibits inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83
-
-
+This prototype explores the potential of *homo silicus* for social
+simulation. We examine the behaviour exhibited by intelligent
+machines, particularly how generative agents deviate from
+the principles of rationality. To assess their responses to simple human-like
+strategies, we employ a series of tightly controlled and theoretically
+well-understood games.
+Through behavioral game theory, we evaluate the ability
+of <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and
+<tt>DeepSeek-R1</tt> to make coherent one-shot
+decisions, generate algorithmic strategies based on explicit preferences, adhere
+to first- and second-order rationality principles, and refine their beliefs in
+response to other agents’ behaviours.
+
+
+## Economic Rationality
+
+## Evaluating Economic Rationality in LLMs
+
+To evaluate the economic rationality of various LLMs, we introduce an investment game
+designed to test whether these models follow stable decision-making patterns or react
+erratically to changes in the game’s parameters.
+
+In this game, an investor allocates a basket $x_t=(x^A_t, x^B_t)$ of $100$ points between
+two assets: Asset A and Asset B. The value of these points depends on random prices $p_t=(p_t^A, p_t^B)$,
+which determine the monetary return per allocated point. For example, if $p_t^A = 0.8$ and $p_t^B = 0.5$,
+each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$.
+The game is played $25$ times to assess the consistency of the investor’s decisions.
+
+To evaluate the rationality of the decisions, we use Afriat's
+critical cost efficiency index (CCEI), a widely used measure in
+experimental economics. The CCEI assesses whether choices adhere to the
+generalized axiom of revealed preference (GARP), a fundamental principle of
+rational decision-making. If an individual violates rational choice consistency,
+the CCEI determines the minimal budget adjustment required to make their
+decisions align with rationality. Mathematically, the budget for each basket is
+calculated as: $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$.
+The CCEI is
+derived from observed decisions by solving a linear optimization problem that
+finds the largest $\lambda$, where $0 \leq \lambda \leq 1$, such that for every
+observation, the adjusted decisions satisfy the rationality constraint:
+$p_t \cdot x_s \leq \lambda I_t$, where $x_s$ is a basket chosen in another round $s$.
+This means that if we slightly reduce the budget, multiplying it by $\lambda$,
+the choices will become consistent with rational decision-making.
+A CCEI close to 1 indicates high rationality and consistency
+with economic theory. A low CCEI suggests irrational or inconsistent
+decision-making.
+
+To ensure response consistency, each model undergoes $30$ iterations of the game
+with a fixed temperature of $0.0$. The results shown in
+the figure below highlight significant differences in decision-making
+consistency among the evaluated models. <tt>GPT-4.5</tt>, <tt>Llama3.3:latest</tt>
+and <tt>DeepSeek-R1:7b</tt> stand out with a
+perfect CCEI score of 1.0, indicating flawless rationality in decision-making.
+<tt>Mistral-Small</tt> and <tt>Mixtral:8x7b</tt> demonstrate the next highest level of rationality.
+<tt>Llama3</tt> performs moderately well, with CCEI values ranging between 0.2 and 0.74.
+<tt>DeepSeek-R1</tt> exhibits
+inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83.
+
+
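To make the CCEI described in the added README text concrete, here is a minimal sketch of one way it could be computed. It is not taken from the repository: the function names `satisfies_garp` and `ccei`, the `tol` parameter, and the sample data are hypothetical. The README mentions a linear optimization problem; this sketch uses a bisection over $\lambda$ combined with a GARP check instead, a common alternative that relies on GARP at level $\lambda$ being easier to satisfy as $\lambda$ decreases.

```python
import numpy as np


def satisfies_garp(prices, bundles, lam=1.0):
    """Return True if the choices satisfy GARP at efficiency level lam.

    prices[t] is the price vector p_t and bundles[t] the basket x_t of round t.
    (Hypothetical helper, not part of the repository.)
    """
    prices = np.asarray(prices, dtype=float)
    bundles = np.asarray(bundles, dtype=float)
    # cost[t, s] = p_t . x_s, the cost of basket s at the prices of round t
    cost = prices @ bundles.T
    budget = np.diag(cost)  # I_t = p_t . x_t
    # Direct revealed preference: x_t R x_s when p_t . x_s <= lam * I_t
    revealed = cost <= lam * budget[:, None]
    # Transitive closure of R (Warshall's algorithm)
    for k in range(len(budget)):
        revealed = revealed | (revealed[:, [k]] & revealed[[k], :])
    # Strict direct preference: x_s P x_t when p_s . x_t < lam * I_s
    strict = cost < lam * budget[:, None]
    # GARP is violated if x_t R x_s while x_s is strictly preferred to x_t
    return not np.any(revealed & strict.T)


def ccei(prices, bundles, tol=1e-6):
    """Approximate the largest lam in [0, 1] for which GARP holds (bisection)."""
    if satisfies_garp(prices, bundles, 1.0):
        return 1.0
    low, high = 0.0, 1.0
    while high - low > tol:
        mid = (low + high) / 2
        if satisfies_garp(prices, bundles, mid):
            low = mid
        else:
            high = mid
    return low


# Illustrative data only: two rounds, two assets.
sample_prices = [[1.0, 2.0], [2.0, 1.0]]
consistent_bundles = [[80.0, 20.0], [20.0, 80.0]]
inconsistent_bundles = [[20.0, 80.0], [80.0, 20.0]]
print(ccei(sample_prices, consistent_bundles))
print(ccei(sample_prices, inconsistent_bundles))
```

With this sample data, the first set of choices satisfies GARP outright, so its index is 1, while the second set contains a revealed-preference cycle and gets an index close to 2/3. The bisection returns an approximation within `tol` of the true value.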