Agents & Tool Use

MAST

MAST: per-trace annotations of 1242 multi-agent LLM system execution traces across 7 frameworks (ChatDev, MetaGPT, AppWorld, AG2, HyperAgent, Magentic, OpenManus) for 14 failure modes from the Multi-Agent Systems Failure Taxonomy. Each (system, trace) cell is labelled present/absent for each of the 14 failure modes by the LLM annotator (OpenAI o1, calibrated to kappa=0.77 with human experts).
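
The agreement figure quoted above can be illustrated with a short sketch. It assumes the statistic is Cohen's kappa between one human label sequence and one LLM label sequence for a single failure mode; the card does not say which kappa variant was used, and the label values below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical present (1) / absent (0) labels for one failure mode on 10 traces.
human = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
llm = [1, 0, 0, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(human, llm)
print(round(kappa, 3))
```

Kappa discounts the agreement two annotators would reach by chance, which matters here because most failure modes are absent in most traces and raw agreement would look high even for an uninformative annotator.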

1,242 items
9 subjects
11% observed
Subject type: Agent
License: CC-BY-4.0
Domains: tool_use, safety
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 9 subjects × 1,242 items, 11% of cells evaluated.
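
The ordering described above can be sketched as follows. The page does not state the sort key, so this assumes rows and columns are ranked by their mean score over observed cells only; `None` marks an unobserved cell, and the function name and toy matrix are hypothetical.

```python
def order_matrix(matrix):
    """Sort rows by subject mean and columns by item mean, ignoring None cells."""

    def mean(values):
        observed = [v for v in values if v is not None]
        # Entirely unobserved rows or columns get a sentinel below any real rate.
        return sum(observed) / len(observed) if observed else -1.0

    n_rows, n_cols = len(matrix), len(matrix[0])
    row_score = [mean(row) for row in matrix]
    col_score = [mean([matrix[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    # Highest-scoring subjects and easiest items come first (top-left corner).
    row_order = sorted(range(n_rows), key=lambda r: -row_score[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_score[c])
    ordered = [[matrix[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


# Toy 3 subjects x 4 items matrix: 1 = correct, 0 = incorrect, None = unobserved.
matrix = [
    [0, None, 1, None],
    [None, 1, 1, None],
    [0, 0, None, None],
]
ordered, row_order, col_order = order_matrix(matrix)
for row in ordered:
    print(row)
```

With only 11% of cells observed, averaging over observed cells rather than all cells keeps a sparsely evaluated subject from being ranked as if its missing items were failures.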

[Figure: MAST response matrix, AI models (rows) against items (columns). Legend: correct (1), incorrect (0), unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

ChatDev_ProgramDev_GPT4o

[2025-31-03 19:09:41 INFO] [Preprocessing]

ChatDev Starts (20250331190941)

Timestamp: 20250331190941

config_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/ChatChainConfig.json

config_phase_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/PhaseConfig.json

config_role_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/RoleConfig.json

task_prompt: Develop a Checkers (Draughts) game. Use an 8x8 board, alternate turns between two players, and apply standard capture and kinging rules. Prompt for moves in notation (e.g., from-to positions) and update the board state accordingly.

project_name: Checkers

Log File: /home/mert/mlsys/ChatDev/WareHouse/Checkers_PromptedOrg_20250331190941.log

ChatDevConfig: ChatEnvConfig.with_memory: False ChatEnvConfig.clear_structure: True ChatEnvConfig.git_management: False ChatEnvConfig.gui_design: True ChatEnvConfig.incremental_develop: False ChatEnvConfig.background_prompt: ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'.

ChatGPTConfig: ChatGPTConfig(temperature=0.2, top_p=1.0, n=1, stream=False, stop=None, max_tokens=None, presence_penalty=0.0, frequency_penalty=0.0, logit_bias={}, user='')

[2025-31-03 19:09:41 INFO] flask app.py did not start for online log

[2025-31-03 19:09:41 INFO] System: [chatting]

| Parameter | Value |
| --- | --- |
| task_prompt | Develop a Checkers (Draughts) game. Use an 8x8 board, alternate turns between two players, and apply standard capture and kinging rules. Prompt for moves in notation (e.g., from-to positions) and update the board state accordingly. |
| need_reflect | True |
| assistant_role_name | Chief Product Officer |
| user_role_name | Chief Executive Officer |
| phase_prompt | ChatDev has made products in the following form before: Image: can present information in line chart, bar chart, flow chart, cloud chart, Gantt chart, etc. Document: can present information via .docx files. PowerPoint: can present information via .pptx files. Excel: can present information via .xlsx files. PDF: can present information via .pdf files. Website: can present personal resume, tutorial, products, or ideas, via .html files. Application: can implement visualized game, software, tool, etc, via python. Dashboard: can display a panel visualizing real-time information. Mind Map: can represent ideas, with related concepts arranged around a core concept. As the {assistant_role}, to satisfy the new user's demand and the product should be realizable, you should keep discussing with me to decide which product modality do we want the product to be? Note that we must ONLY discuss the product modality and do not discuss anything else! Once we all have expressed our opinion(s) and agree with the results of the discussion unanimously, any of us must actively terminate the discussion by replying with only one line, which starts with a single word , followed by our final product modality without any other words, e.g., " PowerPoint". |
| phase_name | DemandAnalysis |
| assistant_role_prompt | {chatdev_prompt} You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align product initiatives with company objectives. Here is a new customer's task: {task}. To complete the task, you will develop a product plan and provide detailed instructions to your team to ensure the product meets the customer's requirements. |
| user_role_prompt | {chatdev_prompt} You are the Chief Executive Officer (CEO) of ChatDev. Your primary responsibilities include making high-level decisions about policy and strategy, overseeing the overall operations and resources of ChatDev, and acting as the main point of communication between the board and corporate operations. As the CEO, you have the authority to make final decisions and terminate conversations when appropriate. Here is a new customer's task: {task}. To complete the task, you will provide strategic guidance and instructions to your team, ensuring that the solution meets the customer's needs and aligns with the company's objectives. |
| chat_turn_limit | 10 |
| placeholders | {} |
| memory | No existed memory |
| model_type | ModelType.GPT_45 |

[2025-31-03 19:09:41 INFO] flask app.py did not start for online log

[2025-31-03 19:09:41 INFO] System: [RolePlaying]

| Parameter | Value |
| --- | --- |
| assistant_role_name | Chief Product Officer |
| user_role_name | Chief Executive Officer |
| assistant_role_prompt | {chatdev_prompt} You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align product initiatives with company objectives. Here is a new customer's task: {task}. To complete the task, you will develop a product plan and provide detailed instructions to your team to ensure the product meets the customer's requirements. |
| user_role_prompt | {chatdev_prompt} You are the Chief Executive Officer (CEO) of ChatDev. Your primary responsibilities include making high-level decisions about policy and strategy, overseeing the overall operations and resources of ChatDev, and acting as the main point of communication between the board and corporate operations. As the CEO, you have the authority to make final decisions and terminate conversations when appropriate. Here is a new customer's task: {task}. To complete the task, you will provide strategic guidance and instructions to your team, ensuring that the solution meets the customer's needs and aligns with the company's objectives. |
| task_prompt | Develop a Checkers (Draughts) game. Use an 8x8 board, alternate turns between two players, and apply standard capture and kinging rules. Prompt for moves in notation (e.g., from-to positions) and update the board state accordingly. |
| with_task_specify | False |
| memory | No existed memory |
| model_type | ModelType.GPT_45 |
| background_prompt | ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'. |

[2025-31-03 19:09:41 INFO] flask app.py did not start for online log

[2025-31-03 19:09:41 INFO] Chief Executive Officer: [Start Chat]

[ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'. You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align …

Subject outcomes

  • ChatDev/GPT-4o incorrect
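
The log prefixes in the trace above write the date as year-day-month (`2025-31-03`), while the `Timestamp` field uses year-month-day (`20250331190941`). A minimal sketch of reading those prefixes, assuming every log line starts with `[YYYY-DD-MM HH:MM:SS LEVEL]`; the function name and regular expression are illustrative and not part of ChatDev.

```python
import re
from datetime import datetime

# Matches a prefix such as "[2025-31-03 19:09:41 INFO]".
LOG_PREFIX = re.compile(r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+)\]")


def parse_log_prefix(line):
    """Return (datetime, level) for a '[YYYY-DD-MM HH:MM:SS LEVEL]' prefix, else None."""
    match = LOG_PREFIX.match(line)
    if match is None:
        return None
    # Day precedes month in these logs, hence %d before %m.
    stamp = datetime.strptime(match.group(1), "%Y-%d-%m %H:%M:%S")
    return stamp, match.group(2)


parsed = parse_log_prefix("[2025-31-03 19:09:41 INFO] [Preprocessing]")
# Re-render in the year-month-day form used by the Timestamp field.
run_id = parsed[0].strftime("%Y%m%d%H%M%S")
print(parsed[0].isoformat(), parsed[1], run_id)
```

Parsing with an ISO-style `%Y-%m-%d` pattern would fail on these lines, because 31 is not a valid month.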

Item 2 · 0% solve rate

ChatDev_ProgramDev_GPT4o

[2025-31-03 20:15:44 INFO] [Preprocessing]

ChatDev Starts (20250331201544)

Timestamp: 20250331201544

config_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/ChatChainConfig.json

config_phase_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/PhaseConfig.json

config_role_path: /home/mert/mlsys/ChatDev/CompanyConfig/Prompted/RoleConfig.json

task_prompt: Create a budget tracker to monitor expenses and savings

project_name: BudgetTracker

Log File: /home/mert/mlsys/ChatDev/WareHouse/BudgetTracker_PromptedOrg_20250331201544.log

ChatDevConfig: ChatEnvConfig.with_memory: False ChatEnvConfig.clear_structure: True ChatEnvConfig.git_management: False ChatEnvConfig.gui_design: True ChatEnvConfig.incremental_develop: False ChatEnvConfig.background_prompt: ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'.

ChatGPTConfig: ChatGPTConfig(temperature=0.2, top_p=1.0, n=1, stream=False, stop=None, max_tokens=None, presence_penalty=0.0, frequency_penalty=0.0, logit_bias={}, user='')

[2025-31-03 20:15:44 INFO] flask app.py did not start for online log

[2025-31-03 20:15:44 INFO] System: [chatting]

| Parameter | Value |
| --- | --- |
| task_prompt | Create a budget tracker to monitor expenses and savings |
| need_reflect | True |
| assistant_role_name | Chief Product Officer |
| user_role_name | Chief Executive Officer |
| phase_prompt | ChatDev has made products in the following form before: Image: can present information in line chart, bar chart, flow chart, cloud chart, Gantt chart, etc. Document: can present information via .docx files. PowerPoint: can present information via .pptx files. Excel: can present information via .xlsx files. PDF: can present information via .pdf files. Website: can present personal resume, tutorial, products, or ideas, via .html files. Application: can implement visualized game, software, tool, etc, via python. Dashboard: can display a panel visualizing real-time information. Mind Map: can represent ideas, with related concepts arranged around a core concept. As the {assistant_role}, to satisfy the new user's demand and the product should be realizable, you should keep discussing with me to decide which product modality do we want the product to be? Note that we must ONLY discuss the product modality and do not discuss anything else! Once we all have expressed our opinion(s) and agree with the results of the discussion unanimously, any of us must actively terminate the discussion by replying with only one line, which starts with a single word , followed by our final product modality without any other words, e.g., " PowerPoint". |
| phase_name | DemandAnalysis |
| assistant_role_prompt | {chatdev_prompt} You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align product initiatives with company objectives. Here is a new customer's task: {task}. To complete the task, you will develop a product plan and provide detailed instructions to your team to ensure the product meets the customer's requirements. |
| user_role_prompt | {chatdev_prompt} You are the Chief Executive Officer (CEO) of ChatDev. Your primary responsibilities include making high-level decisions about policy and strategy, overseeing the overall operations and resources of ChatDev, and acting as the main point of communication between the board and corporate operations. As the CEO, you have the authority to make final decisions and terminate conversations when appropriate. Here is a new customer's task: {task}. To complete the task, you will provide strategic guidance and instructions to your team, ensuring that the solution meets the customer's needs and aligns with the company's objectives. |
| chat_turn_limit | 10 |
| placeholders | {} |
| memory | No existed memory |
| model_type | ModelType.GPT_45 |

[2025-31-03 20:15:44 INFO] flask app.py did not start for online log

[2025-31-03 20:15:44 INFO] System: [RolePlaying]

| Parameter | Value |
| --- | --- |
| assistant_role_name | Chief Product Officer |
| user_role_name | Chief Executive Officer |
| assistant_role_prompt | {chatdev_prompt} You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align product initiatives with company objectives. Here is a new customer's task: {task}. To complete the task, you will develop a product plan and provide detailed instructions to your team to ensure the product meets the customer's requirements. |
| user_role_prompt | {chatdev_prompt} You are the Chief Executive Officer (CEO) of ChatDev. Your primary responsibilities include making high-level decisions about policy and strategy, overseeing the overall operations and resources of ChatDev, and acting as the main point of communication between the board and corporate operations. As the CEO, you have the authority to make final decisions and terminate conversations when appropriate. Here is a new customer's task: {task}. To complete the task, you will provide strategic guidance and instructions to your team, ensuring that the solution meets the customer's needs and aligns with the company's objectives. |
| task_prompt | Create a budget tracker to monitor expenses and savings |
| with_task_specify | False |
| memory | No existed memory |
| model_type | ModelType.GPT_45 |
| background_prompt | ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'. |

[2025-31-03 20:15:44 INFO] flask app.py did not start for online log

[2025-31-03 20:15:44 INFO] Chief Executive Officer: [Start Chat]

[ChatDev is a software company powered by multiple intelligent agents, such as chief executive officer, chief human resources officer, chief product officer, chief technology officer, etc, with a multi-agent organizational structure and the mission of 'changing the digital world through programming'. You are the Chief Product Officer (CPO) of ChatDev. Your primary responsibilities include defining and overseeing the product strategy and roadmap, ensuring that products meet customer needs and align with the company's overall strategy. You work closely with cross-functional teams to guide product development from conception to launch. You report to the CEO and collaborate with other executives to align product initiatives with company objectives. Here is a new customer's task: Create a budget tracker to monitor expenses and savings. To complete the task, you will develop a product plan and provide detailed instructions to your team to ensure the product meets the customer's requirements.]

ChatDev has made products in the following form before:

Image: can present information in line chart, bar chart, flow chart, cloud chart, Gantt chart, etc.

Document: can present information via .docx files.

PowerPoint: can …

Subject outcomes

  • ChatDev/GPT-4o incorrect

Item 3 · 7% solve rate

AG2_GSM_Plus_Claude

{'content': ["Let's use Python to solve a math problem.", '', 'Query requirements:', "You should always use the 'print' function for the output and use fractions/radical forms instead of decimals.", 'You can use packages like sympy to help you.', 'You must follow the formats below to write your code:', 'python', '# your code', '', '', 'First state the key idea to solve the problem. You may choose from three ways to solve the problem:', 'Case 1: If the problem can be solved with Python code directly, please write a program to solve it. You can enumerate all possible arrangements if needed.', 'Case 2: If the problem is mostly reasoning, you can solve it by yourself directly.', 'Case 3: If the problem cannot be handled in the above two ways, please follow this process:', '1. Solve the problem step by step (do not over-divide the steps).', '2. Take out any queries that can be asked through Python (for example, any calculations or equations that can be calculated).', '3. Wait for me to give the results.', '4. Continue if you think the result is correct. If the result is invalid or unexpected, please correct your query or reasoning.', '', 'After all the queries are run and you get the answer, put the answer in boxed{}.', '', 'Problem:', 'Carver just turned 45 years old, which makes him 15 years less than ten times the age of his son. How old is his son, in years?'], 'role': 'assistant', 'name': 'mathproxyagent'} {'content': ['# Key Idea', "This is a straightforward algebra problem that can be solved by setting up an equation and solving for the son's age. 
Let's use Python and sympy to solve it directly.", '', 'python', 'from sympy import symbols, solve, Eq', '', '# Define the variable', "son_age = symbols('son_age')", '', '# Set up the equation', "# Carver's age (45) equals 10 times son's age minus 15", 'equation = Eq(45, 10 * son_age - 15)', '', "# Solve for son's age", 'solution = solve(equation, son_age)', '', '# Print the result', 'print(f"Son\'s age = {solution[0]} years")', '', '', "The equation we're solving is:", "45 = 10 × (son's age) - 15", '', 'This simplifies to:', "45 + 15 = 10 × (son's age)", "60 = 10 × (son's age)", "son's age = 6", '', "Therefore, the son's age is boxed{6 years}."], 'role': 'user', 'name': 'assistant'} {'content': ["Son's age = 6 years"], 'role': 'assistant', 'name': 'mathproxyagent'} {'content': ['This confirms our solution. ', '', "Carver is 45 years old, and we needed to find his son's age. According to the problem, Carver's age is 15 years less than ten times his son's age. This gives us the equation:", '', "45 = 10 × (son's age) - 15", '', 'Solving this equation:', "45 + 15 = 10 × (son's age)", "60 = 10 × (son's age)", "son's age = 6", '', "Therefore, Carver's son is boxed{6 years} old."], 'role': 'user', 'name': 'assistant'}

Subject outcomes

  • AG2/Claude correct
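
The AG2 trace above is rendered as a run of Python dict literals, one per message, each with `content`, `role` and `name` keys. A minimal sketch of turning such a string back into message dicts, assuming the raw trace really is stored in this form; the helper names and the abbreviated `sample` string are hypothetical.

```python
import ast


def split_dict_literals(text):
    """Yield top-level '{...}' chunks, skipping braces inside quoted strings."""
    depth, start, quote, escaped = 0, None, None, False
    for i, ch in enumerate(text):
        if quote is not None:
            # Inside a string literal: only look for its closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                yield text[start:i + 1]


def parse_messages(text):
    """Parse a run of dict literals into a list of message dicts."""
    return [ast.literal_eval(chunk) for chunk in split_dict_literals(text)]


# Abbreviated, hypothetical two-message sample in the same shape as the trace.
sample = (
    "{'content': ['Problem:', 'How old is his son?'], "
    "'role': 'assistant', 'name': 'mathproxyagent'} "
    "{'content': [\"Therefore, the son's age is boxed{6 years}.\"], "
    "'role': 'user', 'name': 'assistant'}"
)
messages = parse_messages(sample)
for m in messages:
    print(m["name"], "->", m["content"][-1])
```

Tracking quote state matters because message text such as `boxed{6 years}` contains braces that would otherwise be mistaken for dict boundaries. `ast.literal_eval` only evaluates literals, so untrusted trace text is never executed.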

Item 4 · 21% solve rate

Magentic_One_GPT4o

RUN.SH STARTING !#!# AUTOGEN_TESTBED_SETTING: [Docker] Processing /autogen_python/packages/autogen-core Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Processing /autogen_python/packages/autogen-ext Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Processing /autogen_python/packages/autogen-agentchat Installing build dependencies: started Installing build dependencies: finished with status 'done' Getting requirements to build wheel: started Getting requirements to build wheel: finished with status 'done' Preparing metadata (pyproject.toml): started Preparing metadata (pyproject.toml): finished with status 'done' Collecting tiktoken (from -r requirements.txt (line 1)) Downloading tiktoken-0.9.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (6.7 kB) Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/site-packages (from -r requirements.txt (line 2)) (6.0.2) Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.11/site-packages (from tiktoken->-r requirements.txt (line 1)) (2024.11.6) Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.11/site-packages (from tiktoken->-r requirements.txt (line 1)) (2.32.3) Requirement already satisfied: jsonref~=1.1.0 in /usr/local/lib/python3.11/site-packages (from autogen-core==0.4.9->-r requirements.txt (line 3)) (1.1.0) Requirement already satisfied: opentelemetry-api>=1.27.0 in /usr/local/lib/python3.11/site-packages 
(from autogen-core==0.4.9->-r requirements.txt (line 3)) (1.31.1) Requirement already satisfied: pillow>=11.0.0 in /usr/local/lib/python3.11/site-packages (from autogen-core==0.4.9->-r requirements.txt (line 3)) (11.1.0) Requirement already satisfied: protobuf~=5.29.3 in /usr/local/lib/python3.11/site-packages (from autogen-core==0.4.9->-r requirements.txt (line 3)) (5.29.4) Requirement already satisfied: pydantic<3.0.0,>=2.10.0 in /usr/local/lib/python3.11/site-packages (from autogen-core==0.4.9->-r requirements.txt (line 3)) (2.10.6) Requirement already satisfied: typing-extensions>=4.0.0 in /usr/local/lib/python3.11/site-packages (from autogen-core==0.4.9->-r requirements.txt (line 3)) (4.13.0) Requirement already satisfied: magika>=0.6.1rc2 in /usr/local/lib/python3.11/site-packages (from autogen-ext==0.4.9) (0.6.1) Requirement already satisfied: markitdown~=0.1.0a3 in /usr/local/lib/python3.11/site-packages (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (0.1.1) Requirement already satisfied: playwright>=1.48.0 in /usr/local/lib/python3.11/site-packages (from autogen-ext==0.4.9) (1.51.0) Collecting aiofiles (from autogen-ext==0.4.9) Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB) Collecting openai>=1.66.5 (from autogen-ext==0.4.9) Downloading openai-1.69.0-py3-none-any.whl.metadata (25 kB) Requirement already satisfied: click>=8.1.7 in /usr/local/lib/python3.11/site-packages (from magika>=0.6.1rc2->autogen-ext==0.4.9) (8.1.8) Requirement already satisfied: numpy>=1.24 in /usr/local/lib/python3.11/site-packages (from magika>=0.6.1rc2->autogen-ext==0.4.9) (2.2.4) Requirement already satisfied: onnxruntime>=1.17.0 in /usr/local/lib/python3.11/site-packages (from magika>=0.6.1rc2->autogen-ext==0.4.9) (1.21.0) Requirement already satisfied: python-dotenv>=1.0.1 in /usr/local/lib/python3.11/site-packages (from magika>=0.6.1rc2->autogen-ext==0.4.9) (1.1.0) Requirement already satisfied: beautifulsoup4 in 
/usr/local/lib/python3.11/site-packages (from markitdown~=0.1.0a3->markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (4.13.3) Requirement already satisfied: charset-normalizer in /usr/local/lib/python3.11/site-packages (from markitdown~=0.1.0a3->markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (3.4.1) Requirement already satisfied: markdownify in /usr/local/lib/python3.11/site-packages (from markitdown~=0.1.0a3->markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (1.1.0) Collecting azure-ai-documentintelligence (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading azure_ai_documentintelligence-1.0.2-py3-none-any.whl.metadata (53 kB) Collecting azure-identity (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading azure_identity-1.21.0-py3-none-any.whl.metadata (81 kB) Collecting mammoth (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading mammoth-1.9.0-py2.py3-none-any.whl.metadata (24 kB) Collecting olefile (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading olefile-0.47-py2.py3-none-any.whl.metadata (9.7 kB) Collecting openpyxl (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB) Requirement already satisfied: pandas in /usr/local/lib/python3.11/site-packages (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (2.2.3) Collecting pdfminer-six (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading pdfminer_six-20250327-py3-none-any.whl.metadata (4.1 kB) Requirement already satisfied: pydub in /usr/local/lib/python3.11/site-packages (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (0.25.1) Collecting python-pptx (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) 
Downloading python_pptx-1.0.2-py3-none-any.whl.metadata (2.5 kB) Requirement already satisfied: speechrecognition in /usr/local/lib/python3.11/site-packages (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) (3.14.2) Collecting xlrd (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading xlrd-2.0.1-py2.py3-none-any.whl.metadata (3.4 kB) Collecting youtube-transcript-api~=1.0.0 (from markitdown[all]~=0.1.0a3; extra == "magentic-one"->autogen-ext==0.4.9) Downloading youtube_transcript_api-1.0.3-py3-none-any.whl.metadata (23 kB) Collecting anyio<5,>=3.5.0 (from openai>=1.66.5->autogen-ext==0.4.9) Downloading anyio-4.9.0-py3-none-any.whl.metadata (4.7 kB) Collecting distro<2,>=1.7.0 (from openai>=1.66.5->autogen-ext==0.4.9) Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) Collecting httpx<1,>=0.23.0 (from openai>=1.66.5->autogen-ext==0.4.9) Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB) Collecting jiter<1,>=0.4.0 (from openai>=1.66.5->autogen-ext==0.4.9) Downloading jiter-0.9.0-cp311-cp311-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.metadata (5.2 kB) Collecting sniffio (from openai>=1.66.5->autogen-ext==0.4.9) Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB) Requirement already satisfied: tqdm>4 in /usr/local/lib/python3.11/site-packages (from openai>=1.66.5->autogen-ext==0.4.9) (4.67.1) Requirement already satisfied: deprecated>=1.2.6 in /usr/local/lib/python3.11/site-packages (from opentelemetry-api>=1.27.0->autogen-core==0.4.9->-r requirements.txt (line 3)) (1.2.18) Requirement already satisfied: importlib-metadata<8.7.0,>=6.0 in /usr/local/lib/python3.11/site-packages (from opentelemetry-api>=1.27.0->autogen-core==0.4.9->-r requirements.txt (line 3)) (8.6.1) Requirement already satisfied: pyee<13,>=12 in /usr/local/lib/python3.11/site-packages (from playwright>=1.48.0->autogen-ext==0.4.9) (12.1.1) Requirem …

Subject outcomes

  • Magentic/GPT-4o correct

Item 5 · 29% solve rate

AG2_MMLU_GPT4o_Mini

{'content': [' Christopher Heath Wellman defends the position that Options: (A) anti-egalitarianism is compatible with liberal cosmopolitanism. (B) closing borders to all potential immigrations is an inherently racist social policy. (C) nation-states have an obligation to open their borders to the least well-off potential immigrants. (D) nation-states have a presumptive right to close their borders to all potential immigrants.'], 'role': 'assistant', 'name': 'Agent_Verifier'} {'content': ['Hello everyone. We have assembled a great team today to answer questions and solve tasks. In attendance are:', '', 'Agent_Code_Executor: ', ' I am Agent Code Executor, specializing in solving problems by writing Python code. ', ' I have the ability to execute Python code, so feel free to reach out whenever you need assistance with Python programming.', 'Agent_Problem_Solver: ', ' I am Agent Problem Solver, and I work collaboratively with other agents to tackle various challenges.', 'Agent_Verifier: ', ' I am Agent Verifier. Please call on me when both Agent Code Executor and Agent Problem Solver have submitted their solutions, so I can verify their proposals and provide a final synthesis.'], 'role': 'user', 'name': 'chat_manager'} {'content': ['SUGGESTED NEXT SPEAKER: Agent_Code_Executor.'], 'role': 'assistant', 'name': 'Agent_Verifier'} {'content': ['python', "# First, we will create a dictionary to represent Christopher Heath Wellman's positions on the options given.", 'options = {', ' "A": "anti-egalitarianism is compatible with liberal cosmopolitanism",', ' "B": "closing borders to all potential immigrations is an inherently racist social policy",', ' "C": "nation-states have an obligation to open their borders to the least well-off potential immigrants",', ' "D": "nation-states have a presumptive right to close their borders to all potential immigrants"', '}', '', "# Next, we will outline Wellman's known positions. 
", "# For this exercise, we'll analyze the options based on a hypothetical understanding of his works,", '# where he argues for the rights of nation-states regarding immigration policies.', 'wellman_positions = {', ' "A": False, # Does not typically support anti-egalitarianism', ' "B": True, # Would argue against this view', ' "C": True, # Supports some obligation towards the least well-off', ' "D": True # Generally supports the right of nation-states to control borders', '}', '', "# Now, we will determine the correct answer(s) based on Wellman's perspective.", 'correct_answers = [choice for choice, is_correct in wellman_positions.items() if is_correct]', '', '# Finally, we will display the correct answers according to the analysis.', '_result_ = correct_answers', 'print(_result_)', ''], 'role': 'user', 'name': 'Agent_Code_Executor'} {'content': ['exitcode: 0 (execution succeeded)', "Code output: ['B', 'C', 'D']", ''], 'role': 'user', 'name': 'Agent_Problem_Solver'} {'content': ['Now that I have received the solutions from both Agent Code Executor and Agent Problem Solver, I will proceed to evaluate the outputs based on the provided information.', '', '1. From Agent Code Executor:', " - The proposed outputs are: ['B', 'C', 'D'], representing the positions attributed to Christopher Heath Wellman regarding the options.", '', '2. 
From Agent Problem Solver:', " - The execution succeeded and also produced the outputs: ['B', 'C', 'D'].", '', "Next, I will analyze the individual options against Wellman's known perspectives:", '- Option (A): "anti-egalitarianism is compatible with liberal cosmopolitanism" - Wellman does not typically support this.', '- Option (B): "closing borders to all potential immigrations is an inherently racist social policy" - Wellman would argue against this view; hence it is seen as correct.', '- Option (C): "nation-states have an obligation to open their borders to the least well-off potential immigrants" - Wellman supports some obligation towards the least well-off, making it correct.', '- Option (D): "nation-states have a presumptive right to close their borders to all potential immigrants" - Wellman generally supports the right of nation-states to control their borders, which implies this is also correct.', '', 'Now, since multiple options are suggested as correct, I must conclude which is preferred based on the context. Wellman does accept options B, C, and D.', '', 'Final Solution Declaration:', 'SOLUTION_FOUND boxed{B}.'], 'role': 'assistant', 'name': 'Agent_Verifier'}

Subject outcomes

  • AG2/GPT-4o correct
Item 6 · 86% solve rate

AG2_GSM_Plus_Claude

{'content': ["Let's use Python to solve a math problem.", '', 'Query requirements:', "You should always use the 'print' function for the output and use fractions/radical forms instead of decimals.", 'You can use packages like sympy to help you.', 'You must follow the formats below to write your code:', 'python', '# your code', '', '', 'First state the key idea to solve the problem. You may choose from three ways to solve the problem:', 'Case 1: If the problem can be solved with Python code directly, please write a program to solve it. You can enumerate all possible arrangements if needed.', 'Case 2: If the problem is mostly reasoning, you can solve it by yourself directly.', 'Case 3: If the problem cannot be handled in the above two ways, please follow this process:', '1. Solve the problem step by step (do not over-divide the steps).', '2. Take out any queries that can be asked through Python (for example, any calculations or equations that can be calculated).', '3. Wait for me to give the results.', '4. Continue if you think the result is correct. If the result is invalid or unexpected, please correct your query or reasoning.', '', 'After all the queries are run and you get the answer, put the answer in boxed{}.', '', 'Problem:', 'John decides to do several activities while out on vacation. He spends 60 hours boating and half that time swimming. He also watched 30 different shows which were 20 hours each. This was 30% of the time he spent. He spent 40% of his time sightseeing. 
How much time did he spend sightseeing?'], 'role': 'assistant', 'name': 'mathproxyagent'} {'content': ['# Key Idea', 'This problem involves calculating the total time spent on vacation, then finding 40% of that total to determine the time spent sightseeing.', '', "I'll solve this directly with Python code:", '', 'python', 'from fractions import Fraction', '', '# Given information', 'boating_hours = 60', 'swimming_hours = boating_hours / 2', 'show_hours = 30 * 20 # 30 shows, 20 hours each', '', '# The shows represent 30% of total time', 'show_percentage = Fraction(30, 100)', 'total_hours = show_hours / show_percentage', '', '# Sightseeing is 40% of total time', 'sightseeing_percentage = Fraction(40, 100)', 'sightseeing_hours = total_hours * sightseeing_percentage', '', 'print(f"Time spent boating: {boating_hours} hours")', 'print(f"Time spent swimming: {swimming_hours} hours")', 'print(f"Time spent watching shows: {show_hours} hours")', 'print(f"Total vacation time: {total_hours} hours")', 'print(f"Time spent sightseeing: {sightseeing_hours} hours")', '', '', 'The solution is boxed{800 hours}'], 'role': 'user', 'name': 'assistant'} {'content': ['Time spent boating: 60 hours', 'Time spent swimming: 30.0 hours', 'Time spent watching shows: 600 hours', 'Total vacation time: 2000 hours', 'Time spent sightseeing: 800 hours'], 'role': 'assistant', 'name': 'mathproxyagent'} {'content': ["Based on the output, we've confirmed the calculations:", '', 'John spent 60 hours boating and 30 hours swimming (half the boating time).', '', 'He watched 30 shows at 20 hours each, for a total of 600 hours watching shows. This represents 30% of his total vacation time.', '', 'Since 600 hours is 30% of the total, the total vacation time is 2000 hours.', '', 'Sightseeing took up 40% of the total vacation time, which equals 800 hours.', '', 'Therefore, John spent boxed{800 hours} sightseeing during his vacation.'], 'role': 'user', 'name': 'assistant'}

Subject outcomes

  • AG2/Claude correct

Subjects

The models, agents, and reward models evaluated.

9 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. AG2/Claude 0.206
  2. AG2/GPT-4o 0.197
  3. MetaGPT/Claude 0.174
  4. Magentic/GPT-4o 0.173
  5. MetaGPT/GPT-4o 0.154
  6. ChatDev/GPT-4o 0.154
  7. HyperAgent/Claude 0.088
  8. OpenManus/GPT-4o 0.088
  9. AppWorld/GPT-4o 0.088
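
A ranking by mean response over a sparsely observed matrix can be sketched as follows. This is a minimal illustration, not the page's actual pipeline: it assumes the mean is taken over observed cells only, and the subject names and cell values in `responses` are hypothetical toy data, not MAST results.

```python
from typing import Optional

Row = list[Optional[int]]  # 1 or 0 for an observed cell, None for unobserved


def mean_response(row: Row) -> Optional[float]:
    """Average the observed cells of one subject's row; None if nothing was observed."""
    observed = [cell for cell in row if cell is not None]
    if not observed:
        return None
    return sum(observed) / len(observed)


def rank_subjects(matrix: dict[str, Row]) -> list[tuple[str, float]]:
    """Return (subject, mean response) pairs, highest mean first.

    Subjects with no observed cells are left out rather than scored as 0.
    """
    scored = [(name, mean_response(row)) for name, row in matrix.items()]
    kept = [(name, score) for name, score in scored if score is not None]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy matrix: three subjects, six items, mostly unobserved.
responses: dict[str, Row] = {
    "subject_a": [1, None, 0, None, 1, None],
    "subject_b": [None, 0, None, 0, None, 1],
    "subject_c": [None, None, None, None, None, None],
}

ranking = rank_subjects(responses)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name} {score:.3f}")
```

Averaging over observed cells only keeps unobserved items from counting against a subject, which matters when only about 11% of cells are evaluated.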