Reasoning & Knowledge

WorldCentralBanks

WorldCentralBanks contains 25,000 annotated sentences from 26 central banks, labeled for Stance Detection (Hawkish/Dovish/Neutral/Irrelevant), Temporal Classification, and Uncertainty (Certainty) Estimation. This build holds the zero-shot and few-shot LLM predictions released in the upstream repo, one per (model, sentence) pair. Each prediction is scored as binary correctness (predicted label vs. the gold annotation), and the gold annotation is stored as correct_answer.
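The binary scoring described above can be sketched as follows. This is a minimal illustration, not the upstream build script: the function name `score_prediction`, the case-insensitive comparison, and the rule that a missing prediction counts as incorrect are all assumptions.

```python
from typing import Optional


def score_prediction(predicted: Optional[str], correct_answer: str) -> int:
    """Return 1 if the predicted label matches the gold annotation, else 0."""
    # Assumption: a missing or unparseable prediction is scored as incorrect.
    if predicted is None:
        return 0
    # Assumption: labels are compared case-insensitively after trimming whitespace.
    return int(predicted.strip().lower() == correct_answer.strip().lower())


print(score_prediction("Hawkish", "hawkish"))
print(score_prediction("dovish", "hawkish"))
```

Under these assumptions the first call scores a match and the second a mismatch; a stricter build might require exact string equality instead.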

9,575 items

10 subjects

100% observed

CC-BY-NC-SA-4.0 license

finance domain

text modality


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 9,575 items, 100% of cells evaluated.
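The page does not say how the ordering is computed. One plausible approach, shown here purely as an assumption, is to sort rows by each model's mean accuracy and columns by each item's solve rate, both in descending order. The function `order_matrix` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def order_matrix(matrix: Dict[str, List[int]]) -> Tuple[List[str], List[int]]:
    """Return (models, strongest first) and (item indices, easiest first)."""
    models = list(matrix)
    n_items = len(next(iter(matrix.values())))
    # Row marginal: share of items each model answered correctly.
    accuracy = {m: sum(matrix[m]) / n_items for m in models}
    # Column marginal: share of models that answered each item correctly.
    solve_rate = [sum(matrix[m][j] for m in models) / len(models) for j in range(n_items)]
    row_order = sorted(models, key=lambda m: -accuracy[m])
    col_order = sorted(range(n_items), key=lambda j: -solve_rate[j])
    return row_order, col_order


# Toy 3 x 4 response matrix (1 = correct, 0 = incorrect); not real benchmark data.
toy = {
    "model_a": [1, 0, 1, 1],
    "model_b": [0, 0, 1, 0],
    "model_c": [1, 0, 1, 0],
}
rows, cols = order_matrix(toy)
print(rows, cols)
```

Sorting on the marginals is the simplest choice; it needs a fully observed matrix, which the 100% coverage reported here would satisfy.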

WorldCentralBanks response matrix: AI models (rows) against items (columns). Legend: Correct (1) · Incorrect (0) · Unobserved
Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.
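How the sample is drawn is not stated. As a rough sketch, assuming each item already has a solve rate, a spread can be obtained by ranking items from hardest to easiest and taking evenly spaced ranks. The helper `spread_sample` and the example rates are hypothetical.

```python
from typing import List


def spread_sample(solve_rates: List[float], k: int) -> List[int]:
    """Pick k item indices evenly spaced along the hardest-to-easiest ranking."""
    if not 1 <= k <= len(solve_rates):
        raise ValueError("k must be between 1 and the number of items")
    # Rank item indices by ascending solve rate (hardest first).
    ranked = sorted(range(len(solve_rates)), key=lambda j: solve_rates[j])
    if k == 1:
        return [ranked[len(ranked) // 2]]
    # Evenly spaced positions from the first rank to the last.
    step = (len(ranked) - 1) / (k - 1)
    return [ranked[round(i * step)] for i in range(k)]


# Illustrative solve rates for seven items; not taken from the real matrix.
rates = [0.75, 0.0, 0.47, 0.61, 0.39, 0.72, 0.54]
picked = spread_sample(rates, 3)
print(picked)
```

The result always includes the hardest and the easiest item when k is at least 2, which matches the idea of covering the full difficulty range.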

Item 1 · 0% solve rate · answer: hawkish

it is, however, recognised that the downside risks to growth remain high

How subjects answered

Llama-3-70b-chat incorrect
ModelResponse(id='nscCssn-4yUbBN-93a2d192afe02d0a', created=1746307397, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The sentence contains the phrase 'remain high', which implies a degree of uncertainty about the extent of the downside risks to growth."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=39, prompt_tokens=446, total_tokens=485, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])
FinMA-7B-full incorrect

gemini-2.0-flash incorrect

{
  "label": "uncertain",
  "justification": "The phrase \"downside risks to growth remain high\" indicates a possibility of negative outcomes, reflecting uncertainty about future economic performance."
}

DeepSeek-V3 incorrect
ModelResponse(id='nqNqY6M-zqrih-931733956f66c943', created=1744843405, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "uncertain",\n"justification": "The use of \'remain high\' indicates a lack of definitive certainty about the extent of the downside risks."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=40, prompt_tokens=151, total_tokens=191, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
Llama-4-Maverick incorrect
ModelResponse(id='nq4jkzR-zqrih-9303aafdcb79adc3', created=1744638581, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, we need to analyze its content and tone to determine whether it conveys certainty or uncertainty.\n\nThe sentence is: "it is, however, recognised that the downside risks to growth remain high."\n\nLet's break it down:\n\n1. Label Assignment: The sentence implies that there are acknowledged risks associated with growth, which suggests a level of uncertainty about future outcomes.\n\n2. …
Llama-4-Scout incorrect
ModelResponse(id='nq3zcm4-3NKUce-9302dd7cc95a44e1', created=1744630163, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "uncertain",\n "justification": "The sentence expresses a possibility of downside risks to growth remaining high, indicated by the word \'recognised\' and \'remain high\', which implies uncertainty."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=54, prompt_tokens=161, total_tokens=215, …
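The raw answers above arrive in several shapes: a bare JSON object, JSON preceded by a `json` marker or an introductory sentence, and free-form prose cut off before any JSON appears. The sketch below shows one tolerant way to pull out the `label` field with a regular expression. It is an assumption for illustration, not the parser used upstream, and `extract_label` is a hypothetical name.

```python
import re
from typing import Optional

# Matches "label": "<value>" anywhere in the text, tolerating surrounding prose.
LABEL_PATTERN = re.compile(r'"label"\s*:\s*"([^"]+)"')


def extract_label(content: str) -> Optional[str]:
    """Return the lower-cased value of the first "label" field, or None if absent."""
    match = LABEL_PATTERN.search(content)
    return match.group(1).strip().lower() if match else None


# Shortened stand-ins for the response shapes shown above.
samples = [
    '{\n"label": "uncertain",\n"justification": "..."\n}',
    'json\n{\n"label": "uncertain",\n"justification": "..."\n}\n',
    'Here is the classification:\n\n\n{\n "label": "uncertain"\n}\n',
    'To classify the given sentence, we need to analyze its content and tone',
]
labels = [extract_label(s) for s in samples]
print(labels)
```

A response with no `label` field yields `None`, which a scorer can then treat as incorrect or as unobserved, depending on the convention chosen.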

Item 2 · 39% solve rate · answer: hawkish

while pivot risk on the inflation side has significantly decreased, pivot risk on the financial stability side stemming from the expansion of housing price growth has increased.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nscqSaa-4yUbBN-93a37fbe8dc005a2', created=1746314531, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence uses phrases like 'has significantly decreased' and 'has increased', which indicate a clear and definite assessment, making it certain."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=41, prompt_tokens=450, total_tokens=491, completion_tokens_details=None, prompt_tokens_details=None), …
gpt-4.1-2025-04-14 correct
ModelResponse(id='chatcmpl-BN2R8XsfZzHwbhMCsgb21Rl9nkwFR', created=1744830610, model='gpt-4.1-2025-04-14', object='chat.completion', system_fingerprint='fp_cf18407276', choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The sentence discusses potential risks ('pivot risk') on both inflation and financial stability sides, indicating uncertainty about future outcomes."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]))], usage=Usage(completion_tokens=38, prompt_tokens=173, total_tokens=211, …
Llama-4-Scout correct
ModelResponse(id='nq5n6vm-2j9zxn-9304ca6fdabdb06c', created=1744650355, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "uncertain",\n "justification": "The sentence expresses a conditional or subjective assessment of risks, using phrases like \'has significantly decreased\' and \'has increased\', indicating uncertainty."\n}\n\n\nI classified the sentence as "uncertain" because it involves subjective evaluations of risk trends ("significantly decreased" and "has increased") rather than stating …

gemini-2.0-flash incorrect

{
  "label": "uncertain",
  "justification": "The sentence discusses increasing and decreasing risks, indicating a lack of clear direction and therefore uncertainty about future policy decisions."
}

Qwen2.5-72B-Instruct-Turbo incorrect
ModelResponse(id='npmNpEP-4yUbBN-92f0ee6f29d14520', created=1744442114, model='together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n "label": "uncertain",\n "justification": "The sentence indicates a shift in risks, suggesting that while inflation risks have decreased, financial stability risks have increased, which implies an ongoing and evolving uncertainty in the economic environment."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=51, prompt_tokens=197, total_tokens=248, …
Llama-4-Maverick incorrect
ModelResponse(id='nq5o6Ti-zqrih-9304cf29e843e5b4', created=1744650548, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and implications.\n\nThe sentence is: "while pivot risk on the inflation side has significantly decreased, pivot risk on the financial stability side stemming from the expansion of housing price growth has increased."\n\n1. Label Assignment: The sentence discusses changes in "pivot risk" related to inflation and financial stability. It indicates a shift from one …

Item 3 · 47% solve rate · answer: dovish

all the board members agreed that, as the downward risks in the local economy and particularly in inflation had intensified, the valid options for this meeting should be to either hold the mpr at 2.5% with an expansionary bias, or reduce it by 25 bp, also with an expansionary bias.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsdYidn-4yUbBN-93a4445d8e3881c1', created=1746322585, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states the board members' agreement on the valid options for the meeting without using speculative language, indicating a certain outcome."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=39, prompt_tokens=520, total_tokens=559, completion_tokens_details=None, prompt_tokens_details=None), …

gemini-2.0-flash correct

{
  "label": "certain",
  "justification": "The sentence clearly states the options considered by the board members, indicating a certain understanding of their potential actions."
}

Qwen2.5-72B-Instruct-Turbo correct
ModelResponse(id='npq4yPX-4yUbBN-92f4ea92ecb62040', created=1744483899, model='together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n "label": "uncertain",\n "justification": "The sentence indicates that the board members are considering two possible actions, which introduces an element of uncertainty about the final decision."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=41, prompt_tokens=233, total_tokens=274, completion_tokens_details=None, prompt_tokens_details=None), …
DeepSeek-V3 correct
ModelResponse(id='nqRjFmL-zqrih-931a52819ee3206c', created=1744876129, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "certain",\n"justification": "The sentence clearly states the agreed-upon options for the monetary policy meeting without any hedging or ambiguous language."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=39, prompt_tokens=201, total_tokens=240, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, …
FinMA-7B-full incorrect
Llama-4-Maverick incorrect
ModelResponse(id='nq75jpu-28Eivz-930632baaa96452d', created=1744665112, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and context.\n\nThe sentence discusses the Central Bank of Chile's monetary policy meeting and the considerations or options the board members agreed upon regarding the Monetary Policy Rate (MPR). The key elements to focus on are the tone, the language used, and whether the sentence conveys a definitive action or a speculative/conditional one.\n\nThe sentence states …

Item 4 · 54% solve rate · answer: neutral

global economic activity continues to recover from the covid19 pandemic, although growth remains uneven across regions, as the pace of vaccinations varies across countries.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsdgEHY-4yUbBN-93a468251b31e610', created=1746324050, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The sentence expresses ambiguity about the pace of global economic recovery, using phrases such as 'remains uneven' and 'varies across countries', indicating uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=45, prompt_tokens=502, total_tokens=547, completion_tokens_details=None, …
FinMA-7B-full correct
Llama-4-Scout correct
ModelResponse(id='nq7p2ec-2j9zxn-9306fc2d0e9fb05b', created=1744673364, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "uncertain",\n "justification": "The sentence contains words and phrases that indicate uncertainty, such as \'although\', \'remains uneven\', and \'varies across countries\', which suggest a lack of absolute confidence in the statement."\n}\n\n\nI classified the sentence as "uncertain" because it contains words and phrases that indicate a degree of uncertainty or variability, which …
gpt-4.1-mini incorrect
ModelResponse(id='chatcmpl-BMms7Tli068w54xf8hNw6NdpR4Ii7', created=1744770779, model='gpt-4.1-mini-2025-04-14', object='chat.completion', system_fingerprint='fp_79b79be41f', choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states observed facts about global economic recovery and vaccination pace without expressing doubt or prediction."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]))], usage=Usage(completion_tokens=32, prompt_tokens=173, total_tokens=205, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, …
DeepSeek-V3 incorrect
ModelResponse(id='nqRxitw-zqrih-931a92a80b7bff28', created=1744878757, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The use of 'although' and 'varies' introduces conditional and variable elements, indicating uncertainty about the uniformity of global economic recovery."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=42, prompt_tokens=166, total_tokens=208, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), …
Llama-4-Maverick incorrect
ModelResponse(id='nq7qgL1-zqrih-930704091f26b04f', created=1744673686, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and tone.\n\nThe sentence is: "global economic activity continues to recover from the covid19 pandemic, although growth remains uneven across regions, as the pace of vaccinations varies across countries."\n\n1. Label Assignment: The sentence contains elements that indicate a level of uncertainty. While it states that global economic activity "continues to recover," …

Item 5 · 58% solve rate · answer: dovish

some sectors faced growing difficulties in sourcing inputs, which was disrupting production.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nseEViN-4yUbBN-93a501d7f8e822ef', created=1746330346, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states a fact about some sectors facing difficulties in sourcing inputs and its impact on production, without expressing any doubt or uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=40, prompt_tokens=447, total_tokens=487, completion_tokens_details=None, prompt_tokens_details=None), …
FinMA-7B-full incorrect

gemini-2.0-flash incorrect

{
  "label": "certain",
  "justification": "The sentence describes a factual observation ("faced growing difficulties") about specific sectors, indicating certainty."
}

DeepSeek-V3 incorrect
ModelResponse(id='nqTeDUc-zqrih-931c631029c9c9b7', created=1744897779, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "uncertain",\n"justification": "The use of \'some sectors\' and \'growing difficulties\' introduces vagueness, making the statement non-definitive."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=45, prompt_tokens=151, total_tokens=196, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
Llama-4-Maverick incorrect
ModelResponse(id='nq93WU5-zqrih-930850ab388de647', created=1744687310, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze it step by step.\n\nThe sentence is: "some sectors faced growing difficulties in sourcing inputs, which was disrupting production."\n\n1. Understanding the task: We need to determine if the sentence is certain or uncertain regarding the European Central Bank's monetary policy meeting. Although the sentence itself doesn't directly mention the ECB or its meeting, we can infer …
Llama-4-Scout incorrect
ModelResponse(id='nq92ed8-6UHjtw-93084c934d9881a3', created=1744687143, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "certain",\n "justification": "The sentence reports a specific fact about sectors facing difficulties in sourcing inputs, using the past tense and a definitive tone, indicating a certain event."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=51, prompt_tokens=162, total_tokens=213, …

Item 6 · 61% solve rate · answer: hawkish

in contrast, the halifax house price index had risen by 2.2% in january.

How subjects answered

Llama-3-70b-chat incorrect
ModelResponse(id='nscZDXh-4yUbBN-93a332750efdd6f7', created=1746311365, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence reports a specific and exact percentage change in the Halifax house price index, indicating a reliable and predictable outcome."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=37, prompt_tokens=457, total_tokens=494, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])
FinMA-7B-full incorrect

gemini-2.0-flash incorrect

{
  "label": "certain",
  "justification": "The sentence reports a specific, quantifiable event (the Halifax house price index increase) with a stated value and timeframe, indicating certainty."
}

DeepSeek-V3 incorrect
ModelResponse(id='nqPYkoZ-zqrih-9317f7ee2a3fd6df', created=1744851448, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "certain",\n"justification": "The sentence states a specific and verifiable fact (2.2% rise in January) without any hedging or ambiguity."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=42, prompt_tokens=158, total_tokens=200, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
Llama-4-Maverick incorrect
ModelResponse(id='nq5Nf2L-zqrih-93045abf5e26c974', created=1744645780, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, we need to analyze its content and tone to determine if it conveys certainty or uncertainty regarding the Bank of England's monetary policy meeting. The sentence provided is: "in contrast, the halifax house price index had risen by 2.2% in january."\n\n## Step 1: Understand the sentence\nThe sentence reports a specific fact about the Halifax House Price Index, stating it "had risen by 2.2% in …
Llama-4-Scout incorrect
ModelResponse(id='nq5MUP8-2j9zxn-9304551d2bba38a9', created=1744645549, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "certain",\n "justification": "The sentence reports a specific, factual data point about the Halifax house price index, indicating a clear and definite change."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=46, prompt_tokens=167, total_tokens=213, completion_tokens_details=None, …

Item 7 · 64% solve rate · answer: dovish

that said, there might still be a perception that the euro-area authorities were now more committed to shifting the burden of future bank restructurings from governments to uninsured private creditors.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nscX34u-4yUbBN-93a328120b6c0594', created=1746310940, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The sentence expresses a possibility or perception, using the phrase 'might still be', which indicates ambiguity and uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=36, prompt_tokens=472, total_tokens=508, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])
FinMA-7B-full incorrect

gemini-2.0-flash incorrect

{
  "label": "uncertain",
  "justification": "The sentence uses hedging language like 'might still be a perception,' indicating uncertainty rather than a definitive statement of fact."
}

DeepSeek-V3 incorrect
ModelResponse(id='nqPUycx-zqrih-9317e5edc8e672ec', created=1744850710, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "uncertain",\n"justification": "The use of \'might\' and \'perception\' indicates a lack of definitive certainty in the statement."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=40, prompt_tokens=174, total_tokens=214, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
Llama-4-Maverick incorrect
ModelResponse(id='nq5DrUW-zqrih-930430d1f823e604', created=1744644063, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and tone.\n\nThe sentence is: "that said, there might still be a perception that the euro-area authorities were now more committed to shifting the burden of future bank restructurings from governments to uninsured private creditors."\n\n1. Label Assignment: The sentence contains the word "might," which is a modal verb often used to express possibility or …
Llama-4-Scout incorrect
ModelResponse(id='nq5CvRQ-2j9zxn-93042c61a9b56ffe', created=1744643881, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "uncertain",\n "justification": "The sentence contains words like \'might\' and \'perception\', indicating a degree of uncertainty or subjectivity."\n}\n\n\nThe presence of "might" and "perception" suggests that the sentence is expressing a possibility or an interpretation, rather than a statement of fact, which is why I classified it as uncertain.', role='assistant', …

Item 8 · 67% solve rate · answer: neutral

he/she argued that it will be important to monitor that the effects of this stance are fully transmitted to the economy.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsd5nrZ-4yUbBN-93a3c4192a4fd6e4', created=1746317331, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "uncertain",\n"justification": "The sentence expresses uncertainty about the transmission of effects to the economy, using the phrase 'it will be important to monitor', which implies a lack of confidence in the outcome."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=48, prompt_tokens=426, total_tokens=474, completion_tokens_details=None, …
FinMA-7B-full correct

gemini-2.0-flash correct

{
  "label": "uncertain",
  "justification": "The sentence expresses a need to monitor the *effects* of a policy, indicating uncertainty about whether those effects will be fully realized or not."
}

DeepSeek-V3 correct
ModelResponse(id='nqQReh5-zqrih-9318ea5d6985d6a3', created=1744861378, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "uncertain",\n"justification": "The use of \'argued\' and \'will be\' indicates a subjective opinion rather than a definitive statement."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=41, prompt_tokens=160, total_tokens=201, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
Llama-4-Scout correct
ModelResponse(id='nq5vsPi-3NKUce-9304f4329a41d6b3', created=1744652065, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='\n{\n "label": "certain",\n "justification": "The sentence expresses a clear intention or necessity regarding monitoring the effects of a policy stance, indicating a certain action or concern."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=42, prompt_tokens=170, total_tokens=212, completion_tokens_details=None, …
Llama-4-Maverick incorrect
ModelResponse(id='nq5wdHx-zqrih-9304f7cdab1c056f', created=1744652213, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and implications.\n\nThe sentence is: "he/she argued that it will be important to monitor that the effects of this stance are fully transmitted to the economy."\n\n1. Label Assignment: The sentence implies a future action ("it will be important to monitor") based on an argument made by someone ("he/she argued"). The use of "argued" indicates a subjective …

Item 9 · 72% solve rate · answer: hawkish

the m3 aggregate, which was still increasing at a moderate rate at our last assessment in june, has begun to advance more rapidly

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsf9don-4yUbBN-93a5fef8ef0fe617', created=1746340714, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states a clear and definitive observation about the m3 aggregate's growth rate without any ambiguity or uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=36, prompt_tokens=500, total_tokens=536, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])

gemini-2.0-flash correct

{
  "label": "certain",
  "justification": "The sentence describes observed past trends in the M3 aggregate, indicating a statement of fact rather than speculation or uncertainty."
}

DeepSeek-V3 correct
ModelResponse(id='nqmA3hk-zqrih-932f49f3598b0607', created=1745095980, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "certain",\n"justification": "The sentence states a clear observation about the m3 aggregate\'s growth rate without any hedging or speculative language."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=39, prompt_tokens=164, total_tokens=203, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, …
Qwen2.5-72B-Instruct-Turbo incorrect
ModelResponse(id='npyGnJs-4yUbBN-92fdc51429893937', created=1744576735, model='together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n "label": "uncertain",\n "justification": "The sentence indicates a change in the trend of the M3 aggregate from a moderate rate to a more rapid advance, which suggests an observation of recent data and implies a degree of uncertainty about the sustainability or future direction of this trend."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=63, …
Llama-4-Maverick incorrect
ModelResponse(id='nq9hWxg-zqrih-930905b7fdbce7fc', created=1744694726, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze its content and tone.\n\nThe sentence is: "the m3 aggregate, which was still increasing at a moderate rate at our last assessment in june, has begun to advance more rapidly."\n\n1. Label Assignment: The sentence reports a factual change in the M3 aggregate's rate of increase, indicating a shift from a moderate rate to a more rapid one. This is a statement of fact based on …
Llama-4-Scout incorrect
ModelResponse(id='nq9goxk-3NKUce-9309025fb98302a2', created=1744694588, model='together_ai/meta-llama/Llama-4-Scout-17B-16E-Instruct', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='Here is the classification:\n\n\n{\n "label": "certain",\n "justification": "The sentence describes a specific observed change in the M3 aggregate, indicating a clear and definite shift from a moderate to a more rapid rate of increase."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=53, prompt_tokens=174, total_tokens=227, …

Item 10 · 75% solve rate · answer: neutral

until the time of the meeting, the effects on local financial markets had been limited.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsdYZUq-4yUbBN-93a4439d18fe81c1', created=1746322554, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states a fact about the effects on local financial markets without using speculative language."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=31, prompt_tokens=474, total_tokens=505, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])
FinMA-7B-full correct
gpt-4o-2024-08-06 correct
ModelResponse(id='chatcmpl-BJ6G8AtR1abxC2L8oVH4D5hZQWD3q', created=1743891992, model='gpt-4o-2024-08-06', object='chat.completion', system_fingerprint='fp_898ac29719', choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n "label": "certain",\n "justification": "The sentence states a definitive observation about the limited effects on local financial markets up to the time of the meeting."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]))], usage=Usage(completion_tokens=42, prompt_tokens=162, total_tokens=204, …
gpt-4.1-2025-04-14 incorrect
ModelResponse(id='chatcmpl-BN5Z8zdNMXIL1bivsChwOPpxKLTRA', created=1744842638, model='gpt-4.1-2025-04-14', object='chat.completion', system_fingerprint='fp_beec22d258', choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence states a clear and definite observation about limited effects on local financial markets up to the meeting time."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}, annotations=[]))], usage=Usage(completion_tokens=35, prompt_tokens=162, total_tokens=197, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=0, …

gemini-2.0-flash incorrect

{
  "label": "certain",
  "justification": "The sentence describes a specific observation about the limited effects on financial markets up to a defined point in time (the meeting), which is presented as a factual statement."
}

Llama-4-Maverick incorrect
ModelResponse(id='nq75ggB-zqrih-930632789afddd21', created=1744665102, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze it step by step.\n\nThe sentence is: "until the time of the meeting, the effects on local financial markets had been limited."\n\n1. Understanding the task: We need to determine if the sentence conveys a certain or uncertain statement regarding the Central Bank of Chile's monetary policy meeting.\n\n2. Analyzing the sentence: The sentence discusses the effects on local …

Item 11 · 81% solve rate · answer: hawkish

new construction and real estate listings are lagging demand, which is adding pressure to prices.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nscKxZ8-4yUbBN-93a2f34d4c026fb0', created=1746308779, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence presents a definitive statement about the current market situation, using the phrase 'are lagging' to describe the relationship between new construction and real estate listings, and demand."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=49, prompt_tokens=434, total_tokens=483, …

gemini-2.0-flash correct

{
  "label": "certain",
  "justification": "The sentence presents a factual observation about the current state of the housing market."
}

DeepSeek-V3 correct
ModelResponse(id='nqP9rTJ-zqrih-93178ad58f0a7003', created=1744846979, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "certain",\n"justification": "The sentence states a clear, factual observation about the relationship between demand and prices in the real estate market."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=39, prompt_tokens=155, total_tokens=194, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, …
FinMA-7B-full incorrect
Qwen2.5-72B-Instruct-Turbo incorrect
ModelResponse(id='npiM5RN-4yUbBN-92eda9504a51828c', created=1744407826, model='together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n "label": "uncertain",\n "justification": "The sentence indicates a trend or condition that is subject to change and does not provide a definitive or guaranteed outcome, thus reflecting uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=43, prompt_tokens=186, total_tokens=229, completion_tokens_details=None, prompt_tokens_details=None), …
Llama-4-Maverick incorrect
ModelResponse(id='nq4qAtb-zqrih-9303c4c5df4953f9', created=1744639637, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, let's analyze it step by step.\n\nThe sentence is: "new construction and real estate listings are lagging demand, which is adding pressure to prices."\n\n1. Understanding the task: We need to determine if the statement is certain or uncertain regarding the Bank of Canada's monetary policy meeting context.\n\n2. Analyzing the sentence: The sentence makes a factual statement about the …

Item 12 · 85% solve rate · answer: dovish

financing conditions in the residential mortgage market remained accommodative over the intermeeting period.

How subjects answered

Llama-3-70b-chat correct
ModelResponse(id='nsePwEo-4yUbBN-93a52ecfab60d6b0', created=1746332188, model='together_ai/meta-llama/Llama-3-70b-chat-hf', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n"label": "certain",\n"justification": "The sentence definitively states that financing conditions remained accommodative, without any ambiguity or uncertainty."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=32, prompt_tokens=465, total_tokens=497, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])

gemini-2.0-flash correct

{
  "label": "certain",
  "justification": "The sentence states a factual observation about financing conditions, indicating a definitive assessment rather than speculation or doubt."
}

Qwen2.5-72B-Instruct-Turbo correct
ModelResponse(id='npuJ8wK-4yUbBN-92f97a68ec7cc932', created=1744531733, model='together_ai/Qwen/Qwen2.5-72B-Instruct-Turbo', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='{\n "label": "certain",\n "justification": "The sentence indicates a clear and specific state of financing conditions without any ambiguity or conditional language."\n}', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=35, prompt_tokens=184, total_tokens=219, completion_tokens_details=None, prompt_tokens_details=None), service_tier=None, prompt=[])
DeepSeek-V3 correct
ModelResponse(id='nqUFsyw-zqrih-931d0cf1ed8ad690', created=1744904737, model='together_ai/deepseek-ai/DeepSeek-V3', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='stop', index=0, message=Message(content='json\n{\n"label": "certain",\n"justification": "The sentence states a factual observation about financing conditions without any hedging or qualifiers."\n}\n', role='assistant', tool_calls=None, function_call=None, provider_specific_fields={'refusal': None}))], usage=Usage(completion_tokens=34, prompt_tokens=154, total_tokens=188, completion_tokens_details=None, prompt_tokens_details=None, cached_tokens=0), service_tier=None, prompt=[])
FinMA-7B-full incorrect
Llama-4-Maverick incorrect
ModelResponse(id='nq99ywK-28Eivz-93086f81beffe5e9', created=1744688573, model='together_ai/meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8', object='chat.completion', system_fingerprint=None, choices=[Choices(finish_reason='length', index=0, message=Message(content='To classify the given sentence, we need to analyze its content and tone to determine whether it conveys certainty or uncertainty regarding the Federal Reserve System's monetary policy meeting.\n\nThe sentence provided is: "financing conditions in the residential mortgage market remained accommodative over the intermeeting period."\n\nLet's break down the task:\n\n1. Label Assignment: The sentence is stating a fact about …
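The raw answers shown above carry the predicted label inside a small JSON object, sometimes preceded by a `json` tag or an introductory sentence. Responses that stop with finish_reason='length' can end before any JSON object appears. The following is a minimal sketch of how such message content might be reduced to a label. The helper name `extract_label` and the regex approach are assumptions for illustration, not the upstream build script's method.

```python
import json
import re
from typing import Optional


def extract_label(content: str) -> Optional[str]:
    """Return the lowercased "label" value from message content, or None.

    Hypothetical helper: assumes the answer is a flat JSON object such as
    {"label": "...", "justification": "..."} somewhere in the text.
    """
    # Grab the first {...} span; the sampled answers use flat objects.
    match = re.search(r"\{.*?\}", content, flags=re.DOTALL)
    if match is None:
        return None  # e.g. a response cut off before any JSON was produced
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # malformed or truncated object
    label = payload.get("label")
    return label.strip().lower() if isinstance(label, str) else None


# Content strings shaped like the samples above (shortened for illustration).
samples = [
    'json\n{\n"label": "certain",\n"justification": "A factual observation."\n}\n',
    'Here is the classification:\n\n\n{\n "label": "certain",\n "justification": "A clear shift."\n}\n',
    "To classify the given sentence, let's analyze it step by step.",
]
for text in samples:
    print(extract_label(text))
```

Returning None instead of raising keeps unparseable responses visible as a separate case, so a scorer can decide whether to count them as incorrect.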

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response (accuracy) across this benchmark's items.

1. Llama-3-70b-chat · 0.7571
2. gpt-4.1-2025-04-14 · 0.7159
3. DeepSeek-V3 · 0.7064
4. Llama-4-Scout · 0.6986
5. gpt-4.1-mini · 0.6945
6. Qwen2.5-72B-Instruct-Turbo · 0.6761
7. gemini-2.0-flash · 0.6706
8. gpt-4o-2024-08-06 · 0.6686
9. FinMA-7B-full · 0.4653
10. Llama-4-Maverick · 0.0206
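The ranking above is each subject's mean binary correctness score. A minimal sketch of that computation on a toy response matrix follows. The subject names and 0/1 values are made up for illustration and are not benchmark data, and `rank_subjects` is a hypothetical helper.

```python
def rank_subjects(matrix: dict[str, list[int]]) -> list[tuple[str, float]]:
    """Rank subjects by mean correctness (1 = correct, 0 = incorrect)."""
    means = {name: sum(scores) / len(scores) for name, scores in matrix.items()}
    # Highest mean accuracy first.
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


# Toy matrix: rows are subjects, columns are items (illustrative values only).
toy_matrix = {
    "model_a": [1, 1, 0, 1],
    "model_b": [0, 1, 0, 0],
    "model_c": [1, 1, 1, 1],
}

for rank, (name, accuracy) in enumerate(rank_subjects(toy_matrix), start=1):
    print(f"{rank}. {name} · {accuracy:.4f}")
```

Because every cell in this build is observed, a plain mean over all items suffices. A sparse matrix would need to skip unobserved cells before averaging.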
