Coding & Software

GSO

GSO: challenging software-optimization tasks for evaluating SWE-agents. Each task asks an agent to reproduce a real performance optimization in a repository, graded by the Opt@K metric (the agent patch must meet the optimization target under a performance script plus correctness tests). We store the real per-task descriptions and per-(model, instance) Opt@1 pass/fail outcomes from the official gso-experiments reports.
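As a rough illustration of how a single Opt@1 outcome could be derived from the ingredients named above, the sketch below combines a correctness flag with a speedup comparison. The function name, its parameters, and the 0.95 tolerance are assumptions made for this example; the official GSO harness defines the actual criterion.

```python
def opt_at_1(base_time, agent_time, target_time, tests_passed, tolerance=0.95):
    """Return 1 if a single agent patch counts as a success, else 0.

    Hypothetical rule (an assumption, not the official definition): the patch
    must pass the correctness tests and reach at least `tolerance` times the
    speedup achieved by the ground-truth commit.
    """
    if not tests_passed:
        return 0
    agent_speedup = base_time / agent_time
    target_speedup = base_time / target_time
    return int(agent_speedup >= tolerance * target_speedup)


# Illustrative timings in seconds (made up for the example).
print(opt_at_1(10.0, 2.1, 2.0, tests_passed=True))   # close to the target
print(opt_at_1(10.0, 6.0, 2.0, tests_passed=True))   # too slow
print(opt_at_1(10.0, 1.0, 2.0, tests_passed=False))  # fast but incorrect
```

Under this reading, each cell of the response matrix is one such 0/1 value for a (model, instance) pair.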

102 items

21 subjects

95% observed

Subject type: Agent

License: MIT

Domain: software_engineering

Modality: text


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 21 subjects × 102 items, 95% of cells evaluated.

GSO response matrix: AI models (rows) against items (columns).
Legend: Correct (1) · Incorrect (0) · Unobserved
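The layout described above can be sketched on a toy matrix. The subject and item names are hypothetical placeholders, and sorting by mean observed response is an assumption, since the page does not state the exact ordering rule.

```python
# Toy response matrix: 1 = correct, 0 = incorrect, None = unobserved.
# Subject and item names are hypothetical placeholders.
responses = {
    "model_a": {"item_1": 1, "item_2": 1, "item_3": 0},
    "model_b": {"item_1": 1, "item_2": 0, "item_3": None},
    "model_c": {"item_1": 0, "item_2": 0, "item_3": 0},
}
items = ["item_1", "item_2", "item_3"]


def mean_observed(values):
    """Mean over observed cells only; None when nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else None


cells = [responses[m][i] for m in responses for i in items]
observed_fraction = sum(v is not None for v in cells) / len(cells)

# Rows: strongest subject first. Columns: easiest item first.
row_order = sorted(
    responses, key=lambda m: mean_observed(responses[m].values()), reverse=True
)
col_order = sorted(
    items, key=lambda i: mean_observed(responses[m][i] for m in responses), reverse=True
)

print(f"observed: {observed_fraction:.0%}")
for m in row_order:
    print(m, [responses[m][i] for i in col_order])
```

Unobserved cells are excluded from both means, so a subject is not penalized for items it was never run on.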

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate
answer: diff --git a/src/datasets/arrow_dataset.py b/src/datasets/arrow_dataset.py index 22150f4ce..508f0fbf1 100644 --- a/src/datasets/arrow_dataset.py +++ b/src/datasets/arrow_dataset.py @@ -76,6 +76,7 @@ f

GSO software-optimization task

Repository: huggingface/datasets
Target API / function: Dataset._select_contiguous

Optimization goal (ground-truth commit message)

Optimize contiguous shard and select (#4466)

optimize contiguous shard and select
minor
support iterators (and therefore generators)
comments + docstrings

Performance test script (prob_script)

import os
import json
import random
import timeit
from datasets import Dataset


def setup() -> Dataset:
    random.seed(42)
    N = 200000
    vocabulary = ['lorem', 'ipsum', 'dolor', 'sit', 'amet', 'consectetur', 'adipiscing', 'elit',
                  'vestibulum', 'ante', 'primis', 'in', 'faucibus', 'orci', 'luctus', 'ultrices',
                  'nulla', 'facilisi', 'curabitur', 'sagittis', 'mattis', 'dictum']
    texts = [' '.join(random.choices(vocabulary, k=random.randint(5, 15))) for _ in range(N)]
    data = {'id': list(range(N)), 'text': texts, 'value': [random.uniform(0, 1) for _ in range(N)]}
    dataset = Dataset.from_dict(data)
    return dataset


def experiment(dataset: Dataset) -> dict:
    total_rows = len(dataset)
    start_index = int(0.1 * total_rows)
    selected_length = int(0.5 * total_rows)
    if start_index + selected_length > total_rows:
        selected_length = total_rows - start_index
    contiguous_range = range(start_index, start_index + selected_length)
    selected_dataset = dataset.select(contiguous_range)
    values = selected_dataset['value']
    total_value = sum(values)
    min_value = min(values)
    max_value = max(values)
    result = {
        'selected_rows': len(selected_dataset),
        'start_index': start_index,
        'end_index': start_index + selected_length - 1,
        'first_id': selected_dataset[0]['id'],
        'first_text': selected_dataset[0]['text'],
        'last_id': selected_dataset[-1]['id'],
        'last_text': selected_dataset[-1]['text'],
        'total_value': total_value,
        'min_value': min_value,
        'max_value': max_value,
    }
    return result


def store_result(result: dict, file_name: str) -> None:
    with open(file_name, 'w') as f:
        json.dump(result, f)


def load_result(file_name: str) -> dict:
    with open(file_name, 'r') as f:
        result = json.load(f)
    return result


def check_equivalence(reference_result: dict, current_result: dict) -> None:
    # Outer f-strings use double quotes so the nested single-quoted keys also parse before Python 3.12.
    assert reference_result['selected_rows'] == current_result['selected_rows'], f"Selected rows mismatch: {reference_result['selected_rows']} != {current_result['selected_rows']}"
    assert reference_result['start_index'] == current_result['start_index'], f"Start index mismatch: {reference_result['start_index']} != {current_result['start_index']}"
    assert reference_result['end_index'] == current_result['end_index'], f"End index mismatch: {reference_result['end_index']} != {current_result['end_index']}"
    assert reference_result['first_id'] == current_result['first_id'], f"First id mismatch: {reference_result['first_id']} != {current_result['first_id']}"
    assert reference_result['first_text'] == current_result['first_text'], f"First text mismatch: {reference_result['first_text']} != {current_result['first_text']}"
    assert reference_result['last_id'] == current_result['last_id'], f"Last id mismatch: {reference_result['last_id']} != {current_result['last_id']}"
    assert reference_result['last_text'] == current_result['last_text'], f"Last text mismatch: {reference_result['last_text']} != {current_result['last_text']}"
    tol = 1e-06
    assert abs(reference_result['total_value'] - current_result['total_value']) < tol, f"Total value mismatch: {reference_result['total_value']} != {current_result['total_value']}"
    assert abs(reference_result['min_value'] - current_result['min_value']) < tol, f"Min value mismatch: {reference_result['min_value']} != {current_result['min_value']}"
    assert abs(reference_result['max_value'] - current_result['max_value']) < tol, f"Max value mismatch: {reference_result['max_value']} != {current_result['max_value']}"


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str='') -> float:
    dataset = setup()
    execution_time, result = timeit.timeit(lambda: experiment(dataset), number=1)
    file_name = f'{prefix}_result.json' if prefix else 'reference_result.json'
    if reference:
        store_result(result, file_name)
    if eqcheck:
        ref_result = load_result(file_name)
        check_equivalence(ref_result, result)
    return execution_time
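Note that `run_test` unpacks two values from `timeit.timeit`, which returns a single float in stock Python. The script therefore presumably relies on the harness replacing `timeit.template` so that the timed statement's return value is passed back as well. A minimal sketch of such an override follows, assuming this is how the harness is set up; the template text is not taken from this page.

```python
import timeit

# Assumption: the harness installs a template like this before run_test is called.
# timeit.template is a module-level attribute of CPython's timeit, not a documented API.
timeit.template = """
def inner(_it, _timer{init}):
    {setup}
    _t0 = _timer()
    for _i in _it:
        retval = {stmt}
    _t1 = _timer()
    return _t1 - _t0, retval
"""


def experiment():
    # Stand-in workload for the example.
    return sum(range(1000))


# With the patched template, timeit returns (elapsed_seconds, return_value).
execution_time, result = timeit.timeit(lambda: experiment(), number=1)
print(result)  # 499500
```

Without an override of this kind, the tuple unpacking in `run_test` would raise a `TypeError`.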

Subject outcomes

claude-opus-4.5 incorrect
glm-4.5 incorrect
claude-opus-4.6 incorrect

Item 2 · 0% solve rate
answer: diff --git a/.github/workflows/build-and-release.yaml b/.github/workflows/build-and-release.yaml index bfa97de..7307c85 100644 --- a/.github/workflows/build-and-release.yaml +++ b/.github/workflows/bu

GSO software-optimization task

Repository: abetlen/llama-cpp-python
Target API / function: llama_cpp.gen_b

Optimization goal (ground-truth commit message)

feat: Update llama.cpp

Performance test script (prob_script)

import argparse
import json
import math
import os
import timeit
import time
import random
import numpy as np
from llama_cpp import Llama
import huggingface_hub

os.environ['HF_HUB_ENABLE_HF_TRANSFER'] = '1'


def download_model():
    repo_id = 'Qwen/Qwen2-7B-Instruct-GGUF'
    model_path = None
    model_name = 'qwen2-7b-instruct-q4_0.gguf'
    potential_path = os.path.join(os.getcwd(), 'models', model_name)
    if os.path.exists(potential_path):
        print(f'Found local model at {potential_path}')
        model_path = potential_path
    if model_path is None:
        try:
            os.makedirs('models', exist_ok=True)
            print('Downloading model...')
            huggingface_hub.hf_hub_download(repo_id=repo_id, filename=model_name, local_dir='models')
            model_path = potential_path
            print(f'Downloaded model to {model_path}')
        except Exception as e:
            raise RuntimeError(f'Error downloading model: {e}')
    return model_path


def setup():
    model_path = download_model()
    llm = Llama(model_path=model_path, n_ctx=4096, seed=42, verbose=False)
    sharegpt_path = './sharegpt_dataset.json'
    if not os.path.exists(sharegpt_path):
        try:
            import requests
            print('Downloading ShareGPT dataset...')
            url = 'https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json'
            response = requests.get(url)
            with open(sharegpt_path, 'wb') as f:
                f.write(response.content)
            print('Download complete!')
        except Exception as e:
            raise RuntimeError(f'Error downloading ShareGPT dataset: {e}. Please download it manually.')
    with open(sharegpt_path, 'r', encoding='utf-8') as f:
        sharegpt_data = json.load(f)
    sharegpt_data = [entry for entry in sharegpt_data if 'conversations' in entry and len(entry['conversations']) >= 2]
    random.seed(42)
    random.shuffle(sharegpt_data)
    num_samples = min(10, len(sharegpt_data))
    test_prompts = []
    for i in range(num_samples):
        entry = sharegpt_data[i]
        prompt = entry['conversations'][0]['value']
        completion = entry['conversations'][1]['value']
        prompt_tokens = len(prompt.split())
        completion_tokens = len(completion.split())
        test_prompts.append({
            'prompt': prompt,
            'expected_completion': completion,
            'prompt_tokens': prompt_tokens,
            'completion_tokens': completion_tokens,
            'max_tokens': min(300, completion_tokens + 50),
            'temperature': 0,
        })
    return (llm, test_prompts)


def experiment(setup_result):
    llm, test_prompts = setup_result
    results = {
        'successful_requests': 0,
        'prompt_results': [],
        'metrics': {
            'total_input_tokens': 0,
            'total_output_tokens': 0,
            'total_tokens': 0,
            'total_inference_time': 0,
            'request_throughput': 0,
            'output_token_throughput': 0,
            'total_token_throughput': 0,
            'tpot_s': [],
            'e2e_latency_s': [],
        },
    }
    for idx, prompt_data in enumerate(test_prompts):
        prompt = prompt_data['prompt']
        max_tokens = prompt_data['max_tokens']
        temperature = prompt_data['temperature']
        start_time = time.time()
        completion = llm(prompt, max_tokens=max_tokens, temperature=temperature, echo=False)
        print(f'Prompt {idx} completed.')
        end_time = time.time()
        total_time = end_time - start_time
        prompt_tokens = completion['usage']['prompt_tokens']
        completion_tokens = completion['usage']['completion_tokens']
        total_tokens = completion['usage']['total_tokens']
        if completion_tokens > 0:
            results['successful_requests'] += 1
            tpot = total_time / max(1, completion_tokens) if completion_tokens > 1 else 0
            results['metrics']['total_input_tokens'] += prompt_tokens
            results['metrics']['total_output_tokens'] += completion_tokens
            results['metrics']['total_tokens'] += total_tokens
            results['metrics']['total_inference_time'] += total_time
            results['metrics']['tpot_s'].append(tpot)
            results['metrics']['e2e_latency_s'].append(total_time)
            results['prompt_results'].append({
                'prompt_idx': idx,
                'completion': completion['choices'][0]['text'],
                'prompt_tokens': prompt_tokens,
                'completion_tokens': completion_tokens,
                'total_tokens': total_tokens,
                'inference_time_s': total_time,
                'tpot_s': tpot,
                'tokens_per_second': completion_tokens / total_time,
            })
    if results['successful_requests'] > 0:
        total_time = results['metrics']['total_inference_time']
        results['metrics']['request_throughput'] = results['successful_requests'] / total_time
        results['metrics']['output_token_throughput'] = results['metrics']['total_output_tokens'] / total_time
        results['metrics']['total_token_throughput'] = results['metrics']['total_tokens'] / total_time
        for metric in ['tpot_s', 'e2e_latency_s']:
            values = results['metrics'][metric]
            results['metrics'][f'mean_{metric}'] = np.mean(values)
            results['metrics'][f'median_{metric}'] = np.median(values)
            results['metrics'][f'p90_{metric}'] = np.percentile(values, 90)
            results['metrics'][f'p95_{metric}'] = np.percentile(values, 95)
            results['metrics'][f'p99_{metric}'] = np.percentile(values, 99)
    return results


def store_result(result: dict, filename: str):
    os.makedirs(os.path.dirname(os.path.abspath(filename)), exist_ok=True)
    with open(filename, 'w') as f:
        json.dump(result, f, indent=2)


def load_result(filename: str):
    if not os.path.exists(filename):
        raise FileNotFoundError(f'Reference file {filename} does not exist.')
    with open(filename, 'r') as f:
        result_dict = json.load(f)
    return result_dict


def check_equivalence(reference, current):
    ref_results = {pr['prompt_idx']: pr['completion'] for pr in reference['prompt_results']}
    curr_results = {pr['prompt_idx']: pr['completion'] for pr in current['prompt_results']}
    for idx, ref_completion in ref_results.items():
        if idx not in curr_results:
            raise AssertionError(f'Prompt {idx} was unsuccesful!')
        curr_completion = curr_results[idx]
        if ref_completion != curr_completion:
            raise AssertionError(f'Prompt {idx} completions mismatch:\nReference: {ref_completion}\nCurrent: {curr_completion}')
    print('Equivalence check passed!')


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str=''):
    setup_result = setup()
    result = experiment(setup_result)
    e2e_latency = result['metrics'].get('mean_e2e_latency_s', 0)
    filename = f'{prefix}_result.json' if prefix else 'reference_result.json'
    if reference:
        store_result(result, filename)
    elif eqcheck:
        ref_result = load_result(filename)
        check_equivalence(ref_result, result)
    return e2e_latency

Subject outcomes

claude-opus-4.6-high incorrect
claude-opus-4.6 incorrect
claude-opus-4.7 incorrect

Item 3 · 10% solve rate
answer: diff --git a/benchmarks/benchmarks/bench_core.py b/benchmarks/benchmarks/bench_core.py index fb988fabf1..632318d610 100644 --- a/benchmarks/benchmarks/bench_core.py +++ b/benchmarks/benchmarks/bench_c

GSO software-optimization task

Repository: numpy/numpy
Target API / function: numpy.char.startswith

Optimization goal (ground-truth commit message)

Merge pull request #24947 from lysnikolaou/string-ufuncs-startswith-endswith

Performance test script (prob_script)

import numpy as np
import random
import string
import json
import timeit


def setup():
    random.seed(12345)
    np.random.seed(12345)
    ascii_pool = string.ascii_letters + string.digits + string.punctuation + ' '
    unicode_chars = 'αБ語😊👍𐍈'
    num_large = 10
    large_strings = []
    for i in range(num_large):
        length = random.randint(50000, 100000)
        if i < 3:
            s = 'EDGE' + ''.join(random.choices(ascii_pool, k=length - 4))
        else:
            s = ''.join(random.choices(ascii_pool, k=length))
        large_strings.append(s)
    arr_small_big = np.array(large_strings, dtype=f'<U{max((len(s) for s in large_strings))}')
    large_prefix = 'EDGE'
    num_small = 50000
    small_strings = []
    for _ in range(num_small):
        length = random.randint(1, 10)
        s = ''.join(random.choices(ascii_pool, k=length))
        small_strings.append(s)
    arr_big_small = np.array(small_strings, dtype=f'<U10')
    big_prefix = large_strings[0]
    arr_empty = np.array([], dtype='<U1')
    arr_strided = arr_big_small.reshape(25000, 2)[:, 0]
    strided_start = -3
    strided_end = 8
    num_uni_small = 2000
    unicode_small = []
    for i in range(num_uni_small):
        length = random.randint(1, 4)
        s = ''.join(random.choices(unicode_chars, k=length))
        unicode_small.append(s)
    unicode_small = np.array(unicode_small, dtype=f'<U4')
    unicode_prefix = unicode_chars[0]
    num_uni_large = 5
    unicode_large = []
    for i in range(num_uni_large):
        length = random.randint(5000, 10000)
        body = ''.join(random.choices(unicode_chars, k=length))
        if i < 2:
            s = unicode_prefix + body
        else:
            s = body
        unicode_large.append(s)
    unicode_large = np.array(unicode_large, dtype=f'<U{max((len(s) for s in unicode_large))}')
    return {
        'arr_small_big': arr_small_big,
        'large_prefix': large_prefix,
        'arr_big_small': arr_big_small,
        'big_prefix': big_prefix,
        'arr_empty': arr_empty,
        'arr_strided': arr_strided,
        'strided_start': strided_start,
        'strided_end': strided_end,
        'unicode_small': unicode_small,
        'unicode_prefix': unicode_prefix,
        'unicode_large': unicode_large,
    }


def experiment(data):
    m1 = np.char.startswith(data['arr_small_big'], data['large_prefix'])
    count_small_big = int(np.sum(m1))
    total_small_big = data['arr_small_big'].shape[0]
    m2 = np.char.startswith(data['arr_big_small'], data['big_prefix'])
    count_big_small = int(np.sum(m2))
    total_big_small = data['arr_big_small'].shape[0]
    m3 = np.char.startswith(data['arr_empty'], 'anything')
    count_empty = int(np.sum(m3))
    total_empty = data['arr_empty'].shape[0]
    m4 = np.char.startswith(data['arr_strided'], data['large_prefix'][:2], start=data['strided_start'], end=data['strided_end'])
    count_strided = int(np.sum(m4))
    total_strided = data['arr_strided'].shape[0]
    m5 = np.char.startswith(data['unicode_small'], data['unicode_prefix'])
    count_unicode_small = int(np.sum(m5))
    total_unicode_small = data['unicode_small'].shape[0]
    m6 = np.char.startswith(data['unicode_large'], data['unicode_prefix'])
    count_unicode_large = int(np.sum(m6))
    total_unicode_large = data['unicode_large'].shape[0]
    return {
        'count_small_big': count_small_big,
        'total_small_big': total_small_big,
        'count_big_small': count_big_small,
        'total_big_small': total_big_small,
        'count_empty': count_empty,
        'total_empty': total_empty,
        'count_strided': count_strided,
        'total_strided': total_strided,
        'count_unicode_small': count_unicode_small,
        'total_unicode_small': total_unicode_small,
        'count_unicode_large': count_unicode_large,
        'total_unicode_large': total_unicode_large,
    }


def store_result(result, filename):
    with open(filename, 'w') as f:
        json.dump(result, f)


def load_result(filename):
    with open(filename, 'r') as f:
        return json.load(f)


def check_equivalence(reference, current):
    for key in reference:
        ref_val = int(reference[key])
        cur_val = int(current.get(key, None))
        assert ref_val == cur_val, f"Mismatch for '{key}': reference={ref_val}, current={cur_val}"


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str='') -> float:
    data = setup()
    execution_time, current_result = timeit.timeit(lambda: experiment(data), number=1)
    filename = f'{prefix}_result.json' if prefix else 'reference_result.json'
    if reference:
        store_result(current_result, filename)
    if eqcheck:
        ref = load_result(filename)
        check_equivalence(ref, current_result)
    return execution_time

Subject outcomes

claude-opus-4.5 correct
claude-opus-4.7 correct
claude-opus-4.6 incorrect

Item 4 · 18% solve rate
answer: diff --git a/numpy/core/code_generators/generate_umath.py b/numpy/core/code_generators/generate_umath.py index ff32cf1b51..b45f89344d 100644 --- a/numpy/core/code_generators/generate_umath.py +++ b/nu

GSO software-optimization task

Repository: numpy/numpy
Target API / function: np.divide.at

Optimization goal (ground-truth commit message)

MAINT, BUG: fixes from review and testing

Performance test script (prob_script)

import argparse
import os
import numpy as np
import timeit


def setup():
    np.random.seed(42)
    a = np.random.rand(1000000).astype(np.float64) + 1.0
    indices = np.random.randint(0, a.size, size=500000, dtype=np.intp)
    divisors = np.random.rand(500000).astype(np.float64) + 0.5
    return {'a': a, 'indices': indices, 'divisors': divisors}


def experiment(workload):
    a = workload['a']
    indices = workload['indices']
    divisors = workload['divisors']
    np.divide.at(a, indices, divisors)
    return a


def store_result(result, filename):
    np.save(filename, result)


def load_result(filename):
    return np.load(filename, allow_pickle=False)


def check_equivalence(reference_result, current_result):
    assert reference_result.shape == current_result.shape, 'Shape mismatch between reference and current results.'
    if not np.allclose(reference_result, current_result, rtol=1e-05, atol=1e-08):
        raise AssertionError('Numerical values of the arrays differ beyond acceptable tolerance.')


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str='') -> float:
    workload = setup()
    stmt = lambda: experiment(workload)
    execution_time, result = timeit.timeit(stmt, number=1)
    filename = f'{prefix}_result.npy' if prefix else 'reference_result.npy'
    if reference:
        store_result(result, filename)
    if eqcheck:
        reference_result = load_result(filename)
        check_equivalence(reference_result, result)
    return execution_time
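`np.divide.at` performs an unbuffered in-place division, so an index that appears several times is divided several times. The pure-Python sketch below mimics that behaviour on a list and contrasts it with a buffered update in which the last write wins. Both helper names are hypothetical, and the code only models the semantics, not NumPy's implementation.

```python
def divide_at(values, indices, divisors):
    """Unbuffered update: each (index, divisor) pair is applied in turn."""
    for i, d in zip(indices, divisors):
        values[i] /= d


def divide_buffered(values, indices, divisors):
    """Buffered update: all quotients are computed first, then written back."""
    quotients = [values[i] / d for i, d in zip(indices, divisors)]
    for i, q in zip(indices, quotients):
        values[i] = q


a = [8.0, 8.0, 8.0]
b = list(a)
divide_at(a, [0, 0, 2], [2.0, 2.0, 4.0])        # index 0 is divided twice
divide_buffered(b, [0, 0, 2], [2.0, 2.0, 4.0])  # index 0 keeps only the last write
print(a)  # [2.0, 8.0, 2.0]
print(b)  # [4.0, 8.0, 2.0]
```

The workload in `setup()` draws 500,000 indices from 1,000,000 positions, so repeated indices are expected and the unbuffered semantics matter for correctness.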

Subject outcomes

claude-opus-4.7 correct
gpt-5.2 correct
claude-opus-4.6 incorrect

Item 5 · 35% solve rate
answer: diff --git a/doc/source/whatsnew/v2.2.0.rst b/doc/source/whatsnew/v2.2.0.rst index dce776755a..16972ec195 100644 --- a/doc/source/whatsnew/v2.2.0.rst +++ b/doc/source/whatsnew/v2.2.0.rst @@ -428,6 +42

GSO software-optimization task

Repository: pandas-dev/pandas
Target API / function: DataFrameGroupBy.nunique

Optimization goal (ground-truth commit message)

PERF: groupby.nunique (#56061)

PERF: groupby.nunique
Remove fastpath
Remove fastpath
int32 fixup
fixup

Performance test script (prob_script)

import pandas as pd
import numpy as np
import timeit
import json


def setup():
    np.random.seed(42)
    unique_values = np.arange(30000, dtype=np.int64)
    data = np.random.choice(unique_values, size=1000000)
    df = pd.DataFrame({'A': data, 'B': data % 100})
    return df


def experiment(df):
    result = df.groupby('A').nunique()
    return result


def store_result(result, filename):
    result_dict = {'index': result.index.tolist(), 'B': result['B'].tolist()}
    with open(filename, 'w') as f:
        json.dump(result_dict, f)


def load_result(filename):
    with open(filename, 'r') as f:
        result_dict = json.load(f)
    index = pd.Index(result_dict['index'])
    data = {'B': result_dict['B']}
    return pd.DataFrame(data, index=index)


def check_equivalence(reference_result, current_result):
    assert reference_result.equals(current_result), 'Results do not match!'


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str='') -> float:
    df = setup()
    execution_time, result = timeit.timeit(lambda: experiment(df), number=1)
    if reference:
        store_result(result, f'{prefix}_result.json')
    if eqcheck:
        reference_result = load_result(f'{prefix}_result.json')
        check_equivalence(reference_result, result)
    return execution_time

Subject outcomes

claude-opus-4.5 correct
claude-opus-4.7 correct
claude-opus-4.6 incorrect

Item 6 · 76% solve rate
answer: diff --git a/tornado/speedups.c b/tornado/speedups.c index c59bda00..b714268a 100644 --- a/tornado/speedups.c +++ b/tornado/speedups.c @@ -1,9 +1,12 @@ #define PY_SSIZE_T_CLEAN #include <Python.h> +

GSO software-optimization task

Repository: tornadoweb/tornado
Target API / function: tornado.websocket.WebSocketClientConnection.write_message

Optimization goal (ground-truth commit message)

Merge pull request #2024 from pjknkda/master

websocket: optimize C websocket_mask function

Performance test script (prob_script)

import argparse
import json
import os
import random
import timeit
import requests
import gzip
from tornado.speedups import websocket_mask


class DummyIOStream:

    def __init__(self):
        self.buffer = b''

    def write(self, data):
        self.buffer += data


class DummyWebSocketClientConnection:

    def __init__(self, stream):
        self.stream = stream
        self.is_client = True

    def write_message(self, message, binary=True):
        mask = b'abcd'
        if binary:
            masked = websocket_mask(mask, message)
        else:
            msg_bytes = message.encode('utf-8')
            masked = websocket_mask(mask, msg_bytes)
        self.stream.write(masked)
        return len(masked)


def setup():
    random.seed(42)
    data_size = 100 * 1024
    message = os.urandom(data_size)
    stream = DummyIOStream()
    connection = DummyWebSocketClientConnection(stream)
    return {'connection': connection, 'message': message}


def experiment(connection, message):
    connection.stream.buffer = b''
    written_length = connection.write_message(message, binary=True)
    buffer_length = len(connection.stream.buffer)
    return {'written_length': written_length, 'buffer_length': buffer_length}


def store_result(result, filename):
    try:
        with open(filename, 'w') as f:
            json.dump(result, f)
    except Exception as e:
        raise RuntimeError(f'Error storing result to {filename}: {e}')


def load_result(filename):
    try:
        with open(filename, 'r') as f:
            result = json.load(f)
        return result
    except Exception as e:
        raise RuntimeError(f'Error loading result from {filename}: {e}')


def check_equivalence(reference_result, current_result):
    ref_keys = set(reference_result.keys())
    cur_keys = set(current_result.keys())
    assert ref_keys == cur_keys, f'Result keys mismatch: {ref_keys} != {cur_keys}'
    for key in ref_keys:
        ref_val = reference_result[key]
        cur_val = current_result[key]
        assert ref_val == cur_val, f"Mismatch in '{key}': reference {ref_val} vs current {cur_val}"


def run_test(eqcheck: bool=False, reference: bool=False, prefix: str='') -> float:
    data_dict = setup()
    connection = data_dict['connection']
    message = data_dict['message']
    ref_filename = f'{prefix}_result.json' if prefix else 'reference_result.json'
    experiment(connection, message)
    number = 1
    total_time, result = timeit.timeit(stmt=lambda: experiment(connection, message), number=1, timer=timeit.default_timer)
    average_time = total_time / number
    if reference:
        store_result(result, ref_filename)
    if eqcheck:
        reference_result = load_result(ref_filename)
        check_equivalence(reference_result, result)
    return average_time
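For orientation, WebSocket masking (RFC 6455) XORs each payload byte with the mask byte at position `i mod 4`. The pure-Python function below is a reference sketch of that operation. `websocket_mask_reference` is a hypothetical name and is not tornado's implementation, which this task optimizes in C.

```python
def websocket_mask_reference(mask: bytes, data: bytes) -> bytes:
    """XOR every payload byte with the 4-byte mask, cycling through the mask."""
    if len(mask) != 4:
        raise ValueError("mask must be exactly 4 bytes")
    return bytes(b ^ mask[i % 4] for i, b in enumerate(data))


masked = websocket_mask_reference(b"abcd", b"\x00" * 5)
print(masked)  # b'abcda'

# Masking is its own inverse: applying the same mask twice restores the payload.
assert websocket_mask_reference(b"abcd", masked) == b"\x00" * 5
```

A byte-at-a-time loop like this is the slow baseline; a faster version would typically process several bytes per step, which is the kind of change the commit message describes.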

Subject outcomes

claude-opus-4.5 correct
gemini-3-pro correct
o3 incorrect

Subjects

The models, agents, and reward models evaluated.

21 subjects, ranked by mean response (accuracy) across this benchmark's items.

Rank  Subject               Mean accuracy
1     claude-opus-4.7       0.441
2     claude-opus-4.6-high  0.424
3     gpt-5.5-xhigh         0.402
4     gpt-5.4-xhigh         0.348
5     claude-opus-4.6       0.333
6     gpt-5.2               0.322
7     claude-opus-4.5       0.278
8     gpt-5.4-high          0.257
9     gemini-3.1-pro        0.245
10    gemini-3-pro          0.186
11    claude-sonnet-4.5     0.152
12    gpt-5.1               0.151
13    o3                    0.125
14    gemini-3-flash        0.103
15    claude-opus-4         0.071
16    gpt-5                 0.07
17    qwen3-coder           0.05
18    kimi-k2               0.05
19    claude-sonnet-4       0.042
20    Gemini-2.5-Pro        0.04
21    glm-4.5               0.032
