Mathematics

AlgoTune

AlgoTune: can LLM agents speed up general-purpose numerical programs? 154 coding tasks, each with a reference solver from a popular library, an input generator and a verifier. The agent must produce a correct but faster implementation. Binary response: pass iff the validated speedup over the reference is >= 1.0 (matched-or-beat the reference and was correct).

154items

18subjects

100%observed

MITlicense

software_engineeringdomain

mathematicsdomain

textmodality

Original source Paper Build script ← All benchmarks

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 18 subjects × 154 items, 100% of cells evaluated.

AlgoTune response matrix: AI models (rows) against items (columns) — Correct (1)Incorrect (0)Unobserved
Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 10% solve rate

Vectorized Newton-Raphson

Simultaneously find roots for n instances of a parameterized function f(x, a0, a1, ...) using a single, vectorized call to the Newton-Raphson solver. The input consists of arrays for the initial guesses x0 and the parameters a0, a1 that vary for each instance. The function f and its derivative f' operate element-wise on these arrays. The specific function is f(x, a0..a5) = a1 - a2*(exp((a0+x*a3)/a5) - 1) - (a0+x*a3)/a4 - x.

Input: A dictionary with keys:

"x0": A list of n initial guess values.
"a0": A list of n corresponding 'a0' parameter values.
"a1": A list of n corresponding 'a1' parameter values. (Other parameters a2, a3, a4, a5 are fixed for the task).

Example input: { "x0": [7.0, 7.5], "a0": [5.32, 5.48], "a1": [7.0, 8.0] }

Output: A dictionary with key:

"roots": A numpy array of shape (n,) representing the root found for each corresponding input instance (x0_i, a0_i, a1_i). Use NaN if the method fails to converge for any instance (often, failure might result in NaNs for all if the vectorized call fails globally).

Example output: { "roots": [7.1234, 7.6543] }

Category: numerical_methods

Subject outcomes

claude-opus-4-1-20250805 incorrect
claude-opus-4-20250514 incorrect
claude-opus-4.5 incorrect
gpt-oss-120b incorrect
o4-mini incorrect
qwen3-coder incorrect

Item 233% solve rate

Count Riemann Zeta Zeros task

Given a positive real number t, the task is to count the zeros of the Riemann Zeta function in the critical strip 0 < real(z) < 1 with imag(z) <= t. The output should be an integer giving the total number of zeros in the aforementioned region. It is OK to use arbitrary precision or extended precision arithmetic for intermediate calculations.

Input: A dictionary with keys:

"t": A double precision floating point number giving the max imaginary part in the strip where zeros are to be counted.

Example input: {"t": 2228.0}

Output: A dictionary with keys:

"result": An integer giving the number of zeros of the Riemann zeta function in the critical strip 0 < real(z) < 1 with imag(z) <= t.

Example Output: {"result": 1729}

Category: misc

Subject outcomes

claude-sonnet-4-5-20250929 correct
claude-opus-4.5 correct
gpt-5.4 correct
gpt-oss-120b incorrect
o4-mini incorrect
qwen3-coder incorrect

Item 339% solve rate

FIRLS

Given a filter length n and desired frequency band edges, the task is to design a Finite Impulse Response (FIR) filter using the least-squares method. The desired frequency response is specified as a piecewise constant function with a passband and a stopband defined by the edges. The filter is optimized such that the error between the actual and desired frequency responses is minimized in a least-squares sense. The output is a set of FIR filter coefficients that best approximate the target frequency response.

Input: A tuple consisting of an integer n (filter length) and a pair of frequency band edges.

Example input: (101, (0.1, 0.9))

Output: A 1D array of real numbers representing the FIR filter coefficients.

Example output: [0.0023, 0.0051, 0.0074, 0.0089, 0.0081, 0.0049, -0.0007, -0.0083, -0.0150, -0.0172, -0.0118, 0.0000, 0.0143, 0.0248, 0.0248, 0.0118, -0.0078, -0.0224, -0.0277, -0.0202, -0.0007]

Category: signal_processing

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4-20250514 correct
glm-4.5 correct
gpt-oss-120b incorrect
o4-mini incorrect
qwen3-coder incorrect

Item 456% solve rate

Edge Expansion

Calculate the edge expansion for a given subset of nodes S in a directed graph G=(V, E). Edge expansion measures the ratio of edges leaving the set S to the size of the set S. It is formally defined as:

expansion(S) = |E(S, V-S)| / |S|

where E(S, V-S) is the set of edges (u, v) such that u is in S and v is in V-S (the set of nodes not in S). If S is empty or S contains all nodes in V, the edge expansion is defined as 0.

Input: A dictionary containing two keys:

"adjacency_list": An array representing the graph's directed adjacency structure. adjacency_list[i] contains a sorted list of integer indices corresponding to the nodes that node i points to. Nodes are implicitly indexed from 0 to n-1.
"nodes_S": A sorted list of unique integer node indices belonging to the subset S.

Example input: { "adjacency_list": [ [1, 2], [2], [0] ], "nodes_S": [0, 1] }

Output: A dictionary containing a single key "edge_expansion". The value is a floating-point number representing the calculated edge expansion of the set S in the graph.

Example output: { "edge_expansion": 1.0 }

Category: graph

Subject outcomes

claude-opus-4-20250514 correct
claude-opus-4.5 correct
gpt-5.4 correct
gpt-5-mini incorrect
gpt-5 incorrect
qwen3-coder incorrect

Item 561% solve rate

LTI System Simulation

Given the transfer function coefficients (numerator num, denominator den) of a continuous-time Linear Time-Invariant (LTI) system, an input signal u defined over a time vector t, compute the system's output signal yout at the specified time points. The system is represented by H(s) = num(s) / den(s).

Input: A dictionary with keys:

"num": A list of numbers representing the numerator polynomial coefficients (highest power first).
"den": A list of numbers representing the denominator polynomial coefficients (highest power first).
"u": A list of numbers representing the input signal values at each time point in t.
"t": A list of numbers representing the time points (must be monotonically increasing).

Example input: { "num": [1.0], "den": [1.0, 0.2, 1.0], "t": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5], "u": [0.0, 0.1, 0.15, 0.1, 0.05, 0.0] }

Output: A dictionary with key:

"yout": A list of numbers representing the computed output signal values corresponding to the time points in t.

Example output: { "yout": [0.0, 0.00048, 0.0028, 0.0065, 0.0098, 0.0115] }

Category: control

Subject outcomes

claude-opus-4.6 correct
claude-opus-4.5 correct
o4-mini correct
gpt-5-pro (medium) incorrect
gpt-5-mini incorrect
gpt-oss-120b incorrect

Item 672% solve rate

Least Squares Task:

Given a set of data points and a model function, the task is to find the parameters of the model that best fit the data by minimizing the sum of the squares of the residuals between the model predictions and the observed data.

Input: A dictionary with keys:

"n": An integer representing the number of data points.
"x_data": A list of n numbers representing the x coordinates of the data points.
"y_data": A list of n numbers representing the y coordinates of the data points.
"model_type": A string indicating the type of model to fit ("polynomial", "exponential", "logarithmic", "sigmoid", or "sinusoidal").
"degree": An integer representing the degree of the polynomial (only present if model_type is "polynomial").

Example input: { "n": 50, "x_data": [0.0, 0.2, 0.4, ..., 9.8], "y_data": [2.1, 3.5, 4.8, ..., 95.3], "model_type": "polynomial", "degree": 2, }

Output: A dictionary with keys:

"params": A list of numbers representing the estimated parameters of the model.

Example output: { "params": [1.98, 3.02, 0.99] }

Category: statistics

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4-20250514 correct
claude-opus-4.6 correct
claude-sonnet-4-5-20250929 incorrect
gpt-5-mini incorrect
o4-mini incorrect

Item 772% solve rate

Set Cover with Conflicts

Given a number of objects, a collection of sets—including sets that contain only a single object—and a list of conflicts, the task is to find the minimum number of sets required to cover all objects while ensuring that no conflicting sets are selected simultaneously. Each object appears in at least one set, and the collection always includes the trivial solution where each object is covered individually by a separate set, i.e., [0], [1], [2], .... These trivial sets are guaranteed not to appear in any conflict.

Input: A tuple containing:

An integer n, the number of objects (indexed from 0).
A list of sets, where each set is represented as a list of object indices. This list always includes the trivial sets [0], [1], ..., [n-1], each covering a single object.
A list of conflicts, where each conflict is a list of indices referring to sets that cannot be selected together. The trivial sets are never part of any conflict.

Example Input:
(
5,
[
[0], [1], [2], [3], [4],
[0, 1],
[1, 2],
[2, 3],
[0, 3],
[1, 3]
],
[
[5, 6],
[7, 8]
]
)

Output: A list of selected set indices forming a valid cover with minimal size.

Example Output:
[5, 7, 4]

Category: discrete_optimization

Subject outcomes

claude-opus-4-20250514 correct
claude-opus-4.5 correct
qwen3-coder correct
claude-opus-4.6 incorrect
gpt-5-mini incorrect
gpt-oss-120b incorrect

Item 878% solve rate

Toeplitz solver task:

Given a square Toeplitz matrix T and a vector b, the task is to solve the linear system Tx = b. This task uses the Levinson-Durbin algorithm (scipy.linalg.solve_toeplitz) that is faster than normal linear system solve by taking advantage of the Toeplitz structure.

Input: A dictionary with keys:

"c": A list of n numbers representing the first column of the Toeplitz matrix.
"r": A list of n numbers representing the first row of the Toepltiz matrix.
"b": A list of n numbers representing the right-hand side vector b.

Example input: { "c": [1., 2., 9., 4.], "r": [1., -2., -1., -5.], "b": [2., 3., 6., 7.] }

Output: A list of n numbers representing the solution vector x that satisfies T x = b.

Example output: [0.43478261, 0.73913043, -0.43478261, -0.52173913]

Category: matrix_operations

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4.5 correct
o4-mini correct
claude-opus-4.6 incorrect
gemini-2.5-pro incorrect
gpt-5-pro (medium) incorrect

Item 983% solve rate

FFT Complex

This task requires computing the N-dimensional Fast Fourier Transform (FFT) of a complex-valued matrix.
The FFT is a mathematical technique that converts data from the spatial (or time) domain into the frequency domain, revealing both the magnitude and phase of the frequency components.
The input is a square matrix of size n×n, where each element is a complex number containing both real and imaginary parts.
The output is a square matrix of the same size, where each entry is a complex number representing a specific frequency component of the input data, including its amplitude and phase.
This transformation is crucial in analyzing signals and data with inherent complex properties.

Input: A complex-valued n×n matrix represented as an array of complex numbers.

Example input: [[0.5+0.5j, 0.7+0.7j], [0.2+0.2j, 0.9+0.9j]]

Output: An n×n matrix of complex numbers, where each element provides the frequency-domain information of the corresponding element in the input matrix.

Example output: [[(1.8+0.3j), (-0.2+0.1j)], [(0.3-0.1j), (0.6+0.2j)]]

Category: signal_processing

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4-20250514 correct
claude-opus-4.5 correct
gpt-5-pro (medium) incorrect
gpt-oss-120b incorrect
gpt-5.4 incorrect

Item 1089% solve rate

Vertex Cover Given an undirected graph G, find the smallest set of vertices such that every edge has at least one endpoint in the set.

Input: A 2D array A with value 0/1 representing the adjacency matrix A[i][j] = 0 : there is no edge between i, j A[i][j] = 1 : there is an edge between i, j The input should be symmetric

Example input: [ [0,1,0,1], [1,0,1,0], [0,1,0,1], [1,0,1,0] ]

Output: A list showing the index of the selected nodes

Example output: [0, 2]

Category: discrete_optimization

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4.5 correct
gpt-5 correct
o4-mini correct
claude-opus-4-20250514 incorrect
gpt-oss-120b incorrect

Item 1194% solve rate

1D Burgers' Equation Solver Task:

This task involves solving the one-dimensional Burgers' equation, a fundamental nonlinear partial differential equation that combines both advection and diffusion. The equation is given by:

$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu \frac{\partial^2 u}{\partial x^2}$

Where:

u(x,t) is the velocity field
ν is the viscosity coefficient
The term u∂u/∂x represents nonlinear advection (wave propagation)
The term ν∂²u/∂x² represents diffusion (smoothing)

The problem is solved using the method of lines with an upwind scheme for the advection term and central differences for the diffusion term:

$\frac{du_i}{dt} = -u_i \frac{du}{dx}\bigg|_i + \nu \frac{u_{i+1} - 2u_i + u_{i-1}}{(\Delta x)^2}$

Where ∂u/∂x is computed using upwind differencing based on the sign of u to maintain numerical stability.

The system uses Dirichlet boundary conditions (u=0 at both ends) and an initial condition designed to develop shocks, using either a sine wave or carefully positioned Gaussian bumps with small random perturbations.

Input: A dictionary with the following keys:

t0: Initial time (float)
t1: Final time (float, fixed at 0.5)
y0: Initial velocity at each interior grid point (list of floats)
params: Dictionary containing:
- nu: Viscosity coefficient (float, fixed at 0.005)
- dx: Grid spacing (float)
- num_points: Number of interior grid points (integer, scales as 20 * n)
x_grid: Spatial coordinates of interior grid points (list of floats)

Example input:

{
  "t0": 0.0,
  "t1": 0.5,
  "y0": [0.0, 0.3, 0.5, ..., -0.2],  # Values at each grid point
  "params": {
    "nu": 0.005,
    "dx": 0.0476,
    "num_points": 80
  },
  "x_grid": [-0.95, -0.91, ..., 0.95]  # Interior grid points
}

Output: A list of floating-point numbers representing the solution u(x,t1) at each interior grid point.

Example output:

[0.0, 0.02, 0.08, ..., -0.04]

Category: differential_equation

Subject outcomes

claude-opus-4-20250514 correct
claude-opus-4.5 correct
gpt-5 correct
gpt-oss-120b correct
qwen3-coder correct
claude-opus-4-1-20250805 incorrect

Item 1294% solve rate

GeneralizedEigenvectorsComplex Task:

Given two matrices A and B, where:

A and B are arbitrary real n x n matrices, the task is to solve the generalized eigenvalue problem:

A · x = λ B · x

and compute the generalized eigenpairs (eigenvalues and eigenvectors).

In this task, the eigenvalues may be complex and the corresponding eigenvectors may also be complex. The goal is to compute the approximated eigenpairs and return:

A list of eigenvalues (complex numbers) sorted in descending order, with sorting order defined as: first by the real part (in descending order), then by the imaginary part (in descending order).
A list of corresponding generalized eigenvectors (each represented as a list of complex numbers), where each eigenvector is normalized to have unit Euclidean norm.

A valid solution is a tuple (eigenvalues, eigenvectors) where:

eigenvalues is a list of n numbers (complex or real) sorted in descending order.
eigenvectors is an array of n eigenvectors, each of length n, representing the eigenvector corresponding to the eigenvalue at the same index.

A given solution's distance is defined as the average angular difference (in radians) between the computed eigenvectors (obtained by running the solver on the problem) and the provided solution. For each eigenvector pair, the angular difference is computed as:

angle = arccos( |v_computedᴴ v_solution| )

Input: Two matrices A and B represented as arrays of real numbers.

A and B are arbitrary (not necessarily symmetric).

Example input: A = [ [1.0, 2.0], [3.0, 4.0] ] B = [ [2.0, 0.5], [1.5, 3.0] ]

Output: A tuple consisting of:

A list of approximated eigenvalues (which may be complex) sorted in descending order.
A list of corresponding generalized eigenvectors (each a list of complex numbers) normalized to unit Euclidean norm.

Example output: ( [(2.3+0.5j), (0.7-0.5j)], [ [(0.8+0j), (0.6+0j)], [(0.4+0.3j), (-0.7+0.2j)] ] )

Category: matrix_operations

Subject outcomes

claude-opus-4-1-20250805 correct
claude-opus-4-20250514 correct
claude-opus-4.5 correct
gpt-oss-120b correct
qwen3-coder correct
claude-opus-4.6 incorrect

Subjects

The models, agents, and reward models evaluated.

18 subjects, ranked by mean response (accuracy) across this benchmark's items.

1gemini-3.1-pro-preview0.8701
2gpt-5.40.8312
3gpt-5.20.8312
4gpt-50.8117
5claude-opus-4.50.8117
6claude-sonnet-4-5-202509290.7792
7deepseek-reasoner0.7792
8o4-mini0.7662
9glm-4.50.7662
10gemini-3-pro-preview0.7403
11qwen3-coder0.6948
12claude-opus-4-1-202508050.6558
13gpt-5-mini0.6104
14gpt-oss-120b0.6039
15gemini-2.5-pro0.5714
16claude-opus-4-202505140.5584
17claude-opus-4.60.5455
18gpt-5-pro (medium)0.5065

Full data on Hugging Face Back to the gallery