Reasoning & Knowledge

URB

A benchmark of multi-agent reinforcement learning (MARL) routing for autonomous vehicles across 29 real-world traffic networks.

2 items

10 subjects

55% observed

Subject type: Agent


Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 10 subjects × 2 items, 55% of cells evaluated.

URB response matrix: AI models (rows) against items (columns). Legend: low, high, unobserved.
Scale: Per metric, each on its own scale: routing win-rate in [0, 1]; travel time, speed, and mileage on native continuous scales (minutes / km). Random seeds are averaged.
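The ordering and coverage figures described above can be sketched in a few lines. This is a minimal illustration, not the gallery's actual build script: the function names (`observed_mean`, `order_matrix`, `coverage`) and the toy data are hypothetical, unobserved cells are represented as `None`, and the sort assumes that a higher score is better, which may not hold for metrics such as travel time. For the matrix on this page, 10 subjects × 2 items gives 20 cells, so 55% observed corresponds to 11 evaluated cells.

```python
from statistics import mean


def observed_mean(values):
    """Mean of the observed (non-None) entries, or None if nothing was observed."""
    seen = [v for v in values if v is not None]
    return mean(seen) if seen else None


def order_matrix(labels, scores):
    """Sort rows and columns by observed mean, highest first.

    labels -- subject labels, one per row (hypothetical names)
    scores -- list of rows; scores[i][j] is subject i on item j, None if unobserved
    """
    n_items = len(scores[0])
    row_means = [observed_mean(row) for row in scores]
    col_means = [observed_mean([row[j] for row in scores]) for j in range(n_items)]

    def sort_key(m):
        # Rows or columns with no observations at all sort last.
        return (m is None, -(m or 0.0))

    row_order = sorted(range(len(labels)), key=lambda i: sort_key(row_means[i]))
    col_order = sorted(range(n_items), key=lambda j: sort_key(col_means[j]))
    ordered = [[scores[i][j] for j in col_order] for i in row_order]
    return [labels[i] for i in row_order], col_order, ordered


def coverage(scores):
    """Fraction of cells that were actually evaluated."""
    cells = [v for row in scores for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Toy data (assumed, not taken from the benchmark): 3 subjects x 2 items.
toy_labels = ["a", "b", "c"]
toy_scores = [
    [0.2, None],
    [0.9, 0.5],
    [None, 0.4],
]
ordered_labels, column_order, ordered_scores = order_matrix(toy_labels, toy_scores)
toy_coverage = coverage(toy_scores)
```

Because each metric here sits on its own scale, a real implementation would likely normalize each column before averaging across items; the sketch skips that step for brevity.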

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Subjects

The models, agents, and reward models evaluated.

10 subjects, ranked by mean response across this benchmark's items.

Rank  Subject   Mean response
1     59f469ab  21.827
2     20d4083e  16.382
3     1b90aa0d  15.478
4     f2f59ec5  15.221
5     1fca9f14  13.424
6     06732796  12.001
7     13660849  9.661
8     a441b15f  5.403
9     e2ca84db  4.71
10    370aed69  4.647
