SWE-bench Verified

Build the SWE-bench Verified response matrix from the experiments repo

Items: 500
Subjects: 134
Observed: 100%
Subject type: Agent
License: MIT
Domain: software_engineering
Modality: text
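
A minimal sketch of how such a matrix could be assembled, under stated assumptions. The loader assumes a hypothetical layout, `<root>/<subject>/results/results.json` with a `"resolved"` list of item IDs; the actual layout of the experiments repo may differ, and the names `load_resolved` and `build_matrix` are illustrative only. Because the matrix here is fully observed, any item not listed as resolved is scored 0.

```python
import json
from pathlib import Path


def load_resolved(root):
    """Map subject name -> set of resolved item IDs.

    Assumed (hypothetical) layout: <root>/<subject>/results/results.json,
    where the JSON object carries a "resolved" list of item IDs.
    """
    resolved = {}
    for path in sorted(Path(root).glob("*/results/results.json")):
        subject = path.parent.parent.name
        data = json.loads(path.read_text(encoding="utf-8"))
        resolved[subject] = set(data.get("resolved", []))
    return resolved


def build_matrix(resolved, item_ids):
    """Return (subjects, rows); rows[i][j] is 1 if subject i solved item j, else 0."""
    subjects = sorted(resolved)
    rows = [[1 if item in resolved[s] else 0 for item in item_ids] for s in subjects]
    return subjects, rows


# Tiny in-memory example with made-up subjects and items.
demo = {"agent_a": {"item-1", "item-3"}, "agent_b": {"item-3"}}
subjects, rows = build_matrix(demo, ["item-1", "item-2", "item-3"])
```

With the real data, `item_ids` would hold the 500 benchmark items and `rows` would have 134 entries.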

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 134 subjects × 500 items, 100% of cells evaluated. The heatmap shows a representative 402 of 500 items — evenly sampled across difficulty — so each cell stays square and legible.
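
The ordering and sampling described above can be sketched as follows. This is an illustration, not the page's actual procedure: it assumes rows are sorted by accuracy and columns by solve rate (both descending), and that "evenly sampled" means evenly spaced positions in the sorted column order. The function names are hypothetical.

```python
def order_matrix(rows):
    """Sort rows by accuracy and columns by solve rate, both descending."""
    n_items = len(rows[0])
    row_order = sorted(range(len(rows)), key=lambda i: -sum(rows[i]))
    col_order = sorted(range(n_items), key=lambda j: -sum(r[j] for r in rows))
    ordered = [[rows[i][j] for j in col_order] for i in row_order]
    return ordered, row_order, col_order


def sample_evenly(n_total, n_keep):
    """Pick n_keep distinct indices spread evenly over range(n_total)."""
    if n_keep >= n_total:
        return list(range(n_total))
    if n_keep == 1:
        return [0]
    # The step is greater than 1 here, so rounded positions never collide.
    return [round(k * (n_total - 1) / (n_keep - 1)) for k in range(n_keep)]


matrix = [[0, 0, 1], [1, 0, 1], [0, 0, 0]]
ordered, row_order, col_order = order_matrix(matrix)
shown_columns = sample_evenly(500, 402)  # 402 of 500 column positions
```

After sorting, the 1s in `ordered` cluster in the top-left corner, which is the visual effect the heatmap relies on.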

[Figure: SWE-bench Verified response matrix heatmap, AI models (rows) against items (columns). Legend: 1 = correct · 0 = incorrect · unobserved.]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.
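
An item's solve rate is the fraction of subjects that got it right, i.e. the column mean of the response matrix. The sketch below computes those rates and picks the item closest to each of a few target rates. How the page selected its sample items is not stated, so `pick_spread` is only one plausible approach, and both function names are hypothetical.

```python
def solve_rates(rows):
    """Column means: the fraction of subjects that solved each item."""
    n_subjects = len(rows)
    return [sum(r[j] for r in rows) / n_subjects for j in range(len(rows[0]))]


def pick_spread(rates, targets):
    """For each target rate, return the index of the item with the closest solve rate."""
    return [min(range(len(rates)), key=lambda j: abs(rates[j] - t)) for t in targets]


# Four subjects (rows) by four items (columns), made-up data.
responses = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
rates = solve_rates(responses)
chosen = pick_spread(rates, [0.0, 0.5, 1.0])
```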

Item 1 · 0% solve rate
Ordering problem in admin.RelatedFieldListFilter and admin.RelatedOnlyFieldListFilter
Description
	
RelatedFieldListFilter doesn't fall back to the ordering defined in Model._meta.ordering. 
Ordering gets set to an empty tuple in https://github.com/django/django/blob/2.2.1/django/contrib/admin/filters.py#L196 and unless ordering is defined on the related model's ModelAdmin class it stays an empty tuple. IMHO it should fall back to the ordering defined in the related model's Meta.ordering field.
RelatedOnlyFieldListFilter doesn't order the related model at all, even if ordering is defined on the related model's ModelAdmin class.
That's because the call to field.get_choices https://github.com/django/django/blob/2.2.1/django/contrib/admin/filters.py#L422 omits the ordering kwarg entirely.

Subject outcomes

  • 20231010_rag_claude2 incorrect
  • 20250522_sweagent_claude-4-sonnet-20250514 incorrect
  • 20241029_epam-ai-run-claude-3-5-sonnet incorrect
Item 2 · 13% solve rate
AttributeError with cross_val_predict(method='predict_proba') when using MultiOutputClassifier
#### Description
I believe there is a bug when using `cross_val_predict(method='predict_proba')` with a `MultiOutputClassifier`. 

I think the problem is in the use of `estimator.classes_` here:
https://github.com/scikit-learn/scikit-learn/blob/3be7110d2650bbe78eda673001a7adeba62575b0/sklearn/model_selection/_validation.py#L857-L866

To obtain the `classes_` attribute of a `MultiOutputClassifier`, you need `mo_clf.estimators_[i].classes_` instead.

If core team members have any idea of how to address this, I am happy to submit a patch. 

#### Steps/Code to Reproduce

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict

X, Y = make_multilabel_classification()

mo_lda = MultiOutputClassifier(LinearDiscriminantAnalysis())
pred = cross_val_predict(mo_lda, X, Y, cv=5) # Works fine
pred_proba =  cross_val_p …
```

Subject outcomes

  • 20251215_livesweagent_claude-opus-4-5 correct
  • 20250929_Prometheus_v1.2_gpt5 correct
  • 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 incorrect
Item 3 · 45% solve rate
[ENH]: Add get/set_antialiased to Text objects
### Problem

Currently, Text objects always retrieve their antialiasing state via the global rcParams["text.antialiased"], unlike other artists for which this can be configured on a per-artist basis via `set_antialiased` (and read via `get_antialiased`).

### Proposed solution

Add similar getters/setters on Text objects (also adjusting Annotations accordingly, if needed) and use that info in the drawing stage.

Should be relatively easy to implement, except that the slight fiddling needed with backends requires some understanding of backend code (I think we need to replace the access to `rcParams["text.antialiased"]` by going through the GraphicsContext state).

Subject outcomes

  • 20241108_devlo correct
  • 20250629_deepswerl_r2eagent_tts correct
  • 20241025_composio_swekit incorrect
Item 4 · 68% solve rate
Subclassed SkyCoord gives misleading attribute access message
I'm trying to subclass `SkyCoord`, and add some custom properties. This all seems to be working fine, but when I have a custom property (`prop` below) that tries to access a non-existent attribute (`random_attr`) below, the error message is misleading because it says `prop` doesn't exist, where it should say `random_attr` doesn't exist.

```python
import astropy.coordinates as coord


class custom_coord(coord.SkyCoord):
    @property
    def prop(self):
        return self.random_attr


c = custom_coord('00h42m30s', '+41d12m00s', frame='icrs')
c.prop
```

raises
```
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    c.prop
  File "/Users/dstansby/miniconda3/lib/python3.7/site-packages/astropy/coordinates/sky_coordinate.py", line 600, in __getattr__
    .format(self.__class__.__name__, attr))
AttributeError: 'custom_coord' object has no attribute 'prop'
```
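
The behaviour reported above can be reproduced without astropy. The sketch below is a simplified stand-in, not astropy's code: `Base` is a hypothetical class whose `__getattr__` always raises. When a property getter raises `AttributeError`, Python falls back to `__getattr__` with the property's own name, so the final message names `prop` rather than the attribute that is actually missing.

```python
class Base:
    """Hypothetical stand-in for a class that defines __getattr__."""

    def __getattr__(self, attr):
        # Called only after normal attribute lookup has failed.
        raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{attr}'"
        )


class Custom(Base):
    @property
    def prop(self):
        # random_attr does not exist, so this raises AttributeError,
        # which makes Python retry the lookup via __getattr__('prop').
        return self.random_attr


try:
    Custom().prop
except AttributeError as exc:
    message = str(exc)
```

Here `message` refers to `prop`, mirroring the misleading traceback in the report.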

Subject outcomes

  • 20250226_swerl_llama3_70b correct
  • 20250511_sweagent_lm_32b correct
  • 20250805_openhands-Qwen3-Coder-30B-A3B-Instruct incorrect
Item 5 · 84% solve rate
URLValidator tests failing on Python versions patched for bpo-43882
Description
	
On Python versions with a fix for bpo-43882 (i.e. 3.10.0b1 and the 3.9 git branch, not released yet) the following tests fail:
======================================================================
FAIL: test_validators (validators.tests.TestValidators) [URLValidator] (value='http://www.djangoproject.com/\n')
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/usr/lib/python3.7/unittest/case.py", line 59, in testPartExecutor
	yield
 File "/usr/lib/python3.7/unittest/case.py", line 546, in subTest
	yield
 File "/tmp/portage/dev-python/django-3.2.1/work/Django-3.2.1/tests/validators/tests.py", line 328, in test_validators
	validator(value)
 File "/usr/lib/python3.7/unittest/case.py", line 203, in __exit__
	self._raiseFailure("{} not raised".format(exc_name))
 File "/usr/lib/python3.7/unittest/case.py", line 135, in _raiseFailure
	raise self.test_case.failureException(msg)
AssertionError: ValidationError not raised
============================= …

Subject outcomes

  • 20251215_livesweagent_claude-opus-4-5 correct
  • 20250901_warp correct
  • 20250112_ugaiforge incorrect
Item 6 · 96% solve rate
ManagementUtility instantiates CommandParser without passing already-computed prog argument
Description
	
ManagementUtility goes to the trouble to parse the program name from the argv it's passed rather than from sys.argv: 
	def __init__(self, argv=None):
		self.argv = argv or sys.argv[:]
		self.prog_name = os.path.basename(self.argv[0])
		if self.prog_name == '__main__.py':
			self.prog_name = 'python -m django'
But then when it needs to parse --pythonpath and --settings, it uses the program name from sys.argv: 
		parser = CommandParser(usage='%(prog)s subcommand [options] [args]', add_help=False, allow_abbrev=False)
Above "%(prog)s" refers to sys.argv[0]. Instead, it should refer to self.prog_name. This can be fixed as follows:
		parser = CommandParser(
			prog=self.prog_name,
			usage='%(prog)s subcommand [options] [args]',
			add_help=False,
			allow_abbrev=False)
I'm aware that execute_from_command_line is a private API, but it'd be really convenient for me if it worked properly in my weird embedded environment where sys.argv[0] is incorrectly None. If passing my own argv to ex …

Subject outcomes

  • 20231010_rag_claude2 correct
  • 20250522_tools_claude-4-opus correct
  • 20240509_amazon-q-developer-agent-20240430-dev incorrect

Subjects

The models, agents, and reward models evaluated.

134 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. 20251205_sonar-foundation-agent_claude-opus-4-5: 0.792
  2. 20251215_livesweagent_claude-opus-4-5: 0.792
  3. 20250928_trae_doubao_seed_code: 0.788
  4. 20251127_openhands_claude-opus-4-5: 0.776
  5. 20251120_livesweagent_gemini-3-pro-preview: 0.774
  6. 20250902_atlassian-rovo-dev: 0.768
  7. 20250804_epam-ai-run-claude-4-sonnet: 0.768
  8. 20250819_ACoder: 0.764
  9. 20250901_warp: 0.756
  10. 20250612_trae: 0.752
  11. 20251103_sonar-foundation-agent_claude-sonnet-4-5: 0.748
  12. 20250731_harness_ai: 0.748
  13. 20250720_Lingxi-v1.5_claude-4-sonnet-20250514: 0.746
  14. 20250915_JoyCode: 0.746
  15. 20251015_Prometheus_v1.2.1_gpt5: 0.744
  16. 20250603_Refact_Agent_claude-4-sonnet: 0.744
  17. 20251103_SalesforceAIResearch_SAGE_OpenHands: 0.738
  18. 20250522_tools_claude-4-opus: 0.732
  19. 20251021_SalesforceAIResearch_SAGE_bash_only: 0.73
  20. 20250522_tools_claude-4-sonnet: 0.724
  21. 20250807_openhands_gpt5: 0.718
  22. 20250715_qodo_command: 0.712
  23. 20250710_bloop: 0.712
  24. 20251014_Lingxi_kimi_k2: 0.712
  25. 20250929_Prometheus_v1.2_gpt5: 0.712
  26. 20250623_warp: 0.71
  27. 20250611_moatless_claude-4-sonnet-20250514: 0.708
  28. 20250519_trae: 0.706
  29. 20250610_augment_agent_v1: 0.704
  30. 20250515_Refact_Agent: 0.704
  31. 20250524_openhands_claude_4_sonnet: 0.704
  32. 20250519_devlo: 0.702
  33. 20250430_zencoder_ai: 0.7
  34. 20250805_openhands-Qwen3-Coder-480B-A35B-Instruct: 0.696
  35. 20250930_zai_glm4-6: 0.682
  36. 20250516_cortexa_o3: 0.682

+ 98 more subjects evaluated.
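
The ranking above amounts to sorting subjects by the row mean of the response matrix. A minimal sketch follows. The page does not state how ties such as the two 0.792 scores are ordered, so breaking ties alphabetically by subject name is an assumption, as is the function name `rank_subjects`.

```python
def rank_subjects(subjects, rows):
    """Rank subjects by mean response (accuracy), highest first.

    Ties are broken alphabetically by name; that rule is an assumption.
    """
    scored = [(sum(r) / len(r), name) for name, r in zip(subjects, rows)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [
        (rank, name, round(acc, 3))
        for rank, (acc, name) in enumerate(scored, start=1)
    ]


# Three made-up subjects scored on four items.
ranking = rank_subjects(
    ["a", "b", "c"],
    [[1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 0, 1]],
)
```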