Coding & Software

Multi-SWE-bench

Multi-SWE-bench is a multilingual issue-resolving benchmark covering 8 languages (Python, Java, TypeScript, JavaScript, Go, Rust, C, C++). It records per-instance resolved/unresolved verdicts for 39+ agent × model submissions, mirrored into a binary matrix.

2,078 items

82 subjects

33% observed

Apache-2.0 license

software_engineering domain

multilingual domain

text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 82 subjects × 2,078 items, 33% of cells evaluated.
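
The ordering described above can be sketched in a few lines. This is a minimal illustration, assuming the matrix is held as a list of rows with `1`, `0`, and `None` for unobserved cells; the toy data and the names `observed_mean` and `sort_matrix` are hypothetical and are not taken from the gallery's build script.

```python
from typing import Optional

Cell = Optional[int]  # 1 = correct, 0 = incorrect, None = unobserved


def observed_mean(cells: list[Cell]) -> float:
    """Mean over observed cells only; 0.0 if nothing was observed."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def sort_matrix(matrix: list[list[Cell]]) -> tuple[list[int], list[int]]:
    """Return (row_order, col_order): strongest subjects and easiest items first."""
    n_cols = len(matrix[0])
    row_order = sorted(range(len(matrix)), key=lambda r: -observed_mean(matrix[r]))
    col_order = sorted(
        range(n_cols),
        key=lambda c: -observed_mean([row[c] for row in matrix]),
    )
    return row_order, col_order


# Toy 3 x 3 matrix (illustrative values only).
toy = [
    [0, None, 1],
    [1, 0, 1],
    [None, 0, 0],
]
row_order, col_order = sort_matrix(toy)

# Fraction of cells that were evaluated at all.
observed = sum(c is not None for row in toy for c in row) / 9
```

Averaging over observed cells only is a design choice for this sketch: with roughly a third of cells evaluated, treating unobserved cells as failures would penalise subjects that were simply run on fewer items.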

[Figure: Multi-SWE-bench response matrix, AI models (rows) against items (columns). Legend: Correct (1) · Incorrect (0) · Unobserved]

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate

Svelte 5 migration: event modifier transformer doesn't respect indentation

Describe the bug

This...

<script>
  let count = 0;
</script>

<div>
  <div>
    <div>
      <div>
        <button on:click|preventDefault={() => count += 1}>
          clicks: {count}
        </button>
      </div>
    </div>
  </div>
</div>

...becomes this:

<script>
  let count = $state(0);
</script>

<div>
  <div>
    <div>
      <div>
        <button onclick={(event) => {
  event.preventDefault();
  count += 1
}}>
          clicks: {count}
        </button>
      </div>
    </div>
  </div>
</div>
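
The misalignment in the output above comes from inserting a multi-line replacement without carrying over the indentation of the line it lands on. A minimal sketch of the usual remedy follows; the helper `reindent` is hypothetical and is not Svelte's migration code.

```python
def reindent(replacement: str, line: str) -> str:
    """Indent every line of `replacement` after the first to match `line`."""
    indent = line[: len(line) - len(line.lstrip())]
    first, *rest = replacement.split("\n")
    return "\n".join([first] + [indent + r for r in rest])


# The line being migrated, and the handler text a transformer might generate.
source_line = "        <button on:click|preventDefault={() => count += 1}>"
handler = "(event) => {\n  event.preventDefault();\n  count += 1\n}"

aligned = reindent(handler, source_line)
```

The first line is left alone because it is spliced into the middle of an existing line; only the continuation lines need the prefix.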

Reproduction

demo

Subject outcomes

MSWE-agent+Claude-3.5-Sonnet(Oct) incorrect
MSWE-agent+Claude-3.7-Sonnet incorrect
MSWE-agent+DeepSeek-R1 incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect
RepoRepair+Claude-4.5-Sonnet incorrect

Item 2 · 0% solve rate

Internationalisation didn't support language locale containing both script and region.

Description

The i18n_patterns didn't work with locale contains both script and region, like en-latn-us.

Given settings.py

    LANGUAGE_CODE = 'en-us'
    LANGUAGES = [
        ('en-us', "English"),
        ('en-latn-us', "Latin English"),
        ('en-Latn-US', "BCP 47 case format"),
    ]

urls.py

    from django.conf.urls.i18n import i18n_patterns
    from django.http import HttpResponse

    def bangiah(request):
        return HttpResponse('U!')

    urlpatterns += i18n_patterns(
        path('', bangiah),
    )

The response of http://localhost:8000/en-us/ is 200 U!.
The response of http://localhost:8000/en-lat-us/ is 404 not found.
The response of http://localhost:8000/en-Latn-US/ is 404 not found.

Steps to Reproduce

Start a new project with django-admin startproject tshi and cd tshi/

Append to tshi/settings.py as follows

    LANGUAGES = [
        ('en-us', "English"),
        ('en-latn-us', "Latin English"),
        ('en-Latn-US', "BCP 47 case format"),
    ]
    MIDDLEWARE += [
        'django.middleware.locale.LocaleMiddleware',
    ]

Edit tshi/urls.py by appending follows

    from django.conf.urls.i18n import i18n_patterns
    from django.http import HttpResponse

    def bangiah(request):
        return HttpResponse('U!')

    urlpatterns += i18n_patterns(
        path('', bangiah),
    )

python manage.py migrate
python manage.py runserver

The results

The response of http://localhost:8000/en-us/ is 200 U!.
The response of http://localhost:8000/en-lat-us/ is 404 not found.
The response of http://localhost:8000/en-Latn-US/ is 404 not found.

Expect to happen instead

The response of http://localhost:8000/en-latn-us/ and http://localhost:8000/en-Latn-US/ should be 200 U!.

The en-Latn-US tag follows format defined in RFC 5646. It's documented that the language part is always in lowercase, following Accept-Language. Accept-Language is following Content-Language Header, which is following RFC 5646.

The RFC 5646 defined langtag as follow:

    langtag  = language
               ["-" script]
               ["-" region]
               *("-" variant)
               *("-" extension)
               ["-" privateuse]

    language = 2*3ALPHA          ; shortest ISO 639 code
               ["-" extlang]     ; sometimes followed by
                                 ; extended language subtags
             / 4ALPHA            ; or reserved for future use
             / 5*8ALPHA          ; or registered language subtag

    extlang  = 3ALPHA            ; selected ISO 639 codes
               *2("-" 3ALPHA)    ; permanently reserved

    script   = 4ALPHA            ; ISO 15924 code

    region   = 2ALPHA            ; ISO 3166-1 code
             / 3DIGIT            ; UN M.49 code

I have confirmed that this issue can be reproduced as described on a fresh Django project

Python version: 3.7.5
Django version: 3.2.7
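
To make the grammar concrete, here is a minimal sketch of a matcher for the `language ["-" script] ["-" region]` portion only. It is an illustration under stated assumptions, not Django's implementation: variants, extensions, extlang, and private-use subtags are deliberately ignored, and the names `LANGTAG` and `parse_langtag` are hypothetical.

```python
import re

# Simplified subset of the RFC 5646 grammar: language ["-" script] ["-" region].
# Matching is case-insensitive, so en-latn-us and en-Latn-US are treated alike.
LANGTAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"          # 2*3ALPHA
    r"(?:-(?P<script>[a-z]{4}))?"         # 4ALPHA
    r"(?:-(?P<region>[a-z]{2}|\d{3}))?$",  # 2ALPHA / 3DIGIT
    re.IGNORECASE,
)


def parse_langtag(tag: str):
    """Return the subtags as a dict, or None if the tag is outside this subset."""
    match = LANGTAG.match(tag)
    return match.groupdict() if match else None


plain = parse_langtag("en-us")
scripted = parse_langtag("en-Latn-US")
```

Because the sketch omits extlang, a three-letter middle subtag is rejected here even though the full grammar would accept it as an extended language subtag.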

Subject outcomes

Agentless+Claude-3.5-Sonnet(Oct) incorrect
Agentless+Claude-3.7-Sonnet incorrect
Agentless+DeepSeek-R1 incorrect
OpenHands+Llama-4-Maverick incorrect
SWE-agent+Doubao-1.5-thinking incorrect
SWE-agent+Gemini-2.5-Pro incorrect

Item 3 · 0% solve rate

[MenuButton][base] Create the MenuButtonUnstyled component

After creating demos for #30961 it became clear that leaving implementation of menu buttons to developers would force them to write a lot of code. We can create an abstraction for a button that triggers the appearance of a menu and responds to keyboard input (in a slightly different way than a normal button - pressing up/down arrow keys should also open the menu).

Bonus points for making it work with SelectUnstyled.

Note: make sure that clicking on the button when menu is open does not cause blinking, as it's currently the case in MenuUnstyled demos (see https://github.com/mui/material-ui/pull/32661#issue-1228340370, point 1)

Subject outcomes

MSWE-agent+Claude-3.7-Sonnet incorrect
MSWE-agent+DeepSeek-R1 incorrect
MSWE-agent+DeepSeek-V3 incorrect
MopenHands+Doubao-1.5-thinking incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 4 · 0% solve rate

[material] Invalid color prop has no effect

[X] I have searched the existing issues
[X] I have tested the latest version

Steps to reproduce 🕹

Link to live example: CodeSandbox fork based on the Typography demo from the docs

Open CodeSandbox fork
Observe invalid color prop to mui.Typography has no effect

Current behavior 😯

Invalid color prop has absolutely no effect:

no warnings or errors
no type checking
no invalid CSS (at least it would serve as an indicator to the developer)

Expected behavior 🤔

In NODE_ENV=development or using an optional flag to mui.createTheme, etc.

we should tell the developer there's an invalid color prop

Proposal

<details> <summary>Here's what we use internally</summary>

import * as mui from "@mui/material"

import { palettes } from "../options/palette"

let validColors: string[] | undefined
/**
 * @__NO_SIDE_EFFECTS__
 */
export const isColorValid = /* @__PURE__ */ (color?: unknown) => {
  if (process.env.NODE_ENV === `production`) return
  if (typeof color !== `string`) return

  if (!validColors) {
    const tones = Object.keys({
      main: true,
      light: true,
      dark: true,
      contrastText: true,
    } satisfies Record<keyof mui.SimplePaletteColorOptions, true>)

    const colors = Object.keys({
      primary: true,
      secondary: true,
      error: true,
      warning: true,
      info: true,
      success: true,
    } satisfies Record<ColorWithTones, true>)

    const text = Object.keys({
      disabled: true,
      primary: true,
      secondary: true,
    } satisfies Record<keyof mui.TypeText, true>)

    const background = Object.keys({
      default: true,
      paper: true,
      ground: true,
    } satisfies Record<keyof mui.TypeBackground, true>)

    /**
     * Sometimes, we want to let the user to a color that is not in the palette (theme)
     */
    const validStaticColors = [`white`]

    /**
     * A user can use a literal color, by using "mui.useTheme" and then pass a literal color
     */
    const literalThemeColors = Object.keys(palettes).flatMap((paletteName) => {
      const palette = palettes[paletteName]
      const literals = new Set<string>() // to avoid duplicates
      for (const key of Object.keys(palette)) {
        const value = palette[key]
        if (typeof value === `string`) {
          literals.add(value)
          continue
        }

        for (const valueKey of Object.keys(value)) {
          const nestedValue = value[valueKey]
          if (typeof nestedValue === `string`) {
            literals.add(nestedValue)
            continue
          }
        }
      }
      return [...literals]
    })

    validColors = [
      ...validStaticColors,
      ...literalThemeColors,
      `primary`,
      `secondary`,
      ...background.map((tone) => `background.${tone}`),
      ...text.map((tone) => `text.${tone}`),
      ...colors.flatMap((color) => tones.map((tone) => `${color}.${tone}`)),
    ]
  }

  if (!validColors.includes(color)) {
    throw new Error(
      `Invalid color: "${color}"\n` +
        `Valid colors are: ${validColors.join(`, `)}`,
    )
  }
}

</details>

Subject outcomes

MSWE-agent+Claude-3.5-Sonnet(Oct) incorrect
MSWE-agent+Claude-3.7-Sonnet incorrect
MSWE-agent+DeepSeek-R1 incorrect
MopenHands+Doubao-1.5-thinking incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 5 · 0% solve rate

Default theme info is still printed on piped stdout

I believe this is the same issue reported in #3073 and apparently fixed in #3075.

Am I doing something wrong?

What steps will reproduce the bug?

Running $ bat --no-config --list-themes | cat

The --no-config part is optional, it's just to clear my settings for this run.

What happens?

This is the output:

1337
Coldark-Cold
Coldark-Dark
DarkNeon
Dracula
GitHub
Monokai Extended (default dark)
Monokai Extended Bright
Monokai Extended Light (default light)
Monokai Extended Origin
Nord
OneHalfDark
OneHalfLight
Solarized (dark)
Solarized (light)
Sublime Snazzy
TwoDark
Visual Studio Dark+
ansi
base16
base16-256
custom16
gruvbox-dark
gruvbox-light
zenburn

Monokai Extended and Extended Light include default theme annotations.

What did you expect to happen instead?

The same list / output but without (default dark) and (default light) information.
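
The expected behaviour amounts to annotating the defaults only when the output goes to an interactive terminal. A minimal sketch in Python (bat itself is written in Rust, so this is not its code; `list_themes` and the two constants are hypothetical stand-ins):

```python
import io
import sys

# Hypothetical stand-ins for the two default themes named in the report.
DEFAULT_DARK = "Monokai Extended"
DEFAULT_LIGHT = "Monokai Extended Light"


def list_themes(themes, stream=sys.stdout):
    """Write one theme per line; annotate defaults only on an interactive terminal."""
    interactive = stream.isatty()
    for name in themes:
        label = name
        if interactive and name == DEFAULT_DARK:
            label += " (default dark)"
        elif interactive and name == DEFAULT_LIGHT:
            label += " (default light)"
        stream.write(label + "\n")


# io.StringIO reports isatty() == False, so it behaves like a pipe here.
piped = io.StringIO()
list_themes(["Dracula", "Monokai Extended", "Monokai Extended Light"], stream=piped)
```

Keeping piped output free of annotations means each line is an exact theme name that another program can consume.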

How did you install bat?

Via Cargo.

Side note

Probably unrelated but when I run $ bat --list-themes --color=never I get the same output but with (default) instead of (default dark).

bat version and environment

Software version

bat 0.25.0

Operating system

OS: Linux (Ubuntu 23.10)
Kernel: 6.5.0-44-generic

Command-line

bat --diagnostic

Environment variables

BAT_CACHE_PATH=<not set>
BAT_CONFIG_PATH=<not set>
BAT_OPTS=<not set>
BAT_PAGER='less -R'
BAT_PAGING=<not set>
BAT_STYLE=<not set>
BAT_TABS=<not set>
BAT_THEME=<not set>
COLORTERM=truecolor
LANG=en_US.UTF-8
LC_ALL=<not set>
LESS=<not set>
MANPAGER='sh -c '\''col -bx | bat -p --language=man --theme=custom16'\'''
NO_COLOR=<not set>
PAGER=less
SHELL=/usr/bin/zsh
TERM=xterm-256color
XDG_CACHE_HOME=<not set>
XDG_CONFIG_HOME=<not set>

System Config file

Could not read contents of '/etc/bat/config': No such file or directory (os error 2).

Config file

# This is `bat`s configuration file. Each line either contains a comment or
# a command-line option that you want to pass to `bat` by default. You can
# run `bat --help` to get a list of all possible configuration options.

--theme="Dracula"
--italic-text=always
--color=always

Custom assets metadata

bat_version: 0.25.0
creation_time:
  secs_since_epoch: 1736608946
  nanos_since_epoch: 486389724

Custom assets

metadata.yaml, 97 bytes
syntaxes.bin, 973899 bytes
themes.bin, 41464 bytes

Compile time information

Profile: release
Target triple: x86_64-unknown-linux-gnu
Family: unix
OS: linux
Architecture: x86_64
Pointer width: 64
Endian: little
CPU features: fxsr,sse,sse2
Host: x86_64-unknown-linux-gnu

Less version

> less --version 
less 590 (GNU regular expressions)
Copyright (C) 1984-2021  Mark Nudelman

less comes with NO WARRANTY, to the extent permitted by law.
For information about the terms of redistribution,
see the file named README in the less distribution.
Home page: https://greenwoodsoftware.com/less

Subject outcomes

MSWE-agent+Claude-3.5-Sonnet(Oct) incorrect
MSWE-agent+Claude-3.7-Sonnet incorrect
MSWE-agent+DeepSeek-R1 incorrect
MopenHands+Doubao-1.5-thinking incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 6 · 0% solve rate

Factor with extension=True drops a factor of y-1

I guess this related (or a duplicate of?) #5786

This is from stackoverflow: https://stackoverflow.com/questions/60682765/python-sympy-factoring-polynomial-over-complex-numbers

In [9]: z = expand((x-1)*(y-1))

In [10]: z
Out[10]: x⋅y - x - y + 1

In [11]: factor(z)
Out[11]: (x - 1)⋅(y - 1)

In [12]: factor(z, extension=[I])
Out[12]: x - 1

Factor with extension=True drops a factor of y-1

References to other Issues or PRs

Fixes #18895

Brief description of what is fixed or changed

Other comments

Release Notes

NO ENTRY

Subject outcomes

Agentless+Claude-3.5-Sonnet(Oct) incorrect
Agentless+Claude-3.7-Sonnet incorrect
Agentless+DeepSeek-R1 incorrect
OpenHands+Gemini-2.5-Pro incorrect
OpenHands+Llama-4-Maverick incorrect
SWE-agent+Doubao-1.5-thinking incorrect

Item 7 · 0% solve rate

Svelte 5: error/warning follow-up tasks

Describe the problem

Just jotting down a few thoughts on follow-ups to #11294, #11302, #11303 and #11304:

[x] finish porting all runtime errors
[x] use blockquote syntax for existing messages — i.e. put > before everything that isn't a header. The reason for this is that we can provide excessive detail immediately below the blockquote, but (for example) only show it in the docs. It also enables...
[x] ...overloads. In a few cases we have situations like 'did you mean <fuzzymatch>?' — short of inventing a convoluted new syntax this sort of thing is trickier to accommodate in markdown. But I think we could get the same benefits by overloading messages — if we have something like this...
```
## some_error_code

> This is the first message: %message%

> This is the second message: %message%. It has additional details: %details%

This is a long-winded explanation of the two shorter messages above; it does not have parameters, and will be used in the docs
```
...then we could choose which summary message to use based on the function arity. The alternative is to continue having multiple error/warning codes for these situations, but that kinda sucks
[x] sort out the messages themselves — there's lots of weird codes, messages that could be improved, duplicative stuff and so on
[x] add them to the docs
[ ] link to the docs from the console
[ ] add details to more messages
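
The overload idea in the list above can be sketched as follows. This is a hypothetical illustration in Python, not Svelte's implementation: it picks the summary template whose `%name%` placeholders match the parameters supplied, which for the two example messages is equivalent to choosing by arity.

```python
import re

PLACEHOLDER = re.compile(r"%(\w+)%")

# The two overloaded summaries from the example error code above.
TEMPLATES = [
    "This is the first message: %message%",
    "This is the second message: %message%. It has additional details: %details%",
]


def render(templates, **params):
    """Fill the first template whose placeholders equal the supplied parameters."""
    for template in templates:
        if set(PLACEHOLDER.findall(template)) == set(params):
            return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), template)
    raise ValueError(f"no overload accepts parameters {sorted(params)}")


short = render(TEMPLATES, message="oops")
detailed = render(TEMPLATES, message="oops", details="see docs")
```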

Subject outcomes

MSWE-agent+Claude-3.5-Sonnet(Oct) incorrect
MSWE-agent+DeepSeek-R1 incorrect
MSWE-agent+Doubao-1.5-pro incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect
RepoRepair+Claude-4.5-Sonnet incorrect

Item 8 · 4% solve rate

Type check failed when a prop is defined as keyof ...

Vue version

3.4.26

Link to minimal reproduction

https://play.vuejs.org/#eNqFUstOwzAQ/BXLJ5BQcoBTCJUA9QCHtgKOvrjOJrh1bMt2SiHKv7N2SSnPnhLPzuyOd9zTa2uzTQe0oKUXTtpAFNfNFaPBM0o8hM5OmCZEtta4QG5Na0ntTEsYzfJ4impGL5ku810DpOMhQGsVD5DEZZJp3gI2ro1hNEe8zA9I9AwnCqNr2WQrbzQa6qOUUYFaqcDNbZBGo6uCpEqscaXMy33CguvgbMTFM4j1L/jKbyPG6MKBB7dB5/ta4K6BsCtPH2ewxf99sTVVp5D9T/EBvFFd9Lij3XS6QtsHvOT2Li1S6ubJT7cBtB8vFY1G5pD4jOJe49r+uvqn3fPsIumYHnCLYybHE9UBXM0FkBkGM1+uQISxPWZUEB8c2sRkI7Lk7gfy9gXB+fFTQS01LJyxvvzoFoMvyBpeTX0wK2kmJ6dHnk4lN5O+Tz3IMJR5PH9/O8M7nkLsfg==

Steps to reproduce

Just look at the error.

What is expected?

keyof returns the union type string | number, so the type check should pass.

What is actually happening?

Invalid prop: type check failed for prop "name". Expected Object, got String with value "foo".

System Info

System:
  OS: Windows 10 10.0.19045
  CPU: (12) x64 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
  Memory: 6.45 GB / 15.87 GB
Binaries:
  Node: 21.7.1 - D:\Program Files\nodejs\node.EXE
  Yarn: 1.22.22 - D:\Program Files\node\node_global\yarn.CMD
  npm: 10.5.2 - D:\Program Files\nodejs\npm.CMD
  pnpm: 9.0.6 - D:\Program Files\node\node_global\pnpm.CMD
Browsers:
  Edge: Chromium (123.0.2420.97)
  Internet Explorer: 11.0.19041.3636
npmPackages:
  vue: ^3.4.26 => 3.4.26

Any additional comments?

No response

Subject outcomes

RepoRepair+Claude-3.5-Sonnet(Oct) correct
MSWE-agent+DeepSeek-V3 incorrect
MSWE-agent+OpenAI-o3-mini-high incorrect
MopenHands+Doubao-1.5-thinking incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 9 · 9% solve rate

ZSTD_CCtxParams functions

We have functions prefixed with ZSTD_CCtxParams_ and ZSTD_CCtxParam_, we should make this consistent.

Subject outcomes

MopenHands+Doubao-1.5-thinking correct
CodeArts-Agent+CodeArts-GLM-5.1 correct
MSWE-agent+Claude-3.5-Sonnet(Oct) incorrect
MagentLess+Gemini-2.5-Pro incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 10 · 18% solve rate

args_conflicts_with_subcommands does not overriding requireds on arguments

Maintainer's notes

Normally, conflicts are two way and they override required(true). We aren't doing that with args_conflicts_with_subcommands while it can be worked around with subcommand_negates_reqs. The main question is whether to consider this a breaking change or not.

Discussed in https://github.com/clap-rs/clap/discussions/3892

Originally posted by sasial-dev July 1, 2022

How would I not have -f & -p in the subcommands?

    let cli = command!("edit-place")
        .propagate_version(true)
        .arg_required_else_help(true)
        .subcommand(
            command!("config")
                .about("Edit the favourites config")
                .subcommand_required(true)
                .subcommand(command!("add").about("Add a favourite"))
                .subcommand(command!("list").about("List all favourites"))
                .subcommand(command!("remove").about("Remove a favourite")),
        )
        .arg(arg!(-p --place <"place id"> "Place ID to open").global(false))
        .arg(arg!(-f --favourite <"favourite name"> "Favourite place to open").global(false))
        .group(
            ArgGroup::new("id")
                .required(true)
                .args(&["place", "favourite"]),
        )
        .get_matches();

Subject outcomes

MopenHands+OpenAI-o3-mini-high correct
MopenHands+Claude-3.7-Sonnet correct
MopenHands+Claude-3.5-Sonnet(Oct) correct
MopenHands+Doubao-1.5-thinking incorrect
MopenHands+Gemini-2.5-Pro incorrect
MopenHands+Llama-4-Maverick incorrect

Item 11 · 35% solve rate

locale/<language>/LC_MESSAGES/sphinx.po translation ignored

Describe the bug

I read [1] as it should be possible to add a file locale/<language>/LC_MESSAGES/sphinx.mo to the source dir (same dir as the Makefile) and through that change translations or add additional translation to <language>.

When I add locale/da/LC_MESSAGES/sphinx.po, with updated entries for Fig. %s and Listing %s, a locale/da/LC_MESSAGES/sphinx.mo is created (because of gettext_auto_build = True), but the translations are not used. The translations from the official da translation [2] is used. Of course language = 'da' is in conf.py.

[1] http://www.sphinx-doc.org/en/master/usage/configuration.html#confval-locale_dirs
[2] https://github.com/sphinx-doc/sphinx/blob/master/sphinx/locale/da/LC_MESSAGES/sphinx.po

To Reproduce

Steps to reproduce the behavior:

$ git clone https://github.com/jonascj/sphinx-test-locale-override.git
$ cd sphinx-test-locale-override
$ git checkout 8dea4cd # EDIT: current master showcases workaround, so revert back to see the bug
$ # make python venv however you like
$ pip install sphinx
$ make html

Notice that locale/da/LC_MESSAGES/sphinx.mo has been created. Open _build/html/index.html.

Expected behavior

The caption label for the figure figur 1 should have been Foobar 1 (for the sake of testing) and the caption label for the code block Viser 1 should have been Whatever 1 (again for the sake of testing).

Your project

https://github.com/jonascj/sphinx-test-locale-override.git

Screenshots

Screenshot of index.html

Environment info

OS: Arch Linux
Python version: 3.7.3
Sphinx version: 2.1.2
Sphinx extensions: none
Extra tools: none

Subject outcomes

Agentless+Claude-3.7-Sonnet correct
Agentless+DeepSeek-V3 correct
OpenHands+OpenAI-o1 correct
SWE-agent+Doubao-1.5-thinking incorrect
SWE-agent+Gemini-2.5-Pro incorrect
SWE-agent+Llama-4-Maverick incorrect

Item 12 · 60% solve rate

distance calculation wrong

>>> Point(2,0).distance(Point(1,0,2))
1

The 3rd dimension is being ignored when the Points are zipped together to calculate the distance so sqrt((2-1)**2 + (0-0)**2) is being computed instead of sqrt(5).
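
The truncation is easy to reproduce without sympy, because `zip` stops at the shorter input. In the sketch below, plain tuples stand in for `Point` objects, and padding the shorter point with zeros is one possible remedy, not necessarily the fix sympy adopted; both function names are hypothetical.

```python
from itertools import zip_longest
from math import sqrt


def distance_truncating(p, q):
    """Buggy variant: zip() silently drops coordinates beyond the shorter point."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_padded(p, q):
    """Treat missing coordinates as 0 so every dimension contributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip_longest(p, q, fillvalue=0)))


wrong = distance_truncating((2, 0), (1, 0, 2))  # third coordinate ignored
right = distance_padded((2, 0), (1, 0, 2))      # sqrt(1 + 0 + 4)
```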

Subject outcomes

Agentless+Claude-3.7-Sonnet correct
Agentless+DeepSeek-R1 correct
Agentless+DeepSeek-V3 correct
Agentless+Llama-4-Maverick incorrect
OpenHands+Doubao-1.5-thinking incorrect
OpenHands+Llama-4-Maverick incorrect

Subjects

The models, agents, and reward models evaluated.

82 subjects, ranked by mean response (accuracy) across this benchmark's items.

1 · OpenHands+Claude-3.7-Sonnet · 0.5711
2 · CodeArts-Agent+CodeArts-GLM-5.1 · 0.5433
3 · Agentless+Gemini-2.5-Pro · 0.5213
4 · SWE-agent+Claude-3.7-Sonnet · 0.4978
5 · Agentless+OpenAI-o3-mini-high · 0.4926
6 · CodeArts-Agent+CodeArts-MiniMax-M2.5 · 0.4912
7 · Agentless+OpenAI-o1 · 0.4898
8 · Agentless+Claude-3.7-Sonnet · 0.4636
9 · OpenHands+Gemini-2.5-Pro · 0.458
10 · Agentless+Doubao-1.5-thinking · 0.4571
11 · Agentless+DeepSeek-R1 · 0.4518
12 · OpenHands+Claude-3.5-Sonnet(Oct) · 0.4362
13 · Agentless+Claude-3.5-Sonnet(Oct) · 0.4283
14 · SWE-agent+Gemini-2.5-Pro · 0.4212
15 · Agentless+DeepSeek-V3 · 0.4192
16 · SWE-agent+Claude-3.5-Sonnet(Oct) · 0.4133
17 · RepoRepair+Claude-3.5-Sonnet(Oct) · 0.4122
18 · Agentless+Doubao-1.5-pro · 0.3982
19 · SWE-agent+OpenAI-o1 · 0.3945
20 · InfCode+GPT-5.2 · 0.3906
21 · Agentless+Llama-4-Maverick · 0.3818
22 · SWE-agent+Doubao-1.5-thinking · 0.3696
23 · Agentless+GPT-4o-1120 · 0.3686
24 · SWE-agent+OpenAI-o3-mini-high · 0.3557
25 · OpenHands+DeepSeek-R1 · 0.3485
26 · iSWE+Agent · 0.3386
27 · iSWE-OpenModels · 0.3125
28 · OpenHands+DeepSeek-V3 · 0.3042
29 · Agentless+Qwen2.5-72B-Instruct · 0.3039
30 · SWE-agent+Doubao-1.5-pro · 0.2952
31 · OpenHands+GPT-4o-1120 · 0.2936
32 · OpenHands+Doubao-1.5-thinking · 0.2802
33 · MSWE-Agent+CodeArts-MiniMax-M2.5 · 0.28
34 · OpenHands+OpenAI-o3-mini-high · 0.2636
35 · SWE-agent+GPT-4o-1120 · 0.2568
36 · InfCode+GPT-5 · 0.2558