
Human Preference & Alignment

MT-Bench

MT-Bench GPT-4 single-answer judgment scores (34 models, 80 questions, 2 turns).

160 items
34 subjects
100% observed
Subject type: model
License: CC-BY-4.0
Domain: preference
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. The matrix is 34 subjects × 160 items (80 questions × 2 turns), with 100% of cells evaluated. The heatmap shows a representative 102 of the 160 items, evenly sampled across difficulty, so each cell stays square and legible.
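The ordering and sampling described above can be sketched in a few lines. This is a minimal illustration, not the page's actual code: it assumes that "strongest" and "easiest" are measured by row and column means, and that "evenly sampled" means evenly spaced positions in the difficulty-sorted column order. The function names `order_matrix` and `sample_evenly` are hypothetical.

```python
def order_matrix(scores):
    """Sort rows by descending row mean and columns by descending column mean.

    `scores` is a list of equal-length rows, one row per model.
    Returns the reordered matrix plus the row and column orders used.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    row_means = [sum(row) / n_cols for row in scores]
    col_means = [sum(scores[r][c] for r in range(n_rows)) / n_rows
                 for c in range(n_cols)]
    # Assumption: higher mean = stronger model / easier item.
    row_order = sorted(range(n_rows), key=lambda r: -row_means[r])
    col_order = sorted(range(n_cols), key=lambda c: -col_means[c])
    ordered = [[scores[r][c] for c in col_order] for r in row_order]
    return ordered, row_order, col_order


def sample_evenly(n_total, n_keep):
    """Pick n_keep distinct positions spread evenly over range(n_total)."""
    if n_keep >= n_total:
        return list(range(n_total))
    if n_keep == 1:
        return [0]
    step = (n_total - 1) / (n_keep - 1)
    return [round(i * step) for i in range(n_keep)]


# Toy 3 x 3 matrix (made-up scores, not benchmark data).
toy = [[1, 5, 3], [9, 10, 8], [4, 6, 2]]
ordered, row_order, col_order = order_matrix(toy)

# Positions of the 102 displayed columns within the 160 sorted items.
shown = sample_evenly(160, 102)
```

With this ordering, high scores cluster in the top-left corner of `ordered`, and `shown` always includes both the easiest and the hardest item.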

MT-Bench response matrix: AI models (rows) against items (columns)
Legend: low, high, unobserved

Scale: GPT-4 judge rating, 1 (worst) to 10 (best)

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1103% solve rate
What's area of the circle circumscribing the triangle?

How subjects answered

  • Llama-2-7b-chat (score 2)

    The assistant's response is incorrect. The assistant incorrectly states that the vertices of the triangle are not in the same plane, which is not true. The vertices are in the same plane, but they are collinear, meaning they lie on the same line. Therefore, they do not form a triangle. The assistant also incorrectly states that it is not possible to draw a circle with the given vertices because they are not in the same plane. The correct reason is that since the points are collinear, they do not form a triangle, and therefore, there is no circumscribed circle. The assistant should have explain …

  • alpaca-13b (score 1)

    The assistant's response is incorrect. The assistant stated that the area of the triangle is 6 square units and the area of the circumscribing circle is 12 square units. However, the reference answer correctly states that the three points are collinear and do not form a triangle. Therefore, there is no circumscribed circle for this set of points. A circumscribed circle can only be formed around a triangle with non-collinear points. The assistant's response is not only incorrect but also misleading. Rating: [[1]]

  • chatglm-6b (score 1)

    The assistant's response to the second question is incorrect. The assistant incorrectly states that the area of the circumscribed circle can be calculated by subtracting the area of the triangle's base from 1, which is not a valid method. Furthermore, the assistant's calculation of the area of the triangle in the first response is also incorrect. The assistant should have noted that the points are collinear and do not form a triangle, and therefore, there is no circumscribed circle. The correct response should have been that since the three points are collinear and do not form a triangle, ther …
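The untruncated judgments quoted on this page end with a tag such as `Rating: [[1]]`. A minimal sketch of pulling the numeric score out of that tag follows. The helper name `extract_rating` and the regular expression are assumptions based only on the excerpts shown here, not the benchmark's own parser.

```python
import re

# Assumption: the score is a number wrapped in double square brackets.
RATING_PATTERN = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")


def extract_rating(judgment):
    """Return the last [[N]] rating in a judgment as a float, or None if absent."""
    matches = RATING_PATTERN.findall(judgment)
    return float(matches[-1]) if matches else None


complete = "The assistant's response is not only incorrect but also misleading. Rating: [[1]]"
truncated = "The assistant should have explain …"

rating = extract_rating(complete)    # tag present
missing = extract_rating(truncated)  # excerpt cut off before the tag
```

Returning `None` for excerpts that were cut off before the tag keeps truncated text from being mistaken for a score.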

Item 2274% solve rate
Given that f(x) = 4x^3 - 9x - 14, find the value of f(2).

How subjects answered

  • palm-2-chat-bison-001 (score 10)

    The assistant's answer is correct and matches the reference answer. The assistant correctly substituted x = 2 into the function f(x) = 4x^3 - 9x - 14 and calculated the value of f(2) as 0. The assistant's response is concise and to the point, providing the necessary steps and the final answer. Therefore, no corrections are needed. Rating: [[10]]

  • claude-v1 (score 10)

    The assistant's answer is correct and matches the reference answer. The assistant correctly substituted 2 for x in the function f(x) = 4x^3 - 9x - 14 and simplified the expression to find that f(2) = 0. The assistant's explanation is clear and easy to follow, making it helpful for the user. Rating: [[10]]

  • chatglm-6b (score 1)

    The assistant's answer is incorrect. The assistant made a mistake in the calculation of the function f(2). The correct calculation should be: f(2) = 4(2)^3 - 9(2) - 14 f(2) = 4(8) - 18 - 14 f(2) = 32 - 18 - 14 f(2) = 14 - 14 f(2) = 0 The assistant incorrectly calculated 4(2)^3 as 8 instead of 32, which led to the wrong final result. The correct answer is 0, not -5. Therefore, the assistant's response is not helpful or correct. Rating: [[1]]
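The reference value cited in these judgments is easy to confirm by evaluating the polynomial directly. This is a quick check of the arithmetic only:

```python
def f(x):
    """f(x) = 4x^3 - 9x - 14, as given in the item."""
    return 4 * x**3 - 9 * x - 14


# Step by step: 4 * 8 = 32, then 32 - 18 - 14 = 0.
value = f(2)
print(value)
```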

Item 3547% solve rate
If the startup maintains the same strategy for the third year, investing half of the previous year's amount into software development, how much will they invest in the third year?

How subjects answered

  • mpt-30b-instruct (score 10)

    The assistant's answer is correct. The assistant correctly calculated that the startup would invest $2000 in software development in the third year, which is half of the second year's investment. This matches the reference answer. The assistant's response is clear, concise, and directly answers the user's question. Rating: [[10]]

  • gpt-4 (score 10)

    The assistant's answer is correct and matches the reference answer. The assistant correctly calculated that the startup will invest $2000 in the third year, which is half of the previous year's investment of $4000. The assistant's explanation is clear and easy to understand. Therefore, there are no mistakes to correct in the assistant's response. Rating: [[10]]

  • h2ogpt-oasst-open-llama-13b (score 1)

    The assistant's response to the second question is incorrect. The user asked how much the startup would invest in the third year if they continued to invest half of the previous year's amount. The assistant incorrectly stated that the startup would invest $8,000 in the third year. However, according to the user's strategy, the startup should invest half of the second year's amount, which is $4,000. Therefore, the correct answer should be $4,000 / 2 = $2,000. The assistant's response is not helpful because it provides incorrect information. Corrected Assistant's Answer: In the third year, the …

Item 4709% solve rate
How to use CSS to change the color of jokes to red?

How subjects answered

  • gpt-3.5-turbo (score 10)

    The assistant's answer is correct and helpful. It accurately explains how to use CSS to change the color of the jokes to red. The assistant provides a clear and detailed explanation, and also provides an updated HTML code snippet that includes the CSS rule to change the color of the jokes to red. The assistant's answer matches the reference answer in terms of correctness and helpfulness. There are no mistakes to correct in the assistant's response. Rating: [[10]]

  • vicuna-33b-v1.3 (score 10)

    The assistant's answer is correct and helpful. The assistant correctly explained how to change the color of the jokes to red using CSS. The assistant provided the modified HTML code with the `color: red;` property added to the `#joke` `<div>` element in the `<style>` section, which is the correct way to change the text color of the jokes to red. The assistant's answer is consistent with the reference answer. Therefore, there are no mistakes to correct. Rating: [[10]]

  • alpaca-13b (score 2)

    The assistant's response is incorrect. The assistant suggests using a CSS rule for an element called "jokes", which doesn't exist in the HTML code provided. The correct way to change the color of the jokes to red is to add a CSS rule for the paragraph element where the joke is displayed. This can be done by adding a rule for the `#jokeDisplay` element to set the `color` property to red. Here is the corrected version of the assistant's response: ```css #jokeDisplay { color: red; } ``` This rule should be added to the `<style>` section of the HTML code. Now, when a joke is displayed, its t …

Item 5813% solve rate
Share ideas for adapting art masterpieces into interactive experiences for children. List 5 specific artworks and associated ideas.

How subjects answered

  • llama-2-70b-chat (score 10)

    The assistant's response is highly relevant, creative, and detailed. It provides specific ideas for five different masterpieces, incorporating various technologies like virtual reality, augmented reality, and mixed reality to create interactive experiences for children. The assistant also explains how these experiences can be both engaging and educational, fostering children's curiosity and creativity while teaching them about art and history. The response is accurate and demonstrates a deep understanding of the question. Therefore, the assistant's response is of high quality. Rating: [[10]]

  • gpt4all-13b-snoozy (score 10)

    The assistant's response is highly relevant, creative, and detailed. It provides specific ideas for interactive experiences based on five different art masterpieces, as requested by the user. The ideas are varied and engaging, involving different technologies like virtual reality, digital art installations, and mobile apps. The assistant also demonstrates a good understanding of the artworks mentioned, as the suggested experiences are closely related to the themes and elements of each painting. The response is also accurate, as it correctly identifies the artists of the mentioned artworks. Ove …

  • stablelm-tuned-alpha-7b (score 3)

    The assistant's response is not very helpful or creative. While it does list five artworks, it repeats "The Scream" by Edvard Munch twice, which shows a lack of attention to detail. Furthermore, the ideas for interactive experiences are all the same - children creating their own version of the painting. This lacks creativity and does not fully engage children in the art or its history. The assistant could have suggested more interactive and engaging activities, such as creating a story based on the painting, acting out a scene from the painting, or using technology to create a 3D version of th …

Item 6943% solve rate
Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 words.

How subjects answered

  • alpaca-13b (score 10)

    The assistant's response is very relevant and helpful. It provides a clear and concise outline for a blog post comparing two smartphone models. The assistant has included key points and subheadings that would effectively compare and contrast the features, performance, and user experience of the two models. The assistant has also included a conclusion section to summarize the comparison and provide a recommendation. The response is accurate, detailed, and creative. It covers all the necessary aspects of a smartphone comparison, including operating system, processor, display, benchmark tests, re …

  • mpt-30b-chat (score 10)

    The assistant's response is highly relevant, accurate, and detailed. It provides a comprehensive outline for a blog post comparing two smartphone models, covering all the key aspects such as design, display, performance, battery life, camera quality, and software. The assistant also suggests discussing the user experience and making a recommendation in the conclusion, which adds depth to the response. The assistant's response is creative and well-structured, making it easy for the user to follow and develop a blog post based on the provided outline. Therefore, the assistant's response is highl …

  • palm-2-chat-bison-001 (score 8)

    The assistant's response is concise, relevant, and well-structured. It provides a clear and comprehensive outline for a blog post comparing two smartphone models. The assistant has included all the necessary sections such as introduction, key points, and conclusion. It also suggests discussing important aspects like features, performance, and user experience of the two models. The assistant's response is helpful and accurate, providing a good level of detail for the user to follow. However, it could have been more creative by suggesting some unique subheadings or points to discuss. Rating: [[8 …

Subjects

The models, agents, and reward models evaluated.

34 subjects, ranked by mean response across this benchmark's items.

  1. gpt-4: 8.991
  2. gpt-3.5-turbo: 7.944
  3. claude-instant-v1: 7.906
  4. claude-v1: 7.9
  5. vicuna-33b-v1.3: 7.122
  6. wizardlm-30b: 7.009
  7. llama-2-70b-chat: 6.856
  8. Llama-2-13b-chat: 6.65
  9. guanaco-33b: 6.528
  10. tulu-30b: 6.434
  11. guanaco-65b: 6.409
  12. oasst-sft-7-llama-30b: 6.409
  13. palm-2-chat-bison-001: 6.4
  14. mpt-30b-chat: 6.394
  15. vicuna-13b-v1.3: 6.388
  16. wizardlm-13b: 6.353
  17. Llama-2-7b-chat: 6.269
  18. vicuna-7b-v1.3: 5.997
  19. baize-v2-13b: 5.75
  20. nous-hermes-13b: 5.553
  21. mpt-7b-chat: 5.459
  22. gpt4all-13b-snoozy: 5.453
  23. koala-13b: 5.35
  24. mpt-30b-instruct: 5.219
  25. falcon-40b-instruct: 5.169
  26. h2ogpt-oasst-open-llama-13b: 4.625
  27. alpaca-13b: 4.531
  28. chatglm-6b: 4.5
  29. oasst-sft-4-pythia-12b: 4.319
  30. rwkv-4-raven-14b: 3.984
  31. dolly-v2-12b: 3.275
  32. fastchat-t5-3b: 3.041
  33. stablelm-tuned-alpha-7b: 2.753
  34. llama-13b: 2.606
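A ranking like the one above can be reproduced from per-item scores by averaging each subject's row and sorting. The sketch below uses made-up model names and scores purely for illustration; `rank_by_mean` is a hypothetical helper, and how the page breaks ties or rounds is not stated in the source.

```python
def rank_by_mean(scores_by_model):
    """Return (model, mean) pairs sorted from highest to lowest mean score."""
    means = {model: sum(scores) / len(scores)
             for model, scores in scores_by_model.items()}
    return sorted(means.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical per-item judge scores on the 1 to 10 scale (not benchmark data).
toy_scores = {
    "model_a": [10, 8, 9],
    "model_b": [3, 5, 4],
    "model_c": [7, 7, 10],
}

ranking = rank_by_mean(toy_scores)
for position, (model, mean) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {round(mean, 3)}")
```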