
Human Preference & Alignment

VisIT-Bench

VisIT-Bench: 592 real-world image+instruction vision-language queries across ~70 instruction families. Two public per-(model, item) sources are ingested: (1) 5011 human pairwise-preference battles among six vision-language models (MiniGPT-4, LLaMA-Adapter-v2, LLaVA-13B, PandaGPT-13B, mPLUG-Owl, InstructBLIP), recorded as a binary win/loss per (model, item, opponent); (2) human correctness ratings of GPT-4 predictions, recorded as binary correct/incorrect per item for GPT-4. Each response is binary, in {0, 1}; the full model answer text is kept as the trace.
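As an illustration of the first source, the sketch below expands pairwise battles into binary win/loss rows per (model, item, opponent). The record layout, field names, and example battles are assumptions made for illustration; they are not the dataset's actual schema, and how ties or repeated battles are handled is not stated above.

```python
# Hypothetical battle records: (item_id, model_a, model_b, winner).
# The tuple layout and the example values are illustrative assumptions.
battles = [
    ("item-1", "LLaVA-13B", "MiniGPT-4", "LLaVA-13B"),
    ("item-1", "LLaVA-13B", "PandaGPT-13B", "LLaVA-13B"),
    ("item-2", "mPLUG-Owl", "InstructBLIP", "InstructBLIP"),
]


def battles_to_responses(battles):
    """Expand each battle into two binary rows, one per participant."""
    rows = []
    for item_id, model_a, model_b, winner in battles:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} did not take part in the battle on {item_id}")
        for model, opponent in ((model_a, model_b), (model_b, model_a)):
            rows.append({
                "model": model,
                "item": item_id,
                "opponent": opponent,
                "response": int(winner == model),  # 1 = win, 0 = loss
            })
    return rows


rows = battles_to_responses(battles)
for row in rows:
    print(row)
```

Each battle yields exactly one row with response 1 and one with response 0, so the rows stay keyed by opponent and no information about who was beaten is lost.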

536 items
7 subjects
50% observed
CC-BY-4.0 license
general domain
preference domain
image modality
text modality

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 7 subjects × 536 items, 50% of cells evaluated.
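One way to produce such an ordering is to sort rows and columns by their mean over observed cells. The sketch below assumes that approach; the page does not say how its ordering is actually computed, and the toy matrix, subject names, and item names are hypothetical. Unobserved cells are represented as None.

```python
def mean_observed(values):
    """Mean over observed (non-None) cells; 0.0 if nothing was observed."""
    seen = [v for v in values if v is not None]
    return sum(seen) / len(seen) if seen else 0.0


def order_matrix(matrix, subjects, items):
    """Sort rows by subject mean and columns by item mean, both descending."""
    row_means = [mean_observed(row) for row in matrix]
    col_means = [mean_observed([row[j] for row in matrix]) for j in range(len(items))]
    row_order = sorted(range(len(subjects)), key=lambda i: -row_means[i])
    col_order = sorted(range(len(items)), key=lambda j: -col_means[j])
    ordered = [[matrix[i][j] for j in col_order] for i in row_order]
    return ordered, [subjects[i] for i in row_order], [items[j] for j in col_order]


def observed_fraction(matrix):
    """Share of cells that hold a 0/1 response rather than None."""
    cells = [v for row in matrix for v in row]
    return sum(v is not None for v in cells) / len(cells)


# Toy 3 x 4 matrix (hypothetical subjects and items).
subjects = ["A", "B", "C"]
items = ["q1", "q2", "q3", "q4"]
matrix = [
    [0, None, 1, 0],
    [1, 1, 1, None],
    [None, 0, 1, None],
]

ordered, row_names, col_names = order_matrix(matrix, subjects, items)
print(row_names, col_names)
print(ordered)
print(observed_fraction(matrix))
```

With this sort, the strongest subjects and easiest items end up in the top-left corner. Ties keep their original relative order because Python's sort is stable.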

Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range, each shown with a few subjects' actual answers.

Item 1 · 0% solve rate

Can you count all the dots on the white dice shown in the image?

How subjects answered

  • GPT-4 incorrect

    In the image, there are 4 white dice with the following visible faces:

    1. Die with 2, 1, and 3 showing: 2+1+3 = 6 dots
    2. Die with 3, 5, and 1 showing: 3+5+1 = 9 dots
    3. Die with 5, 4, and 1 showing: 5+4+1 = 10 dots
    4. Die with 2, 4, and 1 showing: 2+4+1 = 7 dots

    Adding all the dots together: 6+9+10+7 = 32 dots

    There are 32 dots visible on the white dice in the image.

Item 2 · 6% solve rate

Answer the following question based on the provided image: Are the trousers the same color as the shoes?

How subjects answered

  • GPT-4 correct

    No, the trousers are not the same color as the shoes. The trousers are burgundy, while the shoes are white.

  • LLaVA-13B incorrect

    No, the trousers are not the same color as the shoes. The person is wearing purple pants and is riding skis on a rail, while the shoes are different from the pants.

  • LLaMA-Adapter-v2 incorrect

    Yes, the trousers are the same color as the shoes, which are green.

  • MiniGPT-4 incorrect

    Yes, the trousers and shoes in the image are the same color.

Item 3 · 33% solve rate

Answer the following question based on the provided image: What is this book called?

How subjects answered

  • MiniGPT-4 correct

    This book is called "Tom Gates Everything is Amazing."

  • GPT-4 correct

    The book is called "TOM GATES Everything's Amazing (sort of) by L. Pichen."

  • LLaVA-13B incorrect

    The book in the image is called "Tom Gates: Everything's Amazing (and I'm Not)". It is a children's book, and the cover is colorful, which suggests that it is intended to be engaging and appealing to young readers.

  • LLaMA-Adapter-v2 incorrect

    The book is called "Tom Gates: Everything's Amazing (Sort of)!"

  • mPLUG-Owl incorrect

    The book in the image is called Everything Amazing: An Incredible Book About the Science of Everyday Things.

Item 4 · 41% solve rate

What are these people doing and what does it mean and what is it for?

How subjects answered

  • LLaMA-Adapter-v2 correct

    The people in the image are participating in a group fitness activity, specifically a Zumba class. They are standing in a line, holding hand weights, and dancing to the music. The Zumba class is a popular exercise program that combines dance moves with aerobic exercise. It is a fun and engaging way to stay active and improve fitness, as it incorporates various dance styles such as salsa, merengue, and reggaeton. The group fitness setting allows participants to enjoy the activity together, motivating each other and creating a supportive environment.

  • GPT-4 correct

    In the image, people are gathered around the large white statue of T-Rex, taking photos and admiring the impressive sculpture. This gathering could be for the unveiling of the statue or a tourist attraction at a theme park or museum. The purpose of the statue is to educate and entertain visitors about the prehistoric creature and its significance in the history of our planet.

  • PandaGPT-13B incorrect

    In the image, a group of people is standing in front of a wall with a large sign on it. The sign is a combination of various symbols, including a box, a triangle, and a square. It appears to be a sign for a business or an establishment. The people seem to be either admiring the sign or discussing something related to the business. The sign serves as an attention-grabbing and informative display for the establishment, helping to convey its purpose or identity to potential customers or visitors.

  • LLaVA-13B incorrect

    In the image, a group of people is participating in a workout or exercise class, following the instructions of an instructor. They are holding hand weights and are in various positions, indicating that they are performing different exercises. This scene could be part of a fitness center, a gym, or an event promoting health and well-being. The purpose of this activity is to improve physical fitness, strength, and overall health by engaging in a structured exercise routine, under the guidance of a professional instructor. It also provides an opportunity for social interaction and support among the participants, which can enhance the motivation to maintain a healthy lifestyle.

  • InstructBLIP incorrect

    In the image, a large group of people are gathered in a gym or sports hall, and they are all holding Wii remotes. They appear to be participating in a Wii Fitness class, which is a type of exercise program that utilizes the Wii console's motion-sensing technology to simulate various physical activities and exercises. The Wii Fitness class is designed to provide a fun and engaging way for people to stay physically active and improve their fitness levels. It can be used as an alternative to traditional gym workouts or as a supplement to existing fitness routines. By participating in these classes, the people in the image are taking advantage of the benefits that the Wii Fitness program …

Item 5 · 44% solve rate

Will the person holding the ropes in the middle of the image attempt to get on the lifeboat that other people in the boat are working to flip over ?

How subjects answered

  • LLaVA-13B correct

    Based on the image, it is plausible that the person holding onto the ropes in the middle of the scene might attempt to get on the lifeboat that the other people are working to flip over. Since the person is in a position to either climb or grab the lifeboat, they might take the opportunity to get on it once it's flipped, especially if they feel that it's necessary to do so for their safety or to help with the rescue efforts. However, it's also possible that they might wait for instructions from others or assess the situation before making any decision.

  • GPT-4 correct

    It is likely that the crew member hanging from the ropes in the middle of the image will attempt to get on the lifeboat, as they are probably trying to ensure their own safety amidst the chaos.

  • LLaMA-Adapter-v2 incorrect

    It is not possible to definitively determine the person's intentions from the image alone. However, given the context of the scene, it is likely that the person holding the ropes in the middle of the image will attempt to get on the lifeboat that other people in the boat are working to flip over.

  • MiniGPT-4 incorrect

    It's difficult to tell from the image alone whether the person in the middle of the boat is attempting to get on the lifeboat that the others are working on flipping over. The image is too blurry to make out any details.

Item 6 · 50% solve rate

How many places will I go through before I reach Ischia based off the sign?

How subjects answered

  • MiniGPT-4 correct

    It looks like the sign you've provided is a road sign with multiple directions. Without more information, it's difficult to accurately determine how many places you'll go through before reaching Ischia.

    Ischia is a small island located in the Gulf of Naples, Italy. It's known for its thermal spas, which offer a range of treatments for various ailments. The island is also known for its beautiful scenery and rich history, including ancient ruins and medieval architecture. If you are planning to visit Ischia, you may want to research the various towns and villages on the island to determine which is closest to Ischia and the most convenient for your trip.

    I hope this helps! Let me know if …

  • GPT-4 correct

    You will go through 4 places before you reach Ischia based on the sign.

  • mPLUG-Owl correct

    To reach Ischia from the sign, you will pass through 14 cities and countries, including New Zealand, Australia, the United States, England, France, Italy, and Germany.

  • LLaMA-Adapter-v2 incorrect

    Based on the sign, you will go through 12 different places before reaching Ischia.

  • PandaGPT-13B incorrect

    Based on the sign, you will go through three places before reaching Ischia: Caserta, Castel di Sasso, and Ischia.

Item 7 · 52% solve rate

What problems and dangers does this cause?

How subjects answered

  • mPLUG-Owl correct

    The presence of a nail in the tire poses several problems and dangers. The nail may cause the tire to deflate or puncture, leading to a loss of control over the vehicle. This could result in accidents, injuries, or damage to property due to the sudden loss of tire pressure. Additionally, if the nail is embedded deeply within the tire tread, the driver may not notice the puncture immediately, increasing the risk of an uncontrolled tire blowout. Furthermore, the nail could cause the tire to wear unevenly or prematurely, which can lead to additional maintenance and potentially shorten the tire's lifespan. It is crucial for drivers to check their tires regularly and have them repaired or …

  • GPT-4 correct

    A metal nail embedded into a car tire can cause air leakage, leading to a flat tire or blowout. This poses dangers such as loss of vehicle control, accidents, and potential harm to the driver, passengers, and other road users.

  • InstructBLIP correct

    The image shows a car tire with a nail stuck in it, which can cause several problems and dangers. Firstly, the nail can puncture the tire, leading to a flat tire or even a blowout, which can be dangerous while driving. Secondly, if the tire is not repaired or replaced promptly, the nail can continue to cause further damage to the tire, potentially leading to a complete tire failure. Thirdly, driving on a flat or damaged tire can affect the vehicle's handling and stability, increasing the risk of accidents or collisions. Therefore, it is essential to ensure that the tire is repaired or replaced as soon as possible to avoid any potential safety hazards on the road.

  • LLaMA-Adapter-v2 incorrect

    The presence of a nail or a sharp object embedded in the tire can pose several problems and dangers. Firstly, it can cause a puncture in the tire, leading to a loss of air pressure and potentially causing the tire to deflate or even burst. This can result in a dangerous situation, as the driver may lose control of the vehicle, leading to accidents or even a complete loss of steering. Secondly, the sharp object can cause damage to the tire's structure, leading to a weakened tire that may fail prematurely or even cause the tire to burst while driving. This can be particularly dangerous if the tire is on the driver's side, as it could cause the vehicle to veer off the road or even roll over. …

  • MiniGPT-4 incorrect

    The image shows a tire tread that has been damaged or worn down. This can cause several problems, including:

    1. The tire may become more prone to blowouts or punctures, which can lead to a car accident.
    2. The tire may have a higher risk of wearing out or becoming damaged.
    3. The tire may have a lower level of grip on the road, which can lead to a higher risk of skidding or losing control.
    4. The tire may have a lower level of resistance to heat, which can lead to overheating and potentially damaging the car's engine.

    It's important to get the tire repaired or replaced as soon as possible to avoid these problems and ensure the safety of the vehicle.

Item 8 · 52% solve rate

Why is the person in blue dress and long hair here ?

How subjects answered

  • MiniGPT-4 correct

    This image appears to show a group of people sitting at tables at an outdoor event, with red banners in the background. They are all dressed in formal attire and appear to be engaging in conversation. The woman in the blue dress and long hair is standing at the edge of the group and appears to be writing on a piece of paper. The other people in the group are engaged in conversation and appear to be having a good time.

  • GPT-4 correct

    The person in the blue dress and long hair is here to attend the event and receive instructions from the man in the gray blazer.

  • LLaVA-13B correct

    Based on the image, it is not possible to determine the exact reason why the person in the blue dress and long hair is present. However, it can be inferred that they are part of the group of people gathered at the table, which includes a businessman, a woman in a blue dress, and others. They could be attending a social event, a business meeting, or participating in some form of entertainment. The presence of wine glasses and a bottle on the table suggests that the gathering might be a relaxed or festive one, but the specific purpose or occasion remains unclear.

  • PandaGPT-13B incorrect

    The person in the blue dress and long hair is here because they are standing in front of a chalkboard, which is a common classroom tool used for teaching and learning. The presence of the chalkboard suggests that they are likely a student or a teacher attending a class or participating in a lesson. The chalkboard itself has a drawing of a brain, which might be related to the subject being taught or discussed in the classroom. The person's role in the scene could be to learn, teach, or contribute to the discussion related to the brain drawing or any other educational content displayed on the chalkboard.

  • mPLUG-Owl incorrect

    The person in blue dress and long hair is likely present at the Chinese wedding ceremony or reception, as they are seen in a scene with several other guests. The wedding is taking place outdoors, as there are chairs and an umbrella, and the guests are gathered around. The presence of the long-haired woman indicates that she could be part of a cultural tradition or a unique aspect of the wedding celebration.

Item 9 · 54% solve rate

Answer the following question based on the provided image: What interests the child most here?

How subjects answered

  • LLaMA-Adapter-v2 correct

    The child's interest in the image is the cake, specifically the candle on top of the cake. The child is sitting at a table, and the cake is placed in front of them. The child is looking at the cake and the lit candle, which indicates that they are fascinated by the birthday celebration and the cake itself.

  • GPT-4 correct

    The child is most interested in the cupcake with a single lit candle. This can be inferred from the description stating that the toddler is staring at the cupcake.

  • LLaVA-13B correct

    In the image, the young child is most interested in the chocolate birthday cake placed on the dining table in front of him. The child is sitting at the table and staring at the cake, likely excited and eager to celebrate his birthday and enjoy the delicious treat. The presence of the cake captures the child's attention and sparks anticipation for the special occasion.

  • PandaGPT-13B incorrect

    In the image, the child appears to be most interested in the word "math" written on the chalkboard. The presence of the word, along with the arrows pointing towards it, suggests that the child is drawn to the subject of math or is engaged in a math-related activity or discussion.

  • MiniGPT-4 incorrect

    The child in the image appears to be interested in the birthday cake with a lit candle in front of it. The child may be excited to blow out the candle and have a piece of cake.

Item 10 · 55% solve rate

What are each of these items called, and what is the purpose of these products, as a whole?

How subjects answered

  • LLaMA-Adapter-v2 correct

    Each of these items is a type of medical instrument, and they are used for various purposes in the medical field. The items include a pair of scissors, a scalpel, a pair of tweezers, a pair of forceps, and a pair of pliers. The purpose of these products is to provide medical professionals with tools to perform various tasks, such as cutting, suturing, extracting, and holding objects during surgical procedures. The selection of these tools depends on the specific needs of the surgery or medical procedure being performed.

  • GPT-4 correct

    The items, from left to right, are called cuticle pusher, earwax scoop, nail clipper, toenail clipper, scissors, blackhead remover, tweezers, and nail cleaner/file. The purpose of these products, as a whole, is to provide essential tools for personal grooming and hygiene maintenance.

  • LLaVA-13B incorrect

    The image displays a row of various scissors, with a total of 11 pairs, lined up next to each other on a blue background. These scissors have different shapes and sizes, and they serve different purposes. Some may be used for cutting fabric, paper, or other materials, while others may be designed specifically for more intricate or specialized tasks, like trimming or embroidery. Collectively, these scissors represent a diverse set of tools that can cater to a wide range of cutting requirements, making them valuable assets in various crafting, sewing, or art projects.

Item 11 · 55% solve rate

Generate a question based on the given image such that it receives the following answer: cough drops

How subjects answered

  • LLaVA-13B correct

    "What are some effective strategies for staying focused while working on a task, such as the man in the image who is sitting in front of a laptop and taking notes on paper?"

  • GPT-4 correct

    What type of product is advertised on the bag beside the laptop in the image?

    Explanation: The bag says "Ricola" on it, which is a brand of cough drops.

  • InstructBLIP incorrect

    or throat lozenges

Item 12 · 83% solve rate

Answer the following question based on the provided image: How many people are on the field?

How subjects answered

  • LLaMA-Adapter-v2 correct

    There are three people on the field in the image: the batter, the catcher, and the umpire.

  • GPT-4 correct

    Based on the provided image description, there are 3 people on the field: the baseball player up to bat, the catcher, and the umpire.

Subjects

The models, agents, and reward models evaluated.

7 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. GPT-4: 0.9152
  2. LLaVA-13B: 0.5889
  3. LLaMA-Adapter-v2: 0.5607
  4. mPLUG-Owl: 0.5108
  5. InstructBLIP: 0.4791
  6. MiniGPT-4: 0.3351
  7. PandaGPT-13B: 0.2089
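
A ranking by mean response can be sketched from long-format records as follows. The (model, item, response) tuple layout and the toy records are assumptions for illustration; they are not the benchmark's data and do not reproduce the scores listed above. Only observed cells appear as records, so unobserved (model, item) pairs do not enter the mean.

```python
from collections import defaultdict


def rank_subjects(records):
    """Return (model, mean response) pairs sorted from highest to lowest mean."""
    totals = defaultdict(lambda: [0, 0])  # model -> [sum of responses, count]
    for model, _item, response in records:
        totals[model][0] += response
        totals[model][1] += 1
    means = {model: s / n for model, (s, n) in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical records: (model, item, response in {0, 1}).
records = [
    ("model-x", "i1", 1),
    ("model-x", "i2", 1),
    ("model-y", "i1", 0),
    ("model-y", "i2", 1),
    ("model-z", "i1", 0),
]

ranking = rank_subjects(records)
for position, (model, mean) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {mean:.4f}")
```

Because each subject is averaged over its own observed items, subjects evaluated on different subsets are not strictly comparable. This matters when only about half of the cells are observed.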