PlanBench

PlanBench evaluates LLMs on automated-planning and reasoning-about-change tasks over classic IPC domains (Blocksworld, Logistics, Sokoban) plus mystery, obfuscated, randomized, and unsolvable variants. We ingest the real per-(model, item) graded outputs from the maintained official repo: each response is one LLM's binary correctness on one planning instance (plan generation), graded by the VAL plan verifier or by the run's recorded correctness flag. Items are the real planning prompts; correct_answer is the ground-truth plan.

10,703 items
17 subjects
20% observed
MIT license
Domains: general, planning, reasoning
Modality: text

Response matrix

Every model, scored item by item.

Each row is an AI model and each column an item, ordered so the strongest models and easiest items gather toward one corner. 17 subjects × 10,703 items, 20% of cells evaluated.
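The ordering described above can be sketched as follows. This is a minimal illustration on made-up data, assuming that rows and columns are sorted by their mean score over observed cells only; the page does not state the exact ordering rule, and the function names here are hypothetical.

```python
from typing import Optional

# Toy response matrix (hypothetical data): rows are subjects, columns are items.
# 1 = correct, 0 = incorrect, None = unobserved.
toy: list[list[Optional[int]]] = [
    [0, None, 0, 1],
    [1, 1, None, 1],
    [None, 0, 0, 1],
]


def mean_observed(cells):
    """Mean over observed cells only; 0.0 if nothing was observed."""
    seen = [c for c in cells if c is not None]
    return sum(seen) / len(seen) if seen else 0.0


def observed_fraction(matrix):
    """Share of (subject, item) cells that have a graded response."""
    total = sum(len(row) for row in matrix)
    seen = sum(c is not None for row in matrix for c in row)
    return seen / total


def order_matrix(matrix):
    """Return (row_order, col_order): strongest subjects and easiest items first.

    Assumption: strength and easiness are both measured by mean_observed.
    Ties keep their original relative order because sorted() is stable.
    """
    n_cols = len(matrix[0])
    row_order = sorted(range(len(matrix)),
                       key=lambda r: -mean_observed(matrix[r]))
    col_order = sorted(range(n_cols),
                       key=lambda c: -mean_observed([row[c] for row in matrix]))
    return row_order, col_order


if __name__ == "__main__":
    print(order_matrix(toy))
    print(observed_fraction(toy))
```

Ignoring unobserved cells matters here: with only about 20% of cells evaluated, treating a missing response as incorrect would push sparsely evaluated subjects toward the weak end regardless of how they performed.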

PlanBench response matrix: AI models (rows) against items (columns)
Correct (1) · Incorrect (0) · Unobserved

Scale: 1 = correct · 0 = incorrect

Sample items

What the questions look like — and how subjects answer.

A spread of items across the difficulty range. This benchmark does not publish per-answer traces, so each item shows which subjects succeeded.

Item 1 · 0% solve rate · answer: (unstack a d) (put-down a) (unstack d b) (put-down d) (pick-up a) (stack a b)

Here is a pddl domain, an example problem and it's corresponding plan. Provide the plan for the query problem. Provide only the pddl syntax for the plan. [DOMAIN] (define (domain blocksworld-4ops) (:requirements :strips) (:predicates (clear ?x) (ontable ?x) (handempty) (holding ?x) (on ?x ?y))

(:action pick-up :parameters (?ob) :precondition (and (clear ?ob) (ontable ?ob) (handempty)) :effect (and (holding ?ob) (not (clear ?ob)) (not (ontable ?ob)) (not (handempty))))

(:action put-down :parameters (?ob) :precondition (holding ?ob) :effect (and (clear ?ob) (handempty) (ontable ?ob) (not (holding ?ob))))

(:action stack :parameters (?ob ?underob) :precondition (and (clear ?underob) (holding ?ob)) :effect (and (handempty) (clear ?ob) (on ?ob ?underob) (not (clear ?underob)) (not (holding ?ob))))

(:action unstack :parameters (?ob ?underob) :precondition (and (on ?ob ?underob) (clear ?ob) (handempty)) :effect (and (holding ?ob) (clear ?underob) (not (on ?ob ?underob)) (not (clear ?ob)) (not (handempty)))))

[PROBLEM] (define (problem BW-rand-4) (:domain blocksworld-4ops) (:objects a b c d ) (:init (handempty) (on a b) (ontable b) (ontable c) (on d a) (clear c) (clear d) ) (:goal (and (on a b) (on c d)) ) )

[PLAN] (pick-up c) (stack c d) [PLAN_END]

[QUERY PROBLEM] (define (problem BW-rand-4) (:domain blocksworld-4ops) (:objects a b c d ) (:init (handempty) (on a d) (on b c) (ontable c) (on d b) (clear a) ) (:goal (and (on a b)) ) )

Subject outcomes

  • gpt-4_chat incorrect
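
To make the grading concrete, the ground-truth plan for the Blocksworld item above can be checked by simulating it step by step. The sketch below is a simplified stand-in for a plan verifier, not the VAL tool itself: the action table is transcribed from the PDDL domain in the prompt, while the function names (`parse`, `ground`, `is_valid`) are hypothetical.

```python
import re

# Blocksworld actions transcribed from the PDDL domain in the item above.
# Each entry: (parameters, preconditions, add effects, delete effects).
ACTIONS = {
    "pick-up": (["ob"],
                [("clear", "ob"), ("ontable", "ob"), ("handempty",)],
                [("holding", "ob")],
                [("clear", "ob"), ("ontable", "ob"), ("handempty",)]),
    "put-down": (["ob"],
                 [("holding", "ob")],
                 [("clear", "ob"), ("handempty",), ("ontable", "ob")],
                 [("holding", "ob")]),
    "stack": (["ob", "underob"],
              [("clear", "underob"), ("holding", "ob")],
              [("handempty",), ("clear", "ob"), ("on", "ob", "underob")],
              [("clear", "underob"), ("holding", "ob")]),
    "unstack": (["ob", "underob"],
                [("on", "ob", "underob"), ("clear", "ob"), ("handempty",)],
                [("holding", "ob"), ("clear", "underob")],
                [("on", "ob", "underob"), ("clear", "ob"), ("handempty",)]),
}


def parse(text):
    """Parse '(on a b) (handempty)' into [('on', 'a', 'b'), ('handempty',)]."""
    groups = re.findall(r"\(([^()]*)\)", text)
    return [tuple(g.split()) for g in groups if g.split()]


def ground(atom, binding):
    """Substitute concrete objects for the parameter names in an atom."""
    return (atom[0],) + tuple(binding[p] for p in atom[1:])


def is_valid(init, goal, plan):
    """True if every action is applicable in turn and the goal holds at the end."""
    state = set(parse(init))
    for name, *args in parse(plan):
        if name not in ACTIONS:
            return False
        params, pre, add, delete = ACTIONS[name]
        if len(args) != len(params):
            return False
        binding = dict(zip(params, args))
        if not all(ground(a, binding) in state for a in pre):
            return False  # a precondition fails in the current state
        state -= {ground(a, binding) for a in delete}
        state |= {ground(a, binding) for a in add}
    return all(g in state for g in parse(goal))


# Query problem and ground-truth answer copied from the item above.
INIT = "(handempty) (on a d) (on b c) (ontable c) (on d b) (clear a)"
GOAL = "(on a b)"
ANSWER = "(unstack a d) (put-down a) (unstack d b) (put-down d) (pick-up a) (stack a b)"

if __name__ == "__main__":
    print(is_valid(INIT, GOAL, ANSWER))
```

A plan can be correct without matching the ground-truth answer string exactly, which is why grading by simulation is more appropriate than string comparison.
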
Item 2 · 0% solve rate · answer: (unstack red yellow) (put-down red) (unstack yellow blue) (put-down yellow) (unstack blue orange) (put-down blue) (pick-up orange) (stack orange blue) (pick-up red) (stack red orange) (pick-up yellow) (stack yellow red)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block Unstack a block from on top of another block Put down a block Stack a block on top of another block

I have the following restrictions on my actions: I can only pick up or unstack one block at a time. I can only pick up or unstack a block if my hand is empty. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. I can only put down a block that I am holding. I can only stack a block on top of another block if I am holding the block being stacked. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. Once you stack a block on top of a second block, the second block is no longer clear.

The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

[STATEMENT] As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table My goal is to have that the orange block is on top of the blue block. My plan is as follows:

[PLAN]

  1. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: unstack the blue block from on top of the orange block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear, the hand is empty and the blue block is on top of the orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table

  2. Current State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: put down the blue block Reason: The above action is applicable in the current state because its preconditions; the hand is currently holding blue block, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table

  3. Current State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table Action: pick up the orange block Reason: The above action is applicable in the current state because its preconditions; the orange block is clear, the hand is empty and the orange block is on the table, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table

  4. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table Action: stack the orange block on top of the blue block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear and the hand is currently holding orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table

Final State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table The goal conditions are satisfied in the final state. Hence, the above plan is valid. [PLAN END]

[STATEMENT] As initial conditions I have that, the red block is clear, the hand is empty, the red block is on top of the yellow block, the blue block is on top of the orange block, the yellow block is on top of the blue block and the orange block is on the table My goal is to have that the red block is on top of the orange block, the orange block is on top of the blue block and the yellow block is on top of the red block. My plan is as follows:

[PLAN]

Subject outcomes

  • gpt-4_chat incorrect
Item 3 · 0% solve rate · answer: (feast d b) (succumb d) (feast b a) (succumb b) (feast a c) (succumb a) (attack c) (overcome c b) (attack d) (overcome d a)

I am playing with a set of objects. Here are the actions I can do

Attack object Feast object from another object Succumb object Overcome object from another object

I have the following restrictions on my actions: To perform Attack action, the following facts need to be true: Province object, Planet object, Harmony. Once Attack action is performed the following facts will be true: Pain object. Once Attack action is performed the following facts will be false: Province object, Planet object, Harmony. To perform Succumb action, the following facts need to be true: Pain object. Once Succumb action is performed the following facts will be true: Province object, Planet object, Harmony.
Once Succumb action is performed the following facts will be false: Pain object. To perform Overcome action, the following needs to be true: Province other object, Pain object. Once Overcome action is performed the following will be true: Harmony, Province object, Object Craves other object. Once Overcome action is performed the following will be false: Province other object, Pain object. To perform Feast action, the following needs to be true: Object Craves other object, Province object, Harmony. Once Feast action is performed the following will be true: Pain object, Province other object. Once Feast action is performed the following will be false:, Object Craves other object, Province object, Harmony.

The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

[STATEMENT] As initial conditions I have that, object a craves object c, object d craves object b, harmony, planet object b, planet object c, province object a and province object d My goal is to have that object a craves object d and object b craves object a. My plan is as follows:

[PLAN]

  1. Current State: object a craves object c, object d craves object b, harmony, planet object b, planet object c, province object a and province object d Action: feast object d from object b Reason: The above action is applicable in the current state because its preconditions; object d craves object b, harmony and province object d, are satisfied in the current state. Resulting State: object a craves object c, pain object d, planet object b, planet object c, province object a and province object b

  2. Current State: object a craves object c, pain object d, planet object b, planet object c, province object a and province object b Action: succumb object d Reason: The above action is applicable in the current state because its preconditions; pain object d, are satisfied in the current state. Resulting State: object a craves object c, harmony, planet object b, planet object c, planet object d, province object a, province object b and province object d

  3. Current State: object a craves object c, harmony, planet object b, planet object c, planet object d, province object a, province object b and province object d Action: feast object a from object c Reason: The above action is applicable in the current state because its preconditions; object a craves object c, harmony and province object a, are satisfied in the current state. Resulting State: pain object a, planet object b, planet object c, planet object d, province object b, province object c and province object d

  4. Current State: pain object a, planet object b, planet object c, planet object d, province object b, province object c and province object d Action: overcome object a from object d Reason: The above action is applicable in the current state because its preconditions; pain object a and province object d, are satisfied in the current state. Resulting State: object a craves object d, harmony, planet object b, planet object c, planet object d, province object a, province object b and province object c

  5. Current State: object a craves object d, harmony, planet object b, planet object c, planet object d, province object a, province object b and province object c Action: attack object b Reason: The above action is applicable in the current state because its preconditions; harmony, planet object b and province object b, are satisfied in the current state. Resulting State: object a craves object d, pain object b, planet object c, planet object d, province object a and province object c

  6. Current State: object a craves object d, pain object b, planet object c, planet object d, province object a and province object c Action: overcome object b from object a Reason: The above action is applicable in the current state because its preconditions; pain object b and province object a, are satisfied in the current state. Resulting State: object a craves object d, object b craves object a, harmony, planet object c, planet object d, province object b and province object c

Final State: object a craves object d, object b craves object a, harmony, planet object c, planet object d, province object b and province object c The goal conditions are satisfied in the final state. Hence, the above plan is valid. [PLAN END]

[STATEMENT] As initial conditions I have that, object a craves object c, object b craves object a, object d craves object b, harmony, planet object c and province object d My goal is to have that object c craves object b and object d craves object a. My plan is as follows:

[PLAN]

Subject outcomes

  • gpt-4_chat incorrect
Item 4 · 0% solve rate · answer: (feast c b) (succumb c) (feast b a) (succumb b) (attack a) (overcome a d)

Here is a pddl domain, a planning problem. Provide the plan for the query problem. Provide only the pddl syntax for the plan where each action is represented as (ACTION_NAME OBJECTS). [DOMAIN] (define (domain mystery-4ops) (:requirements :strips) (:predicates (province ?x) (planet ?x) (harmony) (pain ?x) (craves ?x ?y))

(:action attack :parameters (?ob) :precondition (and (province ?ob) (planet ?ob) (harmony)) :effect (and (pain ?ob) (not (province ?ob)) (not (planet ?ob)) (not (harmony))))

(:action succumb :parameters (?ob) :precondition (pain ?ob) :effect (and (province ?ob) (harmony) (planet ?ob) (not (pain ?ob))))

(:action overcome :parameters (?ob ?underob) :precondition (and (province ?underob) (pain ?ob)) :effect (and (harmony) (province ?ob) (craves ?ob ?underob) (not (province ?underob)) (not (pain ?ob))))

(:action feast :parameters (?ob ?underob) :precondition (and (craves ?ob ?underob) (province ?ob) (harmony)) :effect (and (pain ?ob) (province ?underob) (not (craves ?ob ?underob)) (not (province ?ob)) (not (harmony)))))

[QUERY PROBLEM] (define (problem MY-rand-4) (:domain mystery-4ops) (:objects a b c d ) (:init (harmony) (planet a) (craves b a) (craves c b) (planet d) (province c) (province d) ) (:goal (and (craves a d)) ) )

Subject outcomes

  • gpt-4_chat incorrect
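
Comparing the `mystery-4ops` domain above with the `blocksworld-4ops` domain shown in the Blocksworld items, the two have the same structure under a renaming of actions and predicates. The sketch below spells out that renaming, derived by matching preconditions and effects in the two PDDL listings, and applies it to the ground-truth answer of the item above. The function name `deobfuscate` is hypothetical.

```python
import re

# Renaming inferred by aligning the two PDDL domains on this page.
ACTION_MAP = {
    "attack": "pick-up",
    "succumb": "put-down",
    "overcome": "stack",
    "feast": "unstack",
}
PREDICATE_MAP = {
    "province": "clear",
    "planet": "ontable",
    "harmony": "handempty",
    "pain": "holding",
    "craves": "on",
}


def deobfuscate(text):
    """Rewrite the head symbol of each '(head args...)' group using the maps above.

    Symbols that are not in either map are left unchanged.
    """
    mapping = {**ACTION_MAP, **PREDICATE_MAP}

    def rename(match):
        head = match.group(1)
        return "(" + mapping.get(head, head)

    return re.sub(r"\(([\w-]+)", rename, text)


# Ground-truth answer copied from the item above.
MYSTERY_PLAN = "(feast c b) (succumb c) (feast b a) (succumb b) (attack a) (overcome a d)"

if __name__ == "__main__":
    print(deobfuscate(MYSTERY_PLAN))
```

Read this way, the mystery item asks for an ordinary block-stacking plan; only the vocabulary has changed, so the planning difficulty itself is the same.
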
Item 5 · 0% solve rate · answer: (feast a b) (overcome a d) (feast b c) (succumb b) (attack c) (overcome c a)

I am playing with a set of objects. Here are the actions I can do

Attack object Feast object from another object Succumb object Overcome object from another object

I have the following restrictions on my actions: To perform Attack action, the following facts need to be true: Province object, Planet object, Harmony. Once Attack action is performed the following facts will be true: Pain object. Once Attack action is performed the following facts will be false: Province object, Planet object, Harmony. To perform Succumb action, the following facts need to be true: Pain object. Once Succumb action is performed the following facts will be true: Province object, Planet object, Harmony.
Once Succumb action is performed the following facts will be false: Pain object. To perform Overcome action, the following needs to be true: Province other object, Pain object. Once Overcome action is performed the following will be true: Harmony, Province object, Object Craves other object. Once Overcome action is performed the following will be false: Province other object, Pain object. To perform Feast action, the following needs to be true: Object Craves other object, Province object, Harmony. Once Feast action is performed the following will be true: Pain object, Province other object. Once Feast action is performed the following will be false:, Object Craves other object, Province object, Harmony.

The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

[STATEMENT] As initial conditions I have that, object a craves object b, harmony, planet object b, planet object c, planet object d, province object a, province object c and province object d My goal is to have that object c craves object a. My plan is as follows:

[PLAN]

  1. Current State: object a craves object b, harmony, planet object b, planet object c, planet object d, province object a, province object c and province object d Action: attack object c Reason: The above action is applicable in the current state because its preconditions; harmony, planet object c and province object c, are satisfied in the current state. Resulting State: object a craves object b, pain object c, planet object b, planet object d, province object a and province object d

  2. Current State: object a craves object b, pain object c, planet object b, planet object d, province object a and province object d Action: overcome object c from object a Reason: The above action is applicable in the current state because its preconditions; pain object c and province object a, are satisfied in the current state. Resulting State: object a craves object b, object c craves object a, harmony, planet object b, planet object d, province object c and province object d

Final State: object a craves object b, object c craves object a, harmony, planet object b, planet object d, province object c and province object d The goal conditions are satisfied in the final state. Hence, the above plan is valid. [PLAN END]

[STATEMENT] As initial conditions I have that, object a craves object b, object b craves object c, harmony, planet object c, planet object d, province object a and province object d My goal is to have that object a craves object d and object c craves object a. My plan is as follows:

[PLAN]

Subject outcomes

  • gpt-4_chat incorrect
Item 6 · 0% solve rate · answer: (unstack b a) (put-down b) (unstack d c) (put-down d) (pick-up a) (stack a d) (pick-up c) (stack c b)

Here is a pddl domain, an example problem and it's corresponding plan. Provide the plan for the query problem. Provide only the pddl syntax for the plan. [DOMAIN] (define (domain blocksworld-4ops) (:requirements :strips) (:predicates (clear ?x) (ontable ?x) (handempty) (holding ?x) (on ?x ?y))

(:action pick-up :parameters (?ob) :precondition (and (clear ?ob) (ontable ?ob) (handempty)) :effect (and (holding ?ob) (not (clear ?ob)) (not (ontable ?ob)) (not (handempty))))

(:action put-down :parameters (?ob) :precondition (holding ?ob) :effect (and (clear ?ob) (handempty) (ontable ?ob) (not (holding ?ob))))

(:action stack :parameters (?ob ?underob) :precondition (and (clear ?underob) (holding ?ob)) :effect (and (handempty) (clear ?ob) (on ?ob ?underob) (not (clear ?underob)) (not (holding ?ob))))

(:action unstack :parameters (?ob ?underob) :precondition (and (on ?ob ?underob) (clear ?ob) (handempty)) :effect (and (holding ?ob) (clear ?underob) (not (on ?ob ?underob)) (not (clear ?ob)) (not (handempty)))))

[PROBLEM] (define (problem BW-rand-4) (:domain blocksworld-4ops) (:objects a b c d ) (:init (handempty) (on a b) (ontable b) (on c a) (on d c) (clear d) ) (:goal (and (on a c) (on b d) (on d a)) ) )

[PLAN] (unstack d c) (put-down d) (unstack c a) (put-down c) (unstack a b) (stack a c) (pick-up d) (stack d a) (pick-up b) (stack b d) [PLAN_END]

[QUERY PROBLEM] (define (problem BW-rand-4) (:domain blocksworld-4ops) (:objects a b c d ) (:init (handempty) (ontable a) (on b a) (ontable c) (on d c) (clear b) (clear d) ) (:goal (and (on a d) (on c b)) ) )

Subject outcomes

  • gpt-4_chat incorrect
Item 7 · 0% solve rate · answer: (feast d c) (succumb d) (feast c b) (succumb c) (attack b) (overcome b a) (attack d) (overcome d b)

Here is a pddl domain, a planning problem. Provide the plan for the query problem. Provide only the pddl syntax for the plan where each action is represented as (ACTION_NAME OBJECTS). [DOMAIN] (define (domain mystery-4ops) (:requirements :strips) (:predicates (province ?x) (planet ?x) (harmony) (pain ?x) (craves ?x ?y))

(:action attack :parameters (?ob) :precondition (and (province ?ob) (planet ?ob) (harmony)) :effect (and (pain ?ob) (not (province ?ob)) (not (planet ?ob)) (not (harmony))))

(:action succumb :parameters (?ob) :precondition (pain ?ob) :effect (and (province ?ob) (harmony) (planet ?ob) (not (pain ?ob))))

(:action overcome :parameters (?ob ?underob) :precondition (and (province ?underob) (pain ?ob)) :effect (and (harmony) (province ?ob) (craves ?ob ?underob) (not (province ?underob)) (not (pain ?ob))))

(:action feast :parameters (?ob ?underob) :precondition (and (craves ?ob ?underob) (province ?ob) (harmony)) :effect (and (pain ?ob) (province ?underob) (not (craves ?ob ?underob)) (not (province ?ob)) (not (harmony)))))

[QUERY PROBLEM] (define (problem MY-rand-4) (:domain mystery-4ops) (:objects a b c d ) (:init (harmony) (planet a) (planet b) (craves c b) (craves d c) (province a) (province d) ) (:goal (and (craves b a) (craves d b)) ) )

Subject outcomes

  • gpt-4_chat incorrect
Item 8 · 0% solve rate · answer: (unstack yellow orange) (put-down yellow) (unstack orange blue) (stack orange yellow) (unstack blue red) (put-down blue) (pick-up red) (stack red orange)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block Unstack a block from on top of another block Put down a block Stack a block on top of another block

I have the following restrictions on my actions: I can only pick up or unstack one block at a time. I can only pick up or unstack a block if my hand is empty. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. I can only put down a block that I am holding. I can only stack a block on top of another block if I am holding the block being stacked. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. Once you stack a block on top of a second block, the second block is no longer clear.

The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

[STATEMENT] As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table My goal is to have that the orange block is on top of the blue block. My plan is as follows:

[PLAN]

  1. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: unstack the blue block from on top of the orange block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear, the hand is empty and the blue block is on top of the orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table

  2. Current State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: put down the blue block Reason: The above action is applicable in the current state because its preconditions; the hand is currently holding blue block, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table

  3. Current State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table Action: pick up the orange block Reason: The above action is applicable in the current state because its preconditions; the orange block is clear, the hand is empty and the orange block is on the table, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table

  4. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table Action: stack the orange block on top of the blue block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear and the hand is currently holding orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table

Final State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table The goal conditions are satisfied in the final state. Hence, the above plan is valid. [PLAN END]

[STATEMENT] As initial conditions I have that, the yellow block is clear, the hand is empty, the blue block is on top of the red block, the orange block is on top of the blue block, the yellow block is on top of the orange block and the red block is on the table My goal is to have that the red block is on top of the orange block and the orange block is on top of the yellow block. My plan is as follows:

[PLAN]

Subject outcomes

  • gpt-4_chat incorrect
Item 9 · 9% solve rate · answer: (unstack b c) (put-down b) (unstack c a) (put-down c) (unstack a d) (stack a b) (pick-up d) (stack d a) (pick-up c) (stack c d)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block Unstack a block from on top of another block Put down a block Stack a block on top of another block

I have the following restrictions on my actions: I can only pick up or unstack one block at a time. I can only pick up or unstack a block if my hand is empty. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. I can only put down a block that I am holding. I can only stack a block on top of another block if I am holding the block being stacked. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. Once you stack a block on top of a second block, the second block is no longer clear.

[STATEMENT] As initial conditions I have that, the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the yellow block is on top of the blue block, the red block is on the table, the blue block is on the table and the orange block is on the table. My goal is to have that the red block is on top of the yellow block, the orange block is on top of the red block and the yellow block is on top of the blue block.

My plan is as follows:

[PLAN] pick up the red block stack the red block on top of the yellow block pick up the orange block stack the orange block on top of the red block [PLAN END]

[STATEMENT] As initial conditions I have that, the blue block is clear, the hand is empty, the red block is on top of the yellow block, the blue block is on top of the orange block, the orange block is on top of the red block and the yellow block is on the table. My goal is to have that the red block is on top of the blue block, the orange block is on top of the yellow block and the yellow block is on top of the red block.

My plan is as follows:

[PLAN]

Subject outcomes

  • claude-3-opus correct
  • claude-3.5-sonnet_aws incorrect
  • gemini-1.5-flash incorrect
  • llama-3.1-405b_aws incorrect
  • llama3-70b-8192_groq incorrect
  • qwen-qwq incorrect

Item 10 · 31% solve rate · answer: (feast b a) (succumb b) (attack c) (overcome c b) (attack d) (overcome d a)

I am playing with a set of objects. Here are the actions I can do

Attack object Feast object from another object Succumb object Overcome object from another object

I have the following restrictions on my actions: To perform Attack action, the following facts need to be true: Province object, Planet object, Harmony. Once Attack action is performed the following facts will be true: Pain object. Once Attack action is performed the following facts will be false: Province object, Planet object, Harmony. To perform Succumb action, the following facts need to be true: Pain object. Once Succumb action is performed the following facts will be true: Province object, Planet object, Harmony.
Once Succumb action is performed the following facts will be false: Pain object. To perform Overcome action, the following needs to be true: Province other object, Pain object. Once Overcome action is performed the following will be true: Harmony, Province object, Object Craves other object. Once Overcome action is performed the following will be false: Province other object, Pain object. To perform Feast action, the following needs to be true: Object Craves other object, Province object, Harmony. Once Feast action is performed the following will be true: Pain object, Province other object. Once Feast action is performed the following will be false:, Object Craves other object, Province object, Harmony.

[STATEMENT] As initial conditions I have that, object b craves object a, harmony, planet object a, planet object c, planet object d, province object b, province object c and province object d. My goal is to have that object c craves object b and object d craves object a.

What is the plan to achieve my goal? Just give the actions in the plan.

Subject outcomes

  • deepseek-r1-api correct
  • o1-preview_chat correct
  • o1-mini_chat correct
  • gpt-4_chat incorrect
  • llama3-70b-8192_groq incorrect
  • llama-3.1-405b_aws incorrect
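
Item 10 reads like an obfuscated Blocksworld instance: matching each action's preconditions and effects against the Blocksworld restrictions quoted in the other items suggests that Attack behaves like pick-up, Succumb like put-down, Overcome like stack, and Feast like unstack. The sketch below applies that inferred correspondence to a plan string. The mapping and the helper names (`ACTION_MAP`, `parse_plan`, `demystify`) are assumptions made for illustration and are not taken from the benchmark's code.

```python
import re
from typing import List, Tuple

# Assumed correspondence, inferred by comparing preconditions and effects
# in the Item 10 prompt with the plain Blocksworld rules.
ACTION_MAP = {
    "attack": "pick-up",
    "succumb": "put-down",
    "overcome": "stack",
    "feast": "unstack",
}


def parse_plan(plan_text: str) -> List[Tuple[str, ...]]:
    """Split '(a x) (b y z)' into [('a', 'x'), ('b', 'y', 'z')]."""
    return [tuple(step.split()) for step in re.findall(r"\(([^()]+)\)", plan_text)]


def demystify(plan_text: str) -> str:
    """Rewrite an obfuscated plan using the assumed Blocksworld action names."""
    steps = []
    for name, *args in parse_plan(plan_text):
        steps.append("(" + " ".join([ACTION_MAP[name], *args]) + ")")
    return " ".join(steps)


mystery_answer = "(feast b a) (succumb b) (attack c) (overcome c b) (attack d) (overcome d a)"
print(demystify(mystery_answer))
# (unstack b a) (put-down b) (pick-up c) (stack c b) (pick-up d) (stack d a)
```

Under this reading, the ground-truth plan for Item 10 is an ordinary six-step block rearrangement, with only the vocabulary changed.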

Item 11 · 60% solve rate · answer: (pick-up b) (stack b a)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block Unstack a block from on top of another block Put down a block Stack a block on top of another block

I have the following restrictions on my actions: I can only pick up or unstack one block at a time. I can only pick up or unstack a block if my hand is empty. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. I can only put down a block that I am holding. I can only stack a block on top of another block if I am holding the block being stacked. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. Once you stack a block on top of a second block, the second block is no longer clear.

[STATEMENT] As initial conditions I have that, the red block is clear, the orange block is clear, the hand is empty, the red block is on top of the blue block, the blue block is on the table and the orange block is on the table. My goal is to have that the red block is on top of the orange block and the blue block is on top of the red block.

My plan is as follows:

[PLAN] unstack the red block from on top of the blue block stack the red block on top of the orange block pick up the blue block stack the blue block on top of the red block [PLAN END]

[STATEMENT] As initial conditions I have that, the red block is clear, the blue block is clear, the hand is empty, the red block is on top of the orange block, the blue block is on the table and the orange block is on the table. My goal is to have that the blue block is on top of the red block.

My plan is as follows:

[PLAN]

Subject outcomes

  • claude-3.5-sonnet_aws correct
  • gemini-1.5-flash correct
  • llama3-70b-8192_groq correct
  • gemini-pro incorrect
  • gpt-3.5-turbo-instruct incorrect
  • gpt-4o_chat incorrect

Item 12 · 100% solve rate · answer: (unstack orange red) (put-down orange) (pick-up red) (stack red blue)

I am playing with a set of blocks where I need to arrange the blocks into stacks. Here are the actions I can do

Pick up a block Unstack a block from on top of another block Put down a block Stack a block on top of another block

I have the following restrictions on my actions: I can only pick up or unstack one block at a time. I can only pick up or unstack a block if my hand is empty. I can only pick up a block if the block is on the table and the block is clear. A block is clear if the block has no other blocks on top of it and if the block is not picked up. I can only unstack a block from on top of another block if the block I am unstacking was really on top of the other block. I can only unstack a block from on top of another block if the block I am unstacking is clear. Once I pick up or unstack a block, I am holding the block. I can only put down a block that I am holding. I can only stack a block on top of another block if I am holding the block being stacked. I can only stack a block on top of another block if the block onto which I am stacking the block is clear. Once I put down or stack a block, my hand becomes empty. Once you stack a block on top of a second block, the second block is no longer clear.

The plan correctness is defined in terms of states resulting from executing the actions in the plan. An action is executable in a state when all its preconditions hold in that state. The state resulting from the action execution consists of everything in the previous state with the addition and deletion of add and delete effects of the action. Plan correctness is defined as follows: if the first action in the plan is applicable in the initial state, i.e., its preconditions are all present there; and the second action is applicable in the state resulting from applying the first action to the initial state, this process continues until the state resulting from the application of the last action in the last but one state gives rise to the final state where all the goals are satisfied.

[STATEMENT] As initial conditions I have that, the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table My goal is to have that the orange block is on top of the blue block. My plan is as follows:

[PLAN]

  1. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is empty, the blue block is on top of the orange block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: unstack the blue block from on top of the orange block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear, the hand is empty and the blue block is on top of the orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table

  2. Current State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is currently holding blue block, the red block is on the table, the orange block is on the table and the yellow block is on the table Action: put down the blue block Reason: The above action is applicable in the current state because its preconditions; the hand is currently holding blue block, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table

  3. Current State: the red block is clear, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the red block is on the table, the blue block is on the table, the orange block is on the table and the yellow block is on the table Action: pick up the orange block Reason: The above action is applicable in the current state because its preconditions; the orange block is clear, the hand is empty and the orange block is on the table, are satisfied in the current state. Resulting State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table

  4. Current State: the red block is clear, the blue block is clear, the yellow block is clear, the hand is currently holding orange block, the red block is on the table, the blue block is on the table and the yellow block is on the table Action: stack the orange block on top of the blue block Reason: The above action is applicable in the current state because its preconditions; the blue block is clear and the hand is currently holding orange block, are satisfied in the current state. Resulting State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table

Final State: the red block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the blue block, the red block is on the table, the blue block is on the table and the yellow block is on the table The goal conditions are satisfied in the final state. Hence, the above plan is valid. [PLAN END]

[STATEMENT] As initial conditions I have that, the blue block is clear, the orange block is clear, the yellow block is clear, the hand is empty, the orange block is on top of the red block, the red block is on the table, the blue block is on the table and the yellow block is on the table My goal is to have that the red block is on top of the blue block. My plan is as follows:

[PLAN]

Subject outcomes

  • gpt-4_chat correct
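
The plan-correctness definition quoted in Item 12 (each action must be applicable in the state produced by the previous one, and the final state must satisfy every goal) can be sketched as a small checker. This is a minimal illustration, not the VAL verifier used for grading. The `pick-up` and `put-down` schemas follow the PDDL domain shown in the first sample item; the `unstack` and `stack` schemas are written from the natural-language restrictions and are an assumption, as are the names `ground` and `is_valid_plan`.

```python
from typing import Iterable, Set, Tuple

Fact = Tuple[str, ...]


def ground(name: str, args: Tuple[str, ...]) -> Tuple[Set[Fact], Set[Fact], Set[Fact]]:
    """Return (preconditions, add effects, delete effects) for one grounded action."""
    hand: Fact = ("handempty",)
    if name == "pick-up":
        (x,) = args
        return ({("clear", x), ("ontable", x), hand},
                {("holding", x)},
                {("clear", x), ("ontable", x), hand})
    if name == "put-down":
        (x,) = args
        return ({("holding", x)},
                {("clear", x), ("ontable", x), hand},
                {("holding", x)})
    if name == "unstack":  # assumed schema, derived from the prose restrictions
        x, y = args
        return ({("on", x, y), ("clear", x), hand},
                {("holding", x), ("clear", y)},
                {("on", x, y), ("clear", x), hand})
    if name == "stack":  # assumed schema, derived from the prose restrictions
        x, y = args
        return ({("holding", x), ("clear", y)},
                {("on", x, y), ("clear", x), hand},
                {("holding", x), ("clear", y)})
    raise ValueError(f"unknown action: {name}")


def is_valid_plan(init: Iterable[Fact], goal: Iterable[Fact],
                  plan: Iterable[Tuple[str, ...]]) -> bool:
    """Execute the plan step by step; fail on the first inapplicable action."""
    state = set(init)
    for name, *args in plan:
        pre, add, delete = ground(name, tuple(args))
        if not pre <= state:  # some precondition does not hold
            return False
        state = (state - delete) | add
    return set(goal) <= state


# Query problem and ground-truth answer from Item 12.
init = {("clear", "blue"), ("clear", "orange"), ("clear", "yellow"), ("handempty",),
        ("on", "orange", "red"), ("ontable", "red"), ("ontable", "blue"),
        ("ontable", "yellow")}
goal = {("on", "red", "blue")}
answer = [("unstack", "orange", "red"), ("put-down", "orange"),
          ("pick-up", "red"), ("stack", "red", "blue")]
print(is_valid_plan(init, goal, answer))  # True
```

A plan that opens with `(pick-up red)` is rejected by the same checker, because the red block is not clear in the initial state.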

Subjects

The models, agents, and reward models evaluated.

17 subjects, ranked by mean response (accuracy) across this benchmark's items.

  1. gemini-2.0-flash-thinking-exp: 0.678
  2. deepseek-r1-api: 0.5688
  3. o1-preview_chat: 0.4464
  4. claude-3.5-sonnet_aws: 0.289
  5. claude-3-opus: 0.272
  6. llama-3.1-405b_aws: 0.2696
  7. o1-mini_chat: 0.2643
  8. gemini-1.5-pro: 0.2005
  9. gpt-4o_chat: 0.1616
  10. gpt-4-turbo_chat: 0.1603
  11. gpt-4_chat: 0.1494
  12. llama3-70b-8192_groq: 0.1239
  13. qwen-qwq: 0.118
  14. gemini-1.5-flash: 0.1003
  15. gpt-3.5-turbo-instruct: 0.0345
  16. gemini-pro: 0.0332
  17. gpt-4o-mini-2024-07-18_chat: 0
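
Both the per-subject mean response and the per-item solve rate can be read off a sparse response matrix. The sketch below assumes that each statistic is averaged over observed cells only, so unobserved (subject, item) pairs are skipped rather than counted as incorrect; the page does not spell this out. The toy data and the names `subject_means` and `item_solve_rates` are hypothetical.

```python
from typing import Dict, List, Tuple

# subject -> {item id -> 1 (correct) or 0 (incorrect)}; unobserved cells are absent.
Responses = Dict[str, Dict[str, int]]


def subject_means(responses: Responses) -> List[Tuple[str, float]]:
    """Mean response per subject over its observed items, best first."""
    means = [(subject, sum(cells.values()) / len(cells))
             for subject, cells in responses.items() if cells]
    return sorted(means, key=lambda pair: pair[1], reverse=True)


def item_solve_rates(responses: Responses) -> Dict[str, float]:
    """Fraction of the subjects evaluated on each item that answered it correctly."""
    totals: Dict[str, Tuple[int, int]] = {}
    for cells in responses.values():
        for item, score in cells.items():
            correct, seen = totals.get(item, (0, 0))
            totals[item] = (correct + score, seen + 1)
    return {item: correct / seen for item, (correct, seen) in totals.items()}


# Made-up toy matrix, not benchmark data.
toy: Responses = {
    "model_a": {"i1": 1, "i2": 0, "i3": 1},
    "model_b": {"i1": 0, "i2": 0},
    "model_c": {"i3": 1},
}
print(subject_means(toy))
print(item_solve_rates(toy))  # {'i1': 0.5, 'i2': 0.0, 'i3': 1.0}
```

With only about 20% of cells observed, subjects are not necessarily scored on the same items, so means computed this way are not directly comparable across subjects evaluated on different subsets.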