THOR Challenge

Navigate and find objects in a virtual environment

In this challenge, the task is to navigate towards a set of objects in indoor scenes based only on visual input. AI2-THOR enables navigation in a set of near-photo-realistic indoor scenes, and we will use this framework for the challenge. The agents that the challenge participants provide receive visual input from THOR and should perform an action based on the visual input.

Competition Conclusion

The THOR Challenge has a winner!

A team from National Tsing Hua University (NTHU) has won the THOR Challenge with a strong showing in the discrete navigation challenge.

The NTHU team will present their work at the Workshop on Visual Understanding Across Modalities @ CVPR, from 2:00 to 2:30 pm on 7/26, along with an invited talk from Kevin Murphy. Everyone is encouraged to attend!

Discrete Navigation Leaderboard

Rank  Entrant      Score
1     JamesChuang  0.04248
2     phunghx      0.01651
3     random       0.00001

Test Data

The test data for the THOR Challenge is now available. To use it, download the test target definition file and the THOR build corresponding to your architecture, linked here. Installation is unchanged from the instructions below; simply substitute the new THOR build wherever a build appears.

The instructions for running the robosims framework are also unchanged from those given below. You need only replace the training targets file with the test target definition file, and the THOR build given in unity_path with the appropriate OSX or Linux build linked here.


The task is to navigate and find objects using a set of predefined actions. The agent is placed at a random location within the scene and given a target image (e.g., an image of an apple). The action set is (MoveAhead, MoveRight, MoveLeft, MoveBack, LookUp, LookDown, RotateRight, RotateLeft). The agent must not only navigate to the closest possible point to the object, but also be looking at the object when it stops. The release for the challenge includes 30 scenes (15 kitchens and 15 living rooms), of which 20 are for training and 10 are for validation. We will release the test scenes later.

Figure: target image, agent start position, and agent target position.


The evaluation server is up and running! It is hosted by CodaLab and can be found here.

A separate prize of $3,000 will be awarded to the highest-scoring entrant in each of the discrete and continuous navigation subtasks.

To submit results, participants should:

  1) Sign up for a CodaLab account by clicking the Sign Up button on the competition page linked above.
  2) Click on the Participate tab, agree to the terms and conditions, and click Register.
  3) If needed, navigate to the Participate tab and click on the Get Data side tab to download the train and dev sets.
  4) Navigate to the Learn the Details tab and click on the Evaluation tab in the sidebar to read about the required submission format.
  5) After your request is approved, navigate to the Participate tab and then click on the Submit/View Results tab in the sidebar.
  6) Click the Submit button and upload a results file.
  7) After your submission is scored, you will have a chance to review it before posting it to the leaderboard.
  8) If you have questions, please ask them in the THOR competition forum (located under the Forum tab).


The tentative dates for the competition are:

Test set is released June 26th (tentative)
Final submission July 15th (11:59PM GMT)
Winners announced July 20th
Conference workshop July 26th


The framework for the challenge consists of three parts:

OSX Installation
pip install
Linux Installation
To run on Linux you must have an X server running that has a GLX module enabled.
pip install

Robosims Example - Training

#!/usr/bin/env python
import json

import robosims

env = robosims.controller.ChallengeController(
    # unity_path='',  # path to the THOR build for your platform (OSX or Linux)
    x_display="0.0"  # ignored on OSX; on Linux, set this to the appropriate display
)
with open("thor-challenge-targets/targets-train.json") as f:
    t = json.loads(f.read())

for target in t:
    # Possible actions are: MoveLeft, MoveRight, MoveAhead, MoveBack, LookUp, LookDown, RotateRight, RotateLeft
    event = env.step(action=dict(action='MoveLeft'))

    # path to the target image (e.g. apple, lettuce, keychain, etc.)
    target_image_path = target["targetImage"]

    # target_found() returns True/False if the target has been found
    found = env.target_found()

    # image of the current frame from the agent - numpy array of shape (300,300,3) in RGB order
    image = event.frame
    # True/False whether the last action was successful.  Actions will fail if you attempt to move into a wall or
    # LookUp/Down beyond the allowed range.
    success = event.metadata['lastActionSuccess']

The structure of the target definition:
    {
        "agentPositionIndex": 102,  # corresponds to a point on the grid
        "sceneIndex": 0,  # which configuration for the scene (0-4)
        "sceneName": "FloorPlan1",  # the name of the scene
        "startingHorizon": 330.0,  # starting camera horizon (up,down) in degrees of the agent's camera
        "startingRotation": 90.0,  # global rotation of the agent
        "targetAgentHorizon": 60.0,  # where agent should be looking to see the target object
        "targetAgentRotation": 180.0,  # how the agent should be rotated to see the target object
        "targetImage": "images/FloorPlan1/spoon.png",  # path to image file that shows the target image
        "targetObjectId": "Spoon|-01.73|+00.90|-00.95",  # internal id of the target object
        "targetPosition": {
            "x": -0.6949999928474426,
            "y": 0.9800000190734863,
            "z": -1.7799999713897705
        },  # final position to see the target object
        "uuid": "a27cb81c-5f68-4f90-80e8-e01e49ab188d"  # unique identifier for the task
    }

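As a minimal sketch of how one such entry might be consumed, the snippet below parses a single target definition with Python's json module and pulls out the fields an agent would most likely need. The TargetTask class and the parse_target helper are hypothetical names introduced here for illustration and are not part of robosims; the field names and values are taken from the example above.

```python
import json
from dataclasses import dataclass

# One entry of the targets file, copied from the example above with the
# explanatory comments removed so that it is valid JSON.
EXAMPLE_TARGET = """
{
    "agentPositionIndex": 102,
    "sceneIndex": 0,
    "sceneName": "FloorPlan1",
    "startingHorizon": 330.0,
    "startingRotation": 90.0,
    "targetAgentHorizon": 60.0,
    "targetAgentRotation": 180.0,
    "targetImage": "images/FloorPlan1/spoon.png",
    "targetObjectId": "Spoon|-01.73|+00.90|-00.95",
    "targetPosition": {
        "x": -0.6949999928474426,
        "y": 0.9800000190734863,
        "z": -1.7799999713897705
    },
    "uuid": "a27cb81c-5f68-4f90-80e8-e01e49ab188d"
}
"""


@dataclass(frozen=True)
class TargetTask:
    """Hypothetical container for the fields of one navigation task."""
    uuid: str
    scene_name: str
    scene_index: int
    target_image: str
    target_position: tuple  # (x, y, z)


def parse_target(raw: dict) -> TargetTask:
    """Convert one decoded target definition into a TargetTask."""
    pos = raw["targetPosition"]
    return TargetTask(
        uuid=raw["uuid"],
        scene_name=raw["sceneName"],
        scene_index=raw["sceneIndex"],
        target_image=raw["targetImage"],
        target_position=(pos["x"], pos["y"], pos["z"]),
    )


task = parse_target(json.loads(EXAMPLE_TARGET))
print(task.scene_name, task.target_image)
```

Since the targets file is iterated as a list in the examples on this page, the same helper could be applied to each element after loading the file.
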
Challenge Submission

Participants should submit a file listing the actions their agent took. The code below shows how to use the API. When running against the validation scenes, the target_found() method never returns True, so the agent must decide for itself when it has reached the goal and stop issuing actions. Once the agent has iterated through all the targets defined in the targets file, save the actions to a JSON file and submit it to the CodaLab challenge.

#!/usr/bin/env python
import json

import robosims

env = robosims.controller.ChallengeController(
    # unity_path='',  # path to the THOR build for your platform (OSX or Linux)
)
with open("thor-challenge-targets/targets-val.json") as f:
    t = json.loads(f.read())

for target in t:
    # navigate to the target
    event = env.step(action=dict(action='MoveLeft'))
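
One way to keep track of the actions taken for each target is sketched below. This is only an illustration under stated assumptions: the run_episode and random_policy helpers, the fake_step stand-in for env.step, and the layout of the output (a mapping from task uuid to a list of action dictionaries) are all hypothetical. The actual submission format is the one described under the Evaluation tab on the competition page.

```python
import json
import random

ACTIONS = ["MoveAhead", "MoveRight", "MoveLeft", "MoveBack",
           "LookUp", "LookDown", "RotateRight", "RotateLeft"]


def run_episode(step_fn, choose_action, max_steps):
    """Issue at most max_steps actions and return the list of actions taken.

    choose_action receives the latest event (None before the first step) and
    returns an action dict, or None when the agent decides to stop.
    """
    taken = []
    event = None
    for _ in range(max_steps):
        action = choose_action(event)
        if action is None:
            break
        event = step_fn(action)
        taken.append(action)
    return taken


def random_policy(rng, stop_probability):
    """Hypothetical baseline: pick a random action, or stop with some probability."""
    def choose(_event):
        if rng.random() < stop_probability:
            return None
        return dict(action=rng.choice(ACTIONS))
    return choose


def fake_step(action):
    """Stand-in for env.step so the sketch runs without a THOR build."""
    return {"lastAction": action["action"]}


rng = random.Random(0)
targets = [{"uuid": "task-a"}, {"uuid": "task-b"}]  # placeholder targets
results = {
    target["uuid"]: run_episode(fake_step, random_policy(rng, 0.1), max_steps=20)
    for target in targets
}
serialized = json.dumps(results, indent=2)
```

With the real framework, step_fn could be something like lambda a: env.step(action=a), and the policy would inspect event.frame instead of choosing at random.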


Continuous vs Discrete Track

The challenge has two settings: (1) discrete and (2) continuous. By default, the framework operates in discrete mode, in which the agent navigates on a predefined grid. The agent can only look up and down in 30-degree increments and rotate in 90-degree increments. If an action would push the agent into a wall, the agent stays where it is on the grid and the lastActionSuccess value is set to False. In continuous mode, the agent is free to rotate, look up/down, and move by any amount. This allows the agent to collide with walls, move towards an object in fewer steps, and rotate with greater precision.
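
The discrete-mode rules described above can be illustrated with a small stand-alone model. This is a sketch and not the THOR implementation: the Pose class, the step function, the mapping of headings to grid axes, the sign convention for the horizon (positive means looking down), and the horizon limits are all assumptions made for the example.

```python
from dataclasses import dataclass, replace

# Grid offset (dx, dz) for each heading in degrees. Which axis a heading maps
# to is an assumption made for this sketch.
HEADINGS = {0: (0, 1), 90: (1, 0), 180: (0, -1), 270: (-1, 0)}
# Direction of each move action relative to the agent's current heading.
MOVE_OFFSETS = {"MoveAhead": 0, "MoveRight": 90, "MoveBack": 180, "MoveLeft": 270}
# Assumed limits for the camera horizon, in degrees.
MIN_HORIZON, MAX_HORIZON = -30, 60


@dataclass(frozen=True)
class Pose:
    x: int
    z: int
    rotation: int = 0  # multiples of 90 degrees
    horizon: int = 0   # multiples of 30 degrees


def step(pose, action, free_cells):
    """Apply one discrete action and return (new_pose, success).

    A failed action leaves the pose unchanged, mirroring the description of
    lastActionSuccess being set to False.
    """
    if action in MOVE_OFFSETS:
        dx, dz = HEADINGS[(pose.rotation + MOVE_OFFSETS[action]) % 360]
        cell = (pose.x + dx, pose.z + dz)
        if cell not in free_cells:
            return pose, False  # blocked, e.g. by a wall
        return replace(pose, x=cell[0], z=cell[1]), True
    if action in ("RotateRight", "RotateLeft"):
        delta = 90 if action == "RotateRight" else -90
        return replace(pose, rotation=(pose.rotation + delta) % 360), True
    if action in ("LookUp", "LookDown"):
        horizon = pose.horizon + (30 if action == "LookDown" else -30)
        if not MIN_HORIZON <= horizon <= MAX_HORIZON:
            return pose, False  # beyond the allowed range
        return replace(pose, horizon=horizon), True
    raise ValueError("unknown action: %r" % (action,))


free_cells = {(0, 0), (0, 1)}
pose, ok = step(Pose(0, 0), "MoveAhead", free_cells)
blocked_pose, blocked_ok = step(pose, "MoveAhead", free_cells)
```

In this model the second MoveAhead fails because the cell (0, 2) is not free, so the pose is returned unchanged along with a False success flag.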

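# to enable continuous mode, pass the mode to the constructor
env = robosims.controller.ChallengeController(
    # unity_path='',  # path to the THOR build for your platform (OSX or Linux)
    mode='continuous'  # keyword name assumed from the comment above
)

Continuous actions: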

# All Move actions must include the moveMagnitude parameter
event = env.step(action=dict(action='MoveLeft', moveMagnitude=0.25))

# Rotate the agent to a specific rotation in global space
event = env.step(action=dict(action='Rotate', rotation=75.0))

# Move the horizon of the agent's camera (up/down) 
event = env.step(action=dict(action='Look', horizon=35.0))

# Save actions for submission
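
Because every continuous action carries an extra parameter, it may help to build the action dictionaries through small helpers so that a missing moveMagnitude is caught before the action is sent. The helpers below (move, rotate, look) are hypothetical and not part of robosims; they only produce dictionaries in the shapes shown above, and the check that the magnitude is positive is an assumption.

```python
MOVE_ACTIONS = {"MoveAhead", "MoveBack", "MoveLeft", "MoveRight"}


def move(direction, magnitude):
    """Build a continuous move action; moveMagnitude is always included."""
    if direction not in MOVE_ACTIONS:
        raise ValueError("unknown move action: %r" % (direction,))
    if magnitude <= 0:
        # Assumption for this sketch: magnitudes are positive distances.
        raise ValueError("moveMagnitude must be positive")
    return dict(action=direction, moveMagnitude=float(magnitude))


def rotate(rotation):
    """Build an action that rotates the agent to a rotation in global space."""
    return dict(action="Rotate", rotation=float(rotation))


def look(horizon):
    """Build an action that sets the horizon of the agent's camera."""
    return dict(action="Look", horizon=float(horizon))


planned = [move("MoveLeft", 0.25), rotate(75.0), look(35.0)]
```

Each dictionary could then be passed along as env.step(action=...), in the same way as the literal dictionaries in the examples above.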


The evaluation score is the average, over all tasks, of the ratio of the optimum number of actions ($o_t$) to the predicted number of actions ($p_t$):

$$\text{score} = \frac{1}{T} \sum_{t=1}^{T} \frac{o_t}{p_t}$$

where $T$ is the number of tasks. If the predicted set of actions does not place the agent in the correct location for a task, the score for that task is zero. The tasks include finding objects in different configurations of a scene from different starting locations.
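
To make the scoring rule concrete, the following sketch computes the score for a handful of made-up tasks. It is not the evaluation server's code: the function names are hypothetical, the numbers are invented for the example, and treating an empty action list as a score of zero is an assumption.

```python
def task_score(optimal_actions, predicted_actions, reached_target):
    """Score for one task: optimal / predicted, or 0 if the target was missed."""
    if not reached_target or predicted_actions <= 0:
        # Zero predicted actions is treated as a miss (assumption).
        return 0.0
    return optimal_actions / predicted_actions


def challenge_score(tasks):
    """Average the per-task scores over all tasks."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    return sum(task_score(*task) for task in tasks) / len(tasks)


# (optimal number of actions, predicted number of actions, reached target?)
example_tasks = [
    (10, 20, True),   # twice as many actions as needed: 0.5
    (5, 5, True),     # optimal path: 1.0
    (8, 12, False),   # wrong final location: 0.0
]
score = challenge_score(example_tasks)
print(score)  # 0.5
```

Under this rule, a longer path lowers the score gradually, while stopping in the wrong place forfeits the task entirely.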

If you use the THOR framework for your research, please cite the following paper:

@inproceedings{zhu2017icra,
  author    = {Yuke Zhu and Roozbeh Mottaghi and Eric Kolve and Joseph J. Lim and Abhinav Gupta and Li Fei{-}Fei and Ali Farhadi},
  title     = {Target-driven Visual Navigation in Indoor Scenes using Deep Reinforcement Learning},
  booktitle = {International Conference on Robotics and Automation (ICRA)},
  year      = {2017}
}


Roozbeh Mottaghi

Allen Institute for AI

Eric Kolve

Allen Institute for AI

Daniel Gordon

University of Washington