Charades Challenge

Recognize and locate activities taking place in a video

The Charades Activity Challenge aims at automatic understanding of daily activities by providing realistic videos of people doing everyday tasks. The Charades dataset offers a unique window into daily activities such as drinking coffee, putting on shoes while sitting in a chair, or snuggling with a blanket on the couch while watching something on a laptop. This enables computer vision algorithms to learn from real and diverse examples of everyday dynamic scenarios. The challenge consists of two separate tracks: a classification track and a localization track. The classification track ('Activity Classification') asks systems to recognize all activity categories in a given video, where multiple overlapping activities can occur in each video. The localization track ('Activity Localization') asks systems to find the temporal location of every activity in a video.


News


The Charades Challenge has a winner!

After heavy competition for first place among teams from Michigan, Disney Research/Oxford Brookes, Maryland, and DeepMind, TeamKinetics from DeepMind emerged as the winner of the 2017 Charades Challenge, taking both the Classification and Localization tracks.

The challenge significantly raised the state-of-the-art accuracy of human activity recognition on Charades, from 22% mAP to 34% mAP, and showcased many diverse approaches: a variety of video-level and frame-level models, pre-trained object detectors, and sound features.

Finally, external data played a large role in the competition, with methods using various combinations of ImageNet, MS-COCO, UCF101, Sports-1M, SoundNet, Kinetics, and Charades. The new activity dataset from DeepMind, Kinetics, proved superior for pre-training rich models before training on Charades.

The teams will present their work at the Workshop on Visual Understanding Across Modalities at CVPR, 10:30am-12:30pm on 7/26, along with invited talks from Jitendra Malik and Ivan Laptev. Everyone is encouraged to attend!


Winners of the Charades Challenge:
Joao Carreira, Brian Zhang, Andrew Zisserman
DeepMind

Classification Track Runner-Up:
Gurkirt Singh, Andreas Lehrmann, Leonid Sigal
Disney Research, Oxford Brookes University

Localization Track Runner-Up:
Jonathan Stroud, Kaiyu Yang, Hei Law, Jia Deng
University of Michigan


Action Recognition Results


Rank | Team | Accuracy (mAP) | Modeling Approach
1 | TeamKinetics | 0.3441 | I3D ConvNet with dense per-frame outputs
2 | DR/OBU | 0.2974 | Two parallel CNNs extracting appearance and optical-flow features and scores for each frame, plus a parallel audio stream using the SoundNet CNN, scored with an SVM
3 | UMICH-VL | 0.2811 | Ensemble of Temporal Hourglass Networks (THGs), a novel architecture of temporal convolutional layers applied to several types of frame-wise feature vectors
4 | UMD | 0.2535 | Ensemble of multiple ResNets with a temporal pooling layer
5 | SNU_MIPAL_V.DO | 0.2278 | External-memory-based neural network for video action recognition with frame-by-frame prediction
6 | ACRV_ANU | 0.2238 | Weakly supervised classifier that separates useful frames from noise in each sequence; its decision boundary serves as the sequence descriptor for action recognition
7 | sjson718 | 0.2185 | N/A
8 | UCF-CRCV | 0.2146 | 3D convolution using only RGB
9 | rohitg | 0.1619 | N/A
10 | YajieGuan | 0.1493 | N/A

Temporal Segmentation Results


Rank | Team | Accuracy (mAP) | Modeling Approach
1 | TeamKinetics | 0.2072 | I3D ConvNet with dense per-frame outputs
2 | UMICH-VL | 0.1803 | Ensemble of Temporal Hourglass Networks (THGs), a novel architecture of temporal convolutional layers applied to several types of frame-wise feature vectors
3 | DR/OBU | 0.1796 | Two parallel CNNs extracting appearance and optical-flow features and scores for each frame, plus a parallel audio stream using the SoundNet CNN, scored with an SVM
4 | UMD | 0.1396 | Ensemble of multiple ResNets with a temporal pooling layer
5 | BU-Disney | 0.1336 | End-to-end proposal generation and classification framework, built on the Faster R-CNN pipeline, for activity detection in untrimmed videos
6 | sjson718 | 0.1065 | N/A

Teams


For each team, the entry below lists the team members, method description, and external data used.
TeamKinetics
Members: Joao Carreira, Brian Zhang, Andrew Zisserman (DeepMind)
Method: The I3D model is described in a CVPR paper (https://arxiv.org/abs/1705.07750) and the Kinetics dataset in another paper (https://arxiv.org/abs/1705.06950). Our results on the leaderboard were obtained using a single RGB model that was fine-tuned for both video-level and frame-level classification. The model has a stride of 8 frames, and the predictions are made dense using a bilinear interpolation layer. For per-frame classification we pass the action predictions through a long temporal max-pooling layer.
External data: The Kinetics dataset (deepmind.com/kinetics)
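
As a rough illustration of the densification step described above (not TeamKinetics' actual code), coarse stride-8 logits can be upsampled back to frame rate and then max-pooled over a long temporal window; the shapes, window size, and function names below are assumptions:

```python
# Hedged sketch: upsample stride-8 per-segment logits to per-frame logits,
# then max-pool over a long temporal window. Shapes and the window size
# are illustrative, not TeamKinetics' actual settings.
import torch
import torch.nn.functional as F

def densify_and_pool(logits, num_frames, pool_window=64):
    # logits: (batch, classes, T/8) coarse predictions from a stride-8 backbone
    dense = F.interpolate(logits, size=num_frames, mode="linear",
                          align_corners=False)           # (batch, classes, num_frames)
    pooled = F.max_pool1d(dense, kernel_size=pool_window,
                          stride=1, padding=pool_window // 2)
    return pooled[..., :num_frames]                       # per-frame scores

coarse = torch.randn(1, 157, 30)                  # e.g. a 240-frame clip at stride 8
frame_scores = densify_and_pool(coarse, num_frames=240)
print(frame_scores.shape)                         # torch.Size([1, 157, 240])
```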
DR/OBU
Members: Gurkirt Singh (OBU, DR), Andreas Lehrmann (DR), Leonid Sigal (DR). DR = Disney Research; OBU = Oxford Brookes University, UK.
Method: At a high level, our approach consists of two parallel convolutional neural networks (CNNs) extracting static (i.e., independent) appearance and optical flow features for each frame, plus another parallel audio stream that extracts SoundNet CNN features and scores them with an SVM. We fuse information from the three streams using a convex combination of their respective classification scores to obtain the final result.

We train the overall network using a multi-task loss: (1) Classification: both streams produce a C-dimensional softmax score vector that is trained using back-propagation with a cross-entropy loss; (2) Regression: in addition to the classification scores, the appearance stream also produces 3-dimensional coefficients for each class describing the offsets from the boundaries of the current action as well as its overall duration. This network path is trained using a smooth L1 loss.

The audio stream extracts features with a pretrained SoundNet CNN and applies an SVM classifier in a sliding-window fashion. Audio scores are interpolated to the same frame rate as the other two streams' outputs.

We generate frame-level scores at 12 fps. For temporal action segmentation, we fuse the scores of the three streams at the frame level using a convex combination; the weight for each stream is found by cross-validation on the validation set. Finally, we produce a score vector for 25 regularly sampled frames using top-k mean pooling in a temporal window around those frames: the frame-level score for each class c is the mean of the top-20 frame-level scores of class c in a temporal window of size 40. Similarly, we apply top-k mean pooling on the scores for class c over the entire duration of the video to obtain video classification scores; a top-k value of 40 works well, found via cross-validation.
External data: Kinetics dataset for pretraining the CNNs and a pretrained SoundNet CNN for audio feature extraction.
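
A minimal NumPy sketch of the top-k mean pooling described above (top-20 within a window of 40 for frame scores, top-40 over the whole video); the function and variable names are illustrative, not the team's code:

```python
# Hedged sketch of top-k mean pooling over fused frame-level class scores.
import numpy as np

def topk_mean(x, k):
    k = min(k, len(x))
    return np.sort(x)[-k:].mean()

def pool_scores(frame_scores, window=40, k_frame=20, k_video=40):
    # frame_scores: (num_frames, num_classes) fused per-frame class scores
    T, C = frame_scores.shape
    # 25 regularly sampled time points for the localization submission
    centers = np.linspace(0, T - 1, 25).astype(int)
    loc = np.zeros((25, C))
    for i, t in enumerate(centers):
        lo, hi = max(0, t - window // 2), min(T, t + window // 2)
        for c in range(C):
            loc[i, c] = topk_mean(frame_scores[lo:hi, c], k_frame)
    vid = np.array([topk_mean(frame_scores[:, c], k_video) for c in range(C)])
    return loc, vid
```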
UMICH-VL
Members: *Jonathan Stroud, *Kaiyu Yang, Hei Law, Jia Deng (University of Michigan). *Equal contribution.
Method: We propose Temporal Hourglass Networks (THGs), a novel CNN architecture that detects actions at each frame in a video. The architecture performs repeated inference at multiple temporal scales, which we accomplish with temporal convolutions applied to frame-level image features. Specifically, we first apply several alternating layers of temporal convolutions and downsampling, followed by alternating layers of temporal convolutions and upsampling. We include skip connections to maintain high-resolution features from early in the network and reincorporate them later. We use the features produced by the THG to predict the presence or absence of each action, object, and verb at each video frame. Our final submission is an ensemble of several THGs, each with a different set of frame-wise input features. We use four input feature types: (1) RGB, (2) Flow, (3) Trajectories, and (4) Objects. RGB and Flow feature vectors are extracted from the final layer of the pretrained VGG models provided with the competition. Trajectory features are extracted from local descriptors (HOG, HOF, MBH) centered around salient trajectories and pooled with Fisher vectors, as in Wang & Schmid [2]. Object feature maps are extracted via the recently published Google object detection API [1]. We ensemble the frame-wise predictions from the THGs via a weighted average to create our localization submission. Our classification submission simply takes the maximum of these predictions before ensembling.

[1] Huang, Jonathan, et al. "Speed/accuracy trade-offs for modern convolutional object detectors." arXiv preprint arXiv:1611.10012 (2016).

[2] Wang, Heng, and Cordelia Schmid. "Action recognition with improved trajectories." Proceedings of the IEEE International Conference on Computer Vision. 2013.
External data: The provided RGB and Flow models, which are pretrained on ImageNet/UCF101, and an object detector trained on MS-COCO.
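
A compact PyTorch sketch of the temporal-hourglass idea (temporal convolutions with downsampling, upsampling, and a skip connection over frame-wise features); the depth, channel counts, and layer choices here are assumptions, not the authors' exact architecture:

```python
# Hedged sketch of a temporal hourglass over frame-wise features:
# temporal convs + downsampling, then temporal convs + upsampling,
# with a skip connection. Hyperparameters are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalHourglass(nn.Module):
    def __init__(self, in_dim, hidden=256, num_classes=157):
        super().__init__()
        self.enc1 = nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1)
        self.enc2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.dec = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.head = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, feats):                        # feats: (batch, in_dim, T)
        h1 = F.relu(self.enc1(feats))                # full temporal resolution
        h2 = F.relu(self.enc2(F.max_pool1d(h1, 2)))  # half resolution
        up = F.interpolate(self.dec(h2), size=h1.shape[-1], mode="linear",
                           align_corners=False)
        return self.head(F.relu(up + h1))            # per-frame class logits

model = TemporalHourglass(in_dim=4096)               # e.g. FC7 two-stream features
logits = model(torch.randn(2, 4096, 100))            # (2, 157, 100)
```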
UMD
Members: Joe Yue-Hei Ng, Larry S. Davis (University of Maryland, College Park)
Method: We extend ImageNet-pretrained convolutional networks with temporal max pooling to capture temporal context in the videos, inspired by Ng et al. [1]. Instead of aggregating frame features into a single prediction, max pooling is applied in a sliding-window manner to produce dense action classification outputs at each time step. Finally, the per-frame predictions are max-pooled to produce the video-level classification. We train three deep networks based on ResNet-50, ResNet-101 and ResNet-152, and the final outputs are produced by a weighted average of the three networks.

[1] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga and G. Toderici. Beyond Short Snippets: Deep Networks for Video Classification. CVPR 2015.
External data: ImageNet
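
A minimal sketch of the sliding-window temporal max pooling and three-network weighted average described above; the window size and weights are illustrative assumptions:

```python
# Hedged sketch: fuse per-frame class scores from three ResNets with a
# weighted average, apply sliding-window max pooling for dense per-frame
# outputs, then max-pool over time for the video-level prediction.
import torch
import torch.nn.functional as F

def fuse_and_pool(logits_list, weights, window=31):
    # logits_list: per-frame class scores from ResNet-50/101/152, each (B, C, T)
    fused = sum(w * l for w, l in zip(weights, logits_list))
    dense = F.max_pool1d(fused, kernel_size=window, stride=1, padding=window // 2)
    return dense, dense.max(dim=-1).values            # per-frame and video scores

nets = [torch.randn(1, 157, 200) for _ in range(3)]
dense, video = fuse_and_pool(nets, weights=[0.3, 0.3, 0.4])
print(dense.shape, video.shape)                       # (1, 157, 200) (1, 157)
```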
SNU_MIPAL_V.DO
Members: myunggi, gyutae, sjson718, david.seungeui.lee (Seoul National University & V.DO Inc)
Method: We use an external-memory-based neural network in which the memory stores information from previous frames. When the current frame arrives, we retrieve the stored information from memory; the encoded previous features and the current feature are summed before going through a classifier for prediction. We use a novel memory-updating method that applies max pooling only along the depth (channel) dimension. Predictions are made frame by frame, and the frame scores are averaged to obtain the final score for the video.
External data: None
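
A rough sketch of one way to read the memory update described above (an elementwise max along the channel dimension, summation with the current feature, then a classifier); the classifier, feature dimension, and the exact update rule are assumptions:

```python
# Hedged sketch: an external memory updated by channel-wise max pooling,
# summed with the current frame feature before a classifier; per-frame
# scores are averaged for the video prediction. Illustrative only.
import torch
import torch.nn as nn

classifier = nn.Linear(2048, 157)                 # assumed frame classifier

def predict_video(frame_feats):                   # frame_feats: (T, 2048)
    memory = torch.zeros_like(frame_feats[0])
    scores = []
    for feat in frame_feats:
        fused = feat + memory                     # combine current frame with memory
        scores.append(classifier(fused))
        memory = torch.maximum(memory, feat)      # channel-wise max update
    return torch.stack(scores).mean(dim=0)        # video-level score

video_score = predict_video(torch.randn(100, 2048))
```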
ACRV_ANU
Members: Jue Wang (ANU), Anoop Cherian (ANU & ACRV)
Method: Most popular deep-learning models for action recognition generate separate predictions within short temporal windows, which are then aggregated by heuristic means to assign an action label to the full video segment. Given that not all frames of a video characterize the underlying action, pooling schemes that assign equal importance to all frames can be unfavorable.
In an attempt to tackle this challenge, we propose a novel pooling scheme, dubbed SVM pooling, based on the notion that among the bag of features generated by a CNN on all temporal windows, there is at least one feature that characterizes the action. To this end, we learn a decision hyperplane that separates this unknown yet useful feature from the rest. Applying multiple-instance learning in an SVM setup, we use the parameters of this separating hyperplane as a descriptor for the video. Since these parameters are directly related to the support vectors in a max-margin framework, they serve as robust representations for pooling the CNN features. We devise a joint optimization objective and an efficient solver that learns these hyperplanes per video and the corresponding action classifiers over the hyperplanes. arXiv: https://arxiv.org/abs/1704.01716
External data: None
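
A minimal scikit-learn sketch of the SVM-pooling idea: fit a hyperplane separating a video's frame features from a shared pool of "noise" features and use its parameters as the video descriptor. The explicit negative pool and plain linear SVM are simplifications of the joint multiple-instance objective in the paper:

```python
# Hedged sketch of SVM pooling: the separating hyperplane's parameters
# serve as the pooled video representation. Names are illustrative.
import numpy as np
from sklearn.svm import LinearSVC

def svm_pool(video_feats, noise_feats, C=1.0):
    # video_feats: (T, D) CNN frame features; noise_feats: (N, D) background pool
    X = np.vstack([video_feats, noise_feats])
    y = np.concatenate([np.ones(len(video_feats)), np.zeros(len(noise_feats))])
    svm = LinearSVC(C=C).fit(X, y)
    return np.concatenate([svm.coef_.ravel(), svm.intercept_])   # descriptor

desc = svm_pool(np.random.randn(64, 512), np.random.randn(256, 512))
```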
sjson718: N/A
UCF-CRCV
Members: Mahdi M. Kalayeh (UCF, Center for Research in Computer Vision (CRCV))
Method: N/A
Code: https://github.com/MahdiKalayeh/
External data: C3D model pretrained on Sports-1M
rohitg: N/A
YajieGuan: N/A
BU-Disney
Members: Huijuan Xu (Department of Computer Science, Boston University, USA), Abir Das (Department of Computer Science, Boston University, USA), Leonid Sigal (Disney Research, Pittsburgh, USA), Kate Saenko (Department of Computer Science, Boston University, USA)
Method: We propose a fast end-to-end Region Convolutional 3D Network (R-C3D) for activity detection in continuous video streams. The network encodes the frame buffer with fully convolutional 3D filters, proposes activity segments, then classifies and refines them based on pooled features within their boundaries. Our model improves both speed and accuracy compared to existing methods. arXiv: https://arxiv.org/pdf/1703.07814.pdf
External data: Sports-1M

Competition


The test data has been released and the evaluation server is running!
The evaluation server is being hosted by Codalab and can be found here.

A prize of $3,000 will be awarded separately to the highest scoring entrant of the classification and localization subtasks.

To submit results, participants should:

  • 1) Sign up for a Codalab account by clicking the Sign Up button on the competition page linked above.
  • 2) Click on the Participate tab, agree to the terms and conditions, and click register.
  • 3) If needed, navigate to the Participate tab and click on the Get Data side tab to download the train and dev sets.
  • 4) Navigate to the Learn the Details tab and click on the Evaluation tab in the sidebar to read about the submission format required.
  • 5) After your request is approved, navigate to the Participate tab and then click on the Submit/View Results tab in the sidebar.
  • 6) Click the submit button and upload a results file.
  • 7) After your submission is scored you will have a chance to review it before posting it to the leaderboard.
  • 8) If you have questions, please ask them in the Charades competition forum (located under the Forum tab)

Dataset

The training and validation sets for this challenge come directly from the Charades dataset (http://allenai.org/plato/charades). This includes 9,848 videos containing 66,500 temporal annotations for 157 action classes. A new unseen test set with 2,000 videos will be released closer to the submission deadline. The following components are made available:

Training Set: 7,985 videos of average length 30.1 seconds with rich annotations for 157 activities, including temporal boundaries, objects, and verbs.
Validation Set: 1,863 videos with the same level of annotation detail.
Test Set: 2000 videos with withheld ground truth, made available closer to the submission deadline.

The videos in this challenge contain on average 6.8 actions per video and were created by hundreds of people in their own homes.

Files:
README
LICENSE
VU17_Charades.zip (Annotations and evaluation scripts)
Training and Validation videos (scaled to 480p, 13 GB)
Training and Validation videos (original size) (55 GB)
Training and Validation videos as RGB frames at 24fps (76 GB)
Training and Validation videos as Optical Flow at 24fps (45 GB)
Training and Validation videos as Two-Stream FC7 features (RGB stream, 12 GB)
Training and Validation videos as Two-Stream FC7 features (Flow stream, 15 GB)
Code for baseline algorithms @ GitHub

VU2017 Test Set Videos (List of test videos)
README
Testing Videos (scaled to 480p, 2 GB)
Testing Videos (original size) (13 GB)
Testing Videos as RGB frames at 24fps (15 GB)
Testing Videos as Optical Flow at 24fps (10 GB)
Testing Videos as Two-Stream FC7 features (RGB stream, 2.6 GB)
Testing Videos as Two-Stream FC7 features (Flow stream, 3.2 GB)


Dates

The tentative dates for the competition are:

Test set is released June 6th
Final submission July 15th (11:59PM GMT)
Winners announced July 20th
Conference workshop July 26th


Evaluation

The VU2017 Charades challenge has the following two tasks:

Multi-label Video Classification Task
The task accepts submissions that use any available training data to train a model that predicts a score for each of the 157 classes in each video. The performance of the algorithms is evaluated with mean average precision (mAP) across the videos.

Multi-label Action Localization Task
The task accepts submissions that use any available training data to train a model that predicts a score for each of the 157 classes at 25 equally spaced time points in each video. The performance of the algorithms is evaluated with mean average precision (mAP) across all frames in all the videos.

Official evaluation scripts for both the Classification and Localization submission files are provided in VU17_Charades.zip. For more details, please see the README file and VU17_Charades.zip.
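
For intuition, a minimal sketch of the metric (the official scripts in VU17_Charades.zip remain the reference):

```python
# Hedged sketch of mean average precision over the 157 classes, as used for
# both tasks: classification treats each video as one example; localization
# treats each of the 25 sampled time points in every video as an example.
import numpy as np
from sklearn.metrics import average_precision_score

def charades_map(y_true, y_score):
    # y_true: (num_examples, 157) binary labels; y_score: same shape, real scores
    aps = []
    for c in range(y_true.shape[1]):
        if y_true[:, c].any():                     # skip classes with no positives
            aps.append(average_precision_score(y_true[:, c], y_score[:, c]))
    return float(np.mean(aps))

# classification: rows are videos; localization: rows are (video, time point) pairs
print(charades_map(np.random.rand(50, 157) > 0.9, np.random.rand(50, 157)))
```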

For any questions, please contact vuchallenge@allenai.org.


Organizers


Gunnar Sigurdsson

Carnegie Mellon University


Jonghyun Choi

Allen Institute for AI