Video understanding tasks such as action recognition and caption generation are crucial for real-world applications in surveillance, video retrieval, human behavior understanding, and more. In this work, we present a generic recurrent module that detects relationships and interactions between arbitrary object groups for fine-grained video understanding. The module is applicable to a variety of open-domain video understanding problems. We validate our method on two video understanding tasks with new, challenging datasets: fine-grained action recognition on Kinetics and visually grounded video captioning on ActivityNet Captions.
In this post, we first introduce the concept and motivation of the proposed method for human action recognition. We then show how the same concept can be extended to generate a sentence description of a video. For details of the proposed method, please refer to our paper here.
From object interactions to human action recognition
Recent approaches for video understanding have demonstrated significant improvements on public datasets such as UCF101, HMDB51, Sports1M, THUMOS, ActivityNet, and YouTube8M. They often focus on representing the overall visual scene (coarse-grained) as a sequence of inputs that are combined with temporal pooling methods, e.g. CRF, LSTM, 1D Convolution, attention, and NetVLAD. Given the state-of-the-art methods, it is relatively easy for a machine to tell playing tennis apart from playing basketball by relying on the overall scene representation.
However, human actions often involve complex interactions across several objects in the scene. The approaches above ignore the fine-grained details of the scene and do not infer interactions between the objects in the video. For example, in the figure below, the two video frames share similar representations of both the background scene and the person; the difference between skiing and snowboarding lies in how the person interacts with the skis or the snowboard.
What distinguishes human actions is how the human interacts with certain objects, so a model should not rely too heavily on the scene representation. For instance, the two video frames have similar scene representations, but the human activities in them are semantically different.
A question that naturally follows from the example above is: can this problem be solved if the machine can detect the object being interacted with?
The answer is no, since there can be many different possible interactions between a human and a common object. For instance, distinguishing between dribbling basketball, dunking basketball, and shooting basketball requires the model to identify how the player interacts with the basketball. Therefore, our goal in this work is not only to detect the objects being interacted with but also to identify how they are being interacted with. Excitingly, this is not a trivial problem to solve.
But we want even more than detecting pairwise object interactions
Typically, object interaction methods (in the image domain) focus on pairwise interactions (left). In this work, we are further interested in efficiently modeling the interactions between arbitrary subgroups of objects, in which the inter-object relationships in one group are detected and objects with significant relationships (i.e. those that serve to improve action recognition or captioning in the end) are attentively selected (right). We define this interaction between groups of selected object relationships as higher-order interactions.
Why are object interactions and temporal reasoning challenging?
We first define an object to be a region in the scene that might be used to determine visual relationships and interactions. This can be a rigid object, a person, or even a region of the background scene.
Unfortunately, we can only have features, not the classes of the objects
To understand the relationships and interactions between potential objects, we would ideally first identify what these objects are. Running state-of-the-art object detectors, however, often fails to identify the objects reliably because of a cross-domain problem: the detectors are pre-trained on data from a different domain than the target videos. Furthermore, we are bounded by the number of object classes in the dataset the detector was pre-trained on, e.g. 80 classes in MS-COCO. As a result, it is very likely that detected objects are labeled as the most common classes, like person and car, or that the object detector misses a potentially interesting object completely just because it was not trained to detect it.
Limited by these constraints, we can only use the feature representations obtained from a Region Proposal Network (RPN). Note that we do not know which objects correspond to each other across time, since linking objects through time can be computationally expensive and may not be suitable if the video sequence is long.
As a result, we have variable-length sets of objects residing in a high-dimensional space that spans across time. Our objective is to efficiently detect higher-order interactions from these rich yet unordered sets of object representations across time.
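To make the data layout concrete, here is a minimal sketch of one common way to batch variable-length, unordered object sets: pad each frame's set to a shared size and keep a boolean mask. This is an illustration only, not necessarily how our implementation stores the features; the function name `pad_object_sets` and the padding scheme are assumptions.

```python
import numpy as np

def pad_object_sets(object_sets):
    """Stack per-frame object features into one padded array plus a mask.

    object_sets: list of T arrays, each of shape (N_t, D), where N_t may
    differ from frame to frame (hypothetical layout for this sketch).
    Returns (padded, mask) with shapes (T, N_max, D) and (T, N_max).
    """
    num_frames = len(object_sets)
    n_max = max(s.shape[0] for s in object_sets)
    dim = object_sets[0].shape[1]
    padded = np.zeros((num_frames, n_max, dim), dtype=object_sets[0].dtype)
    mask = np.zeros((num_frames, n_max), dtype=bool)
    for t, objs in enumerate(object_sets):
        padded[t, :objs.shape[0]] = objs   # real objects first
        mask[t, :objs.shape[0]] = True     # True marks a real object
    return padded, mask
```

The mask lets later steps (for example an attention softmax) ignore the padded slots, and because the sets are unordered, the position of an object within a frame carries no meaning.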
Recurrent Higher-Order Interaction (HOI)
Toward this end, we propose the Recurrent Higher-Order Interaction (HOI) module, which dynamically selects K groups of arbitrary objects with detected inter-object relationships via a learnable attention mechanism. This attentive selection module uses the overall image context representation, the current set of (projected) objects, and previous object interactions to perform K attentive selections via efficient dot-product operations. The higher-order interaction between groups of selected objects is then modeled via concatenation followed by an LSTM cell. Please refer to our paper for further details of the proposed method.
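The attentive selection step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the exact formulation from the paper: the function name `attentive_selection`, the weight shapes, the way the query is formed by summing the projected context and previous state, and the scaling by the square root of the projection size are all choices made for this sketch. The LSTM cell that consumes the concatenated output is omitted.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_selection(objects, context, prev_state, W_obj, W_ctx, W_prev):
    """Perform K dot-product attentive selections over one frame's objects.

    objects:    (N, D)    object features for the current frame
    context:    (C,)      overall image context representation
    prev_state: (H,)      summary of previous object interactions
    W_obj:      (K, D, P) per-selection object projections (assumed shape)
    W_ctx:      (K, C, P) per-selection context projections (assumed shape)
    W_prev:     (K, H, P) per-selection state projections (assumed shape)
    Returns the concatenated selections (K * P,) and the weights (K, N).
    """
    selections, weights = [], []
    for k in range(W_obj.shape[0]):
        keys = objects @ W_obj[k]                             # (N, P)
        query = context @ W_ctx[k] + prev_state @ W_prev[k]   # (P,)
        alpha = softmax(keys @ query / np.sqrt(keys.shape[1]))  # (N,)
        selections.append(alpha @ keys)   # weighted sum of projected objects
        weights.append(alpha)
    # Concatenating the K selected groups forms the input to the LSTM cell.
    return np.concatenate(selections), np.stack(weights)
```

Each of the K selections costs one matrix-vector product over the N objects, so the cost grows linearly with the number of objects rather than with the number of object pairs or subgroups.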
So, what objects and interactions are detected?
Because the proposed method selects the objects whose interactions it detects, we can qualitatively show which objects and interactions are detected when predicting human actions.
In the figure above, the top row shows the original video frames with the selected objects (ROIs). The edge of each object's bounding box is weighted by the object's importance in making the correct action prediction. We can visualize the regions that the machine sees by using the weight of each region as its transparency ratio: the brighter a region is, the more important it is. The third row shows the weight distribution over objects (30 objects in this example). The value on the y-axis indicates the importance of a particular object.
In this figure, we show the proposed method correctly predicting Tobogganing.
Identifying Tobogganing essentially requires three elements: a toboggan, a snow scene, and a human sitting on top of the toboggan. These three key elements are accurately identified and their interactions are highlighted, as we can see from t = 1 to t = 3. Note that the model is able to keep tracking the person and the toboggan throughout the whole video, even though they appear extremely small towards the end of the video. We can also notice that our method completely ignores the background scene in the last several video frames, since it is not informative there: it can easily be confused with the 18 other action classes involving snow and ice, e.g. Making snowman, Ski jumping, Skiing cross-country, Snowboarding, etc.
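The transparency visualization described above can be sketched as follows. This is a minimal illustration, not the code used to produce the figure: the function name `attention_overlay`, the integer (x1, y1, x2, y2) box format, and the choice of taking the maximum weight where boxes overlap are all assumptions. It also assumes at least one weight is positive, which holds for softmax outputs.

```python
import numpy as np

def attention_overlay(frame, boxes, weights):
    """Dim each region of a frame according to its object attention weight.

    frame:   (H, W, 3) image array
    boxes:   (N, 4) integer boxes as (x1, y1, x2, y2), exclusive on x2/y2
    weights: (N,) non-negative attention weights, one per box
    Pixels outside every box become black; the top-weighted box keeps
    its full brightness.
    """
    height, width = frame.shape[:2]
    mask = np.zeros((height, width), dtype=np.float64)
    scaled = np.asarray(weights, dtype=np.float64) / np.max(weights)
    for (x1, y1, x2, y2), alpha in zip(boxes, scaled):
        region = mask[y1:y2, x1:x2]              # view into the mask
        np.maximum(region, alpha, out=region)    # keep the largest weight
    return (frame * mask[..., None]).astype(frame.dtype)
```

Normalizing by the largest weight keeps the most important region at its original brightness, so frames with a flat weight distribution are not rendered almost entirely black.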
From object interactions to video captioning
In the second part of the blog, we will discuss how the method proposed for modeling object interactions can be extended for generating a sentence description for a video.
Our motivation is quite straightforward. We argue that a sentence description of a scene (for images and videos) can be decomposed into several relationship components. Therefore, we hypothesize that given a set of detected object relationships and interactions, we can compose them into a complete sentence description.
Our model efficiently explores and grounds caption generation over interactions between arbitrary subgroups of objects, the members of which are determined by a learned attention mechanism, as we have shown for recognizing human actions.
We first attentively model object inter-relationships and discover the higher-order interactions for a video. The detected higher-order object interactions (fine-grained) and the overall image representation (coarse-grained) are then temporally attended as the visual cue for generating each word.
Just as we showed how the model focuses on objects and interactions for action recognition, we can also demonstrate how the model uses objects and interactions to generate each word. In the figure above, t indicates the video timestep. We can see that the proposed method often focuses on the person and the wakeboard and, most importantly, highlights the interaction between the two, i.e. the person steps on the wakeboard. It then progressively generates: The man is then shown on the water skiing.
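As a rough sketch of the temporal attention step, the snippet below attends over the video timesteps separately for the coarse-grained and fine-grained features and joins the two results into one visual cue. The additive (tanh) scoring function, the concatenation of the two attended vectors, and all names and shapes here are assumptions made for illustration; the paper gives the actual formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(features, decoder_state, W_feat, W_state, v):
    """Attend over T timesteps given the current decoder state.

    features:      (T, D) one feature vector per video timestep
    decoder_state: (H,)   state of the language decoder before the next word
    W_feat: (D, A), W_state: (H, A), v: (A,)  hypothetical parameters
    Returns the attended feature (D,) and the temporal weights (T,).
    """
    hidden = np.tanh(features @ W_feat + decoder_state @ W_state)  # (T, A)
    beta = softmax(hidden @ v)                                     # (T,)
    return beta @ features, beta

def visual_cue(coarse, fine, decoder_state, params_coarse, params_fine):
    """Build the visual cue for one word from both feature streams."""
    attended_coarse, _ = temporal_attention(coarse, decoder_state, *params_coarse)
    attended_fine, _ = temporal_attention(fine, decoder_state, *params_fine)
    return np.concatenate([attended_coarse, attended_fine])
```

Because the decoder state changes after every generated word, the temporal weights are recomputed at each step, which is what lets different words be grounded in different parts of the video.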
Distinguishing interactions when common objects are present
A common problem with state-of-the-art captioning models is that they often lack an understanding of the relationships and interactions between objects, and this is oftentimes the result of dataset bias. For instance, when the model detects both a person and a horse, the predicted caption is very likely to be: A man is riding on a horse, regardless of whether the person has a different type of interaction with the horse.
We are thus interested in finding out whether the proposed method can distinguish different types of interactions when common objects are present in the scene. In the example figure shown below, every video shares a common object in the scene: a horse. We show the verb (interaction) extracted from a complete sentence as captured by our proposed method.
- People are riding horses.
- A woman is brushing a horse.
- People are playing polo on a field.
- The man ties up the calf.
While all videos involve horses in the scene, our method successfully distinguishes the interactions of the human and the horse.
To summarize, we introduce a computationally efficient fine-grained video understanding approach for discovering higher-order object interactions. Our experiments on large-scale action recognition and video captioning datasets demonstrate that learning higher-order object relationships provides higher accuracy than existing methods at low computational cost. To the best of our knowledge, this is the first work to model object interactions on open-domain, large-scale video datasets.
This post is based on the following paper:
Attend and Interact: Higher-Order Object Interactions for Video Understanding
Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf.
CVPR 2018. (PDF)