In this paper, the authors propose a long-term feature bank: supportive information extracted over the entire span of a video, used to augment state-of-the-art video models that would otherwise view only short clips of 2-5 seconds. The bank stores a rich, time-indexed representation of the entire movie. Intuitively, it stores features that encode information about past and (if available) future scenes, objects, and actions. This information provides supportive context that allows a video model, such as a 3D convolutional network, to better infer what is happening in the present.
Long-Term Feature Bank Models
The authors describe how their method can be used for the task of spatial-temporal action localization, where the goal is to detect all actors in a video and classify their actions. Most state-of-the-art methods combine a ‘backbone’ 3D CNN with a region-based person detector. To process a video, these methods split it into short clips of 2-5 seconds, which are independently forwarded through the 3D CNN to compute a feature map. The feature map is then used with region proposals and region of interest (RoI) pooling to compute RoI features for each candidate actor. This approach captures only short-term information.
The central idea in this method is to extend this approach with two new concepts:
- a long-term feature bank that intuitively acts as a ‘memory’ of what happened during the entire video – the authors compute this as RoI features from detections at regularly sampled time steps; and
- a feature bank operator (FBO) that computes interactions between the short-term RoI features (describing what actors are doing now) and the long-term features. The interactions may be computed through an attentional mechanism, such as a non-local block, or by feature pooling and concatenation.
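To make the two kinds of interaction concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names `attention_fbo` and `pooling_fbo` are hypothetical. The attention variant is a simplified, projection-free stand-in for a non-local block (no learned weights, normalization, or dropout), and the pooling variant assumes average pooling.

```python
import math


def attention_fbo(short_term, long_term):
    """Simplified attention between short-term and long-term features.

    short_term: list of N vectors (RoI features for actors in the current clip).
    long_term:  list of M vectors (features read from the long-term bank).
    Returns N vectors: each short-term feature plus an attention-weighted
    average of the long-term features (a residual connection).
    Assumption: no learned projections, unlike a real non-local block.
    """
    dim = len(short_term[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in short_term:
        # Scaled dot-product similarity between this actor and every bank entry.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in long_term]
        # Numerically stable softmax over the bank entries.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Attention-weighted average of the long-term features.
        context = [
            sum(w * value[d] for w, value in zip(weights, long_term))
            for d in range(dim)
        ]
        outputs.append([q + c for q, c in zip(query, context)])
    return outputs


def pooling_fbo(short_term, long_term):
    """Average-pool the long-term features and concatenate the result
    onto each short-term feature (assumed pooling type: average)."""
    dim = len(long_term[0])
    pooled = [sum(v[d] for v in long_term) / len(long_term) for d in range(dim)]
    return [list(query) + pooled for query in short_term]
```

The attention variant lets each actor weight bank entries by relevance, while the pooling variant gives every actor the same summary of the bank; the output of either would then feed the action classifier.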
Long-Term Feature Bank
The goal of the long-term feature bank, L, is to provide relevant contextual information to aid recognition at the current time step. For the task of spatial-temporal action localization, the authors run a person detector over the entire video to generate a set of detections for each frame.
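The construction described above can be sketched as a time-indexed list of per-detection features. This is an illustration under stated assumptions, not the authors' code: `detect_people` and `extract_roi_features` are hypothetical callables standing in for the person detector and the RoI feature extractor, and the `window` lookup (reading entries around the current time step) is an assumed way of consuming the bank.

```python
def build_feature_bank(num_steps, detect_people, extract_roi_features):
    """Return a list indexed by time step.

    Entry t holds one feature vector per person detected at step t.
    Both callables are hypothetical placeholders for real models.
    """
    bank = []
    for t in range(num_steps):
        boxes = detect_people(t)
        bank.append([extract_roi_features(t, box) for box in boxes])
    return bank


def window(bank, t, half_width):
    """Flatten the features from time steps within half_width of t.

    The range is clipped to the video boundaries, so past and (if
    available) future entries are both included. Assumed access pattern.
    """
    start = max(0, t - half_width)
    end = min(len(bank), t + half_width + 1)
    return [feat for step in bank[start:end] for feat in step]
```

Because the bank is computed once per video, features for any time step can be read back without rerunning the detector.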