Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias in the learned model, which may adversely affect its generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) at inference time on a sample basis. T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed using a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
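For concreteness, the first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes precomputed, L2-normalized CLIP-style frame embeddings and class-name text embeddings (random placeholders below), selects the video-level pseudo-label by aggregating frame-text similarities over the whole video, and turns the resulting similarity curve into proposals with a simple adaptive threshold standing in for the paper's self-supervised refinement.

```python
import numpy as np

def video_pseudo_label(frame_feats, text_feats):
    """Pick a video-level action pseudo-label by aggregating
    frame-text similarities over the entire video.

    frame_feats: (T, D) L2-normalized frame embeddings.
    text_feats:  (C, D) L2-normalized class-name embeddings.
    Returns the predicted class index and the per-frame
    similarity curve for that class.
    """
    sims = frame_feats @ text_feats.T      # (T, C) cosine similarities
    video_scores = sims.mean(axis=0)       # aggregate over time
    c = int(video_scores.argmax())         # video-level pseudo-label
    return c, sims[:, c]

def propose_actions(scores, threshold=None):
    """Group contiguous above-threshold frames into action proposals
    (a simple stand-in for the self-supervised localization step)."""
    if threshold is None:
        threshold = scores.mean()          # adaptive, per-video threshold
    active = scores > threshold
    proposals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(active) - 1))
    return proposals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features standing in for CLIP visual/text embeddings.
    frames = rng.normal(size=(200, 512))
    frames /= np.linalg.norm(frames, axis=1, keepdims=True)
    texts = rng.normal(size=(20, 512))
    texts /= np.linalg.norm(texts, axis=1, keepdims=True)

    cls, curve = video_pseudo_label(frames, texts)
    print("pseudo-label:", cls)
    print("proposals:", propose_actions(curve))
```

In the actual method the similarity curve is also refined at test time with frame-level captions; the sketch above only conveys the overall aggregate-then-localize structure.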
In each of the following illustrations we show a video from THUMOS14 and the prediction of our proposed method \(T^3AL\). Each visualization shows the ground-truth and predicted classes, together with the per-frame similarity to the pseudo-label. In the similarity plot, temporal ground-truth action intervals are highlighted in green, predicted action proposals in blue, and overlapping areas are indicated by parallel diagonal lines. The red slider visually represents the progression of time within the video. Videos are best viewed in full screen.