Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias in the learned model, which may adversely affect its generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this end, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM) at inference time on a sample basis. T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed using a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed to refine the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
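For concreteness, the first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes precomputed, L2-normalized CLIP-style frame embeddings and class-name text embeddings (random placeholders below), selects the video-level pseudo-label by aggregating frame-text similarities over the whole video, and turns the resulting similarity curve into proposals with a simple adaptive threshold standing in for the paper's self-supervised refinement.

```python
import numpy as np

def video_pseudo_label(frame_feats, text_feats):
    """Pick a video-level action pseudo-label by aggregating
    frame-text similarities over the entire video.

    frame_feats: (T, D) L2-normalized frame embeddings.
    text_feats:  (C, D) L2-normalized class-name embeddings.
    Returns the predicted class index and the per-frame
    similarity curve for that class.
    """
    sims = frame_feats @ text_feats.T      # (T, C) cosine similarities
    video_scores = sims.mean(axis=0)       # aggregate over time
    c = int(video_scores.argmax())         # video-level pseudo-label
    return c, sims[:, c]

def propose_actions(scores, threshold=None):
    """Group contiguous above-threshold frames into action proposals
    (a simple stand-in for the self-supervised localization step)."""
    if threshold is None:
        threshold = scores.mean()          # adaptive, per-video threshold
    active = scores > threshold
    proposals, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(active) - 1))
    return proposals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder features standing in for CLIP visual/text embeddings.
    frames = rng.normal(size=(200, 512))
    frames /= np.linalg.norm(frames, axis=1, keepdims=True)
    texts = rng.normal(size=(20, 512))
    texts /= np.linalg.norm(texts, axis=1, keepdims=True)

    cls, curve = video_pseudo_label(frames, texts)
    print("pseudo-label:", cls)
    print("proposals:", propose_actions(curve))
```

In the actual method the similarity curve is also refined at test time with frame-level captions; the sketch above only conveys the overall aggregate-then-localize structure.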
In each of the following illustrations we show a video from THUMOS14 and the prediction of our proposed method \(T^3AL\). Each visualization shows the ground-truth and predicted classes, together with the per-frame similarity to the pseudo-label. In the similarity plot, temporal ground-truth action intervals are highlighted in green, predicted action proposals in blue, and overlapping areas are indicated by parallel diagonal lines. The red slider visually represents the progression of time within the video. Videos are best viewed in full screen.