LookWhen? Fast Video Recognition by
Learning When, Where, and What to Compute

Ali Salamatian^1●, Anthony Fuller^2,3●★, Pritam Sarkar^1,3, James R. Green^2, Leonid Sigal^1,3★, Evan Shelhamer^1,2,3★

¹University of British Columbia • ²Carleton University • ³Vector Institute

^●Co-first author ^★Co-advising author

Preprint

LookWhen token selection on a diving video

LookWhen selects the most unique tokens across space and time on a 16-frame diving clip. Highlighted patches are scored highest by the selector and passed to the deep extractor. Coverage concentrates on the diver and the water disturbance, not the static background.

Abstract

Transformers dominate video recognition. They split videos into tokens, and processing those tokens incurs a superlinear computational cost. Yet videos are filled with redundancy, so we can question the need for this expense. We introduce LookWhen, a selector–extractor framework that factorizes video recognition into learning when, where, and what to compute. Our shallow selector gets a scaled-down video and quickly scores all tokens across space-time, while our deep extractor gets the top-K selected tokens to approximate full-video representations without actually processing all the tokens. A key challenge is defining effective supervision for selection and extraction. For selection pre-training, we introduce a score on representations that ranks tokens by uniqueness using a simple nearest-neighbor distance. For extraction pre-training, we distill both a video teacher and an image teacher, normalizing the image teacher's frame-wise representations to learn what changes within videos. Through experiments on Kinetics-400, SSv2, Epic-Kitchens, Diving48, Jester, and Charades, we show that LookWhen achieves a better accuracy-computation trade-off than efficient models and upgraded baselines of similar size. LookWhen Pareto-dominates in accuracy-FLOPs on 9 of 12 cases (6 tasks × 2 settings) and is 6.7× faster than InternVideo2-B at equal accuracy.

Motivation

Video transformers split clips into space-time tokens and process them all with expensive, superlinear computation, yet most tokens are redundant. Static backgrounds, slowly-changing regions, and repeated frames all carry little new information. Existing token reduction methods (ToMe, RLT, vid-TLDR) still process all tokens in early layers before merging or pruning, limiting efficiency. Others skip tokens but cannot recover accuracy at high sparsity.

The key question: which tokens actually matter, and can we decide before running expensive layers? Our answer is to train a lightweight, low-resolution selector that scores every token for uniqueness, and to only run the deep extractor on the selected top-K.

Comparison of InternVideo2 attention, DINOv3 attention, and DINOv3 top-1 distance

Why not just use attention maps? Across 8 frames of a wolf entering the scene, InternVideo2 attention and DINOv3 attention stay sparse and noisy until the wolf is large and salient. Our top-1 distance score (each patch's distance to its nearest neighbor in feature space) reacts earlier and covers the wolf more completely as soon as it appears, making it a better target for the selector to learn from.
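The top-1 distance score described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `top1_distance` is hypothetical, random toy vectors stand in for DINOv3 features, and the use of cosine distance is an assumption (the text only says "distance").

```python
import numpy as np

def top1_distance(features: np.ndarray) -> np.ndarray:
    """Score each token by the distance to its nearest neighbor.

    features: (N, D) array, one row per space-time token.
    Returns an (N,) array; larger values mean a more unique token.
    """
    # Assumption: cosine distance on L2-normalized features.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    dist = 1.0 - normed @ normed.T  # pairwise distances, shape (N, N)
    # A token must not count as its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)

# Toy input: five near-identical "background" tokens and one distinct token.
rng = np.random.default_rng(0)
background = np.tile([1.0, 0.0, 0.0, 0.0], (5, 1)) + 0.01 * rng.normal(size=(5, 4))
distinct = np.array([[0.0, 1.0, 0.0, 0.0]])
scores = top1_distance(np.concatenate([background, distinct]))
print(scores.argmax())  # index of the most unique token
```

In this toy case the redundant background tokens all have a close neighbor and score near zero, while the distinct token has none and scores highest, which is the behavior the selector is trained to predict.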

Method: Selector–Extractor

(a) Inference and fine-tuning: the selector scores downscaled dense tokens, top-K tokens are gathered into sparse tokens, and the extractor processes them into video/frame/patch features. (b) Computing targets for pre-training: a video teacher supplies global video targets; an image teacher's per-frame features are normalized to produce patch, frame, and video targets, and their top-1 distance produces the selector's uniqueness target.

LookWhen factorizes recognition into three learned decisions:

When & where (Selector): A shallow ViT-B processes a scaled-down video and scores each space-time token by predicted uniqueness. This is fast: low resolution means far fewer tokens, and the network is shallow.
What (Extractor): A deep ViT-B receives only the top-K tokens from the full-resolution video (typically 5–30% of all tokens, with the budget preset by the user) and extracts rich features from this sparse input.
Pre-training targets:
- For selection: top-1 distance, each patch's distance to its nearest neighbor in the DINOv3 feature space, measuring how unique that patch is relative to other spatial locations and frames.
- For extraction: distillation from two teachers, a video teacher (InternVideo2) for global representations, and a frame-normalized image teacher (DINOv3) to amplify what changes within the video.
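To make the "frame-normalized" image-teacher target concrete, here is one plausible reading, sketched under stated assumptions. The text does not specify the exact normalization, so standardizing each feature along the time axis is an assumption, as are the function name `normalize_over_time` and the toy shapes.

```python
import numpy as np

def normalize_over_time(frame_feats: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize per-frame features along the time axis.

    frame_feats: (T, P, D) array of T frames, P patches, D channels.
    Assumption: subtract the temporal mean and divide by the temporal
    standard deviation, so static content cancels out.
    """
    mean = frame_feats.mean(axis=0, keepdims=True)
    std = frame_feats.std(axis=0, keepdims=True)
    return (frame_feats - mean) / (std + eps)

# Toy clip: patch 0 is static, patch 1 changes linearly over 4 frames.
feats = np.zeros((4, 2, 3))
feats[:, 0, :] = 5.0
feats[:, 1, :] = np.arange(4.0)[:, None]
normalized = normalize_over_time(feats)
print(np.abs(normalized[:, 0]).max(), np.abs(normalized[:, 1]).min())
```

Under this reading the static patch maps to zero in every frame while the changing patch keeps a nonzero signal, which matches the stated goal of amplifying what changes within the video.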

Crucially, teachers are only needed during pre-training. At inference and fine-tuning time, LookWhen runs only the selector and extractor, making it both accurate and efficiently deployable.
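The inference path (score every token, keep the top-K, run the extractor on the kept tokens only) can be sketched as follows. This is an illustration rather than the released code: `select_top_k` is a hypothetical helper, random numbers stand in for selector scores, and the 14×14 patch grid and feature width of 8 are assumed toy values. The sketch also assumes the selector emits one score per full-resolution token.

```python
import numpy as np

def select_top_k(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float):
    """Keep the highest-scoring fraction of tokens.

    tokens: (N, D) full-resolution space-time tokens.
    scores: (N,) selector scores, one per token (assumed alignment).
    Returns the kept tokens and their original indices.
    """
    num_keep = max(1, int(round(keep_ratio * len(scores))))
    # argpartition finds the top entries without a full sort.
    top = np.argpartition(scores, -num_keep)[-num_keep:]
    # Restore space-time order so position information stays aligned.
    keep_idx = np.sort(top)
    return tokens[keep_idx], keep_idx

rng = np.random.default_rng(0)
num_tokens = 16 * 14 * 14  # 16 frames, assumed 14x14 patches per frame
tokens = rng.normal(size=(num_tokens, 8))
scores = rng.random(num_tokens)  # stand-in for selector output
kept, keep_idx = select_top_k(tokens, scores, keep_ratio=0.10)
print(kept.shape)  # only these tokens would reach the deep extractor
```

Because self-attention compares every token with every other token, keeping 10% of the tokens cuts the number of pairwise interactions by roughly 100×, while per-token layers shrink by roughly 10×.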

Results

We compare LookWhen against different ViT-B video models and other adaptive-computation baselines applied to our video teacher (IV2+RLT, IV2+vid-TLDR). LookWhen Pareto-dominates on 9 of 12 accuracy-FLOPs evaluations. Realized throughput gains are even larger: 6.7× faster than InternVideo2-B at equal mean accuracy.

Accuracy vs FLOPs across 6 datasets, linear probe and fine-tune

Accuracy vs. inference FLOPs for linear probing (LP) and fine-tuning (FT) across six datasets. LookWhen (● green) Pareto-dominates IV2+RLT (▲ purple) and IV2+vid-TLDR (■ brown) on 9 of 12 panels, with the largest gains in linear probing (e.g., more than 10% higher accuracy on Diving48 LP). It is sometimes even better than IV2 (☆).

Model	Params (M)	K400 FLOPs	K400 Top-1	SSv2 FLOPs	SSv2 Top-1
ViT-B / Swin-B / Mamba-M baselines
UMT-B800e	87	180×4×3	85.7	180×2×3	70.8
VideoMAEv2	87	180×5×3	81.5	180×2×3	71.2
VideoMambaPro	72	392×4×3	84.0	183×4×3	69.4
VideoMAE + RLT	87	120×4×3	80.1	120×4×3	70.2
VideoMAE + LITE	87	46×5×3	78.4	46×2×3	68.3
LookWhen (ours)
LookWhen (70% sparse)	106	108×4×3	84.6	108×2×3	72.0
LookWhen (90% sparse)	106	40×4×3	82.6	40×2×3	69.3

LookWhen achieves the best accuracy-FLOPs trade-off.
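As a worked check on the K400 columns of the table, the sketch below computes total inference cost and the resulting Pareto front. It assumes the FLOPs notation `A×B×C` means GFLOPs per view × temporal clips × spatial crops, which is a common convention but is not stated here. The `dominates` helper is illustrative only.

```python
# (name, GFLOPs per view, clips, crops, K400 top-1), copied from the table.
k400 = [
    ("UMT-B800e", 180, 4, 3, 85.7),
    ("VideoMAEv2", 180, 5, 3, 81.5),
    ("VideoMambaPro", 392, 4, 3, 84.0),
    ("VideoMAE + RLT", 120, 4, 3, 80.1),
    ("VideoMAE + LITE", 46, 5, 3, 78.4),
    ("LookWhen (70% sparse)", 108, 4, 3, 84.6),
    ("LookWhen (90% sparse)", 40, 4, 3, 82.6),
]

def total_cost(row):
    """Total GFLOPs over all views (assumed meaning of AxBxC)."""
    return row[1] * row[2] * row[3]

def dominates(a, b):
    """True if a is no worse than b on both axes and better on one."""
    cost_a, acc_a = total_cost(a), a[4]
    cost_b, acc_b = total_cost(b), b[4]
    no_worse = cost_a <= cost_b and acc_a >= acc_b
    strictly_better = cost_a < cost_b or acc_a > acc_b
    return no_worse and strictly_better

# A model is on the Pareto front if no other model dominates it.
front = [r[0] for r in k400 if not any(dominates(o, r) for o in k400)]
print(front)
```

Under that reading of the notation, both LookWhen settings sit on the K400 front together with UMT-B800e, and every other listed model is dominated by at least one of those three.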

Ablation

We study the key design choices behind LookWhen's pre-training: what the selector should learn to predict, and what the extractor should learn to represent. All ablations use 90% sparsity and evaluate with both linear probing (LP) and fine-tuning (FT) across six datasets.

When & where to select: selector targets

(a) Attention vs. token uniqueness. Training the selector to predict top-1 distance (uniqueness in DINOv3 feature space) outperforms selecting highly-attended tokens. (b) Computing token uniqueness. Using K=1 (nearest-neighbor distance) performs best on average.

What to extract: extractor targets

Ablation: extractor pre-training targets

(a) Video-token target. Distilling both InternVideo2's video token and a time-normalized DINOv3 video token (highlighted row) outperforms either teacher alone. (b) Frame and patch-token targets. Adding frame and patch distillation losses to the full IV2+DINOv3 configuration (highlighted row) provides further gains, with the largest improvements on Diving48 (+7.7% LP) and Jester (+10% LP).

Citation

@article{salamatian2026lookwhen,
  title   = {LookWhen? Fast Video Recognition by Learning When, Where, and What to Compute},
  author  = {Salamatian, Ali and Fuller, Anthony and Sarkar, Pritam and Green, James R. and Sigal, Leonid and Shelhamer, Evan},
  journal = {arXiv preprint},
  year    = {2026}
}