Direct Motion Models for Assessing Generated Videos

1Google DeepMind, 2Google Research
(*: equal contribution)
(Teaser figure: a well reconstructed video and a badly reconstructed video, with color coding of reconstruction quality.)

TRAJAN is an autoencoder trained to reconstruct point tracks. It enables the automated evaluation of temporal consistency in generated and corrupted videos. Reconstruction scores are calculated via the Average Jaccard (AJ) metric; higher AJ means better point track reconstruction.
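As a concrete reference, the sketch below shows how Average Jaccard is typically computed in the TAP-Vid point tracking benchmark, which this reconstruction score builds on. The function name, array shapes, and pixel thresholds are illustrative assumptions, not the exact evaluation code.

```python
import numpy as np

def average_jaccard(gt_tracks, gt_visible, pred_tracks, pred_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over pixel thresholds, in the style of TAP-Vid.

    gt_tracks, pred_tracks: [num_points, num_frames, 2] (x, y) positions.
    gt_visible, pred_visible: [num_points, num_frames] boolean visibility.
    """
    dist = np.linalg.norm(gt_tracks - pred_tracks, axis=-1)
    jaccards = []
    for thr in thresholds:
        close = dist < thr
        tp = np.sum(gt_visible & pred_visible & close)     # visible and within threshold
        fp = np.sum(pred_visible & ~(gt_visible & close))  # predicted visible but wrong
        fn = np.sum(gt_visible & ~(pred_visible & close))  # visible points that were missed
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))
```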

Abstract

A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), and reconstruction errors for evaluating the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and which predicts human evaluations of temporal consistency and realism in videos generated by open-source models better than a wide range of alternatives.

Architecture

TRAJAN Architecture

The trajectory encoder encodes a (variable-sized) set of point trajectories into a compressed, fixed-size motion latent using a Perceiver-style transformer architecture. An occlusion flag is used in the attention mask, making the representation invariant to occluded points. Given this latent and a query point, the decoder predicts the point track that passes through the query point at all other times, along with its occlusion flags. By training the autoencoder on different input and query points, the model learns to represent a dense motion field.
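A minimal sketch of this encoder pattern follows, assuming PyTorch. The class name, the one-token-per-track design, the latent count, and the choice to zero out occluded coordinates rather than mask them in attention are all simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PerceiverTrackEncoder(nn.Module):
    """Hypothetical sketch: a fixed set of learned latents cross-attends to a
    variable-sized set of point-track tokens, yielding a fixed-size motion latent."""

    def __init__(self, num_frames, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        # One token per track: (x, y, visible) at every frame, flattened.
        self.track_proj = nn.Linear(num_frames * 3, dim)
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tracks, visible):
        # tracks: [B, N, T, 2] point coordinates; visible: [B, N, T] flags.
        vis = visible.float().unsqueeze(-1)
        # Zero out occluded coordinates so the encoding ignores where a point
        # "goes" while occluded (a simplification of the paper's attention mask).
        feats = torch.cat([tracks * vis, vis], dim=-1)          # [B, N, T, 3]
        tokens = self.track_proj(feats.flatten(2))              # [B, N, dim]
        queries = self.latents.expand(tracks.shape[0], -1, -1)  # [B, L, dim]
        latent, _ = self.cross_attn(queries, tokens, tokens)
        return latent                                           # fixed-size motion latent
```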

TRAJAN captures human evaluations of temporal consistency and realism in generated videos

Example videos from the EvalCrafter [3] and VideoPhy [4] datasets, covering 15 different generative video models. TRAJAN captures human judgements of motion consistency, appearance consistency, overall realism, and object interaction realism for 100 videos sampled from each of these 15 models better than all tested alternatives.

Results
Spearman rank correlation coefficients between human ratings and automated metrics for a subset of videos from EvalCrafter and VideoPhy (higher is better). Inter-rater sigma is the standard deviation of human responses (lower is better).
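For reference, a correlation of this kind can be computed with scipy. The ratings and scores below are made-up placeholder numbers, not values from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-video values: human ratings of motion consistency and the
# corresponding automated metric scores (e.g., TRAJAN reconstruction AJ).
human_ratings = np.array([4.0, 2.5, 3.0, 1.0, 4.5])
metric_scores = np.array([0.81, 0.55, 0.60, 0.20, 0.90])

rho, p = spearmanr(human_ratings, metric_scores)
print(f"Spearman rank correlation: {rho:.3f} (p = {p:.3f})")
```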

Localizing generated video errors in space and time

The Average Jaccard reconstruction score from TRAJAN (higher AJ means higher reconstruction quality) can be used to detect inconsistencies in generated videos at specific points in space and at specific moments in time. Here it detects the moment when the glove's appearance morphs, and spatially localizes the glove as the problematic region of the video.

From left to right: full generated video, Average Jaccard (reconstruction quality) over time, Average Jaccard shown for each point track around the most poorly reconstructed moment in time. Red indicates worse reconstruction, blue indicates better reconstruction.
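A minimal sketch of this localization step is given below, assuming a hypothetical per-point, per-frame reconstruction score as input (the paper reports Average Jaccard aggregates of such scores; the window size is an arbitrary choice).

```python
import numpy as np

def localize_reconstruction_errors(point_scores, window=5):
    """point_scores: [num_points, num_frames] per-point reconstruction quality
    in [0, 1] (hypothetical intermediate; higher is better).

    Returns the per-frame score curve, the most poorly reconstructed frame,
    and a per-track score averaged around that frame (for a spatial map)."""
    per_frame = point_scores.mean(axis=0)             # reconstruction quality over time
    worst_t = int(per_frame.argmin())                 # most poorly reconstructed moment
    lo, hi = max(0, worst_t - window), worst_t + window + 1
    per_track = point_scores[:, lo:hi].mean(axis=1)   # per-point quality near that moment
    return per_frame, worst_t, per_track
```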

Comparing motion similarities between videos with TRAJAN

A comparison of the motion in a generated video to its real counterpart. Videos are generated with WALT [2].

Motion similar: the distance between the real and generated videos of the crowd in the TRAJAN embedding space is low, since they have similar motion despite differences in object appearance.
Motion different: the distance between the real and generated videos of the man in the TRAJAN embedding space is high, since they have different motion despite similarities in overall appearance.
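A minimal sketch of such a comparison, assuming each video has already been encoded into a fixed-size TRAJAN motion latent; the Euclidean distance here is an illustrative choice, not necessarily the distance used in the paper.

```python
import numpy as np

def motion_distance(latent_real, latent_gen):
    """Distance between two videos in the motion embedding space.

    latent_real, latent_gen: fixed-size motion latents, e.g. [num_latents, dim]."""
    return float(np.linalg.norm(latent_real.ravel() - latent_gen.ravel()))

# Low distance -> similar motion (e.g., the crowd pair above);
# high distance -> different motion (e.g., the man pair above).
```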

TRAJAN also identifies synthetic spatiotemporal corruptions better than several alternatives

(Figure: an original UCF-101 video alongside spatially corrupted (S) and spatiotemporally corrupted (ST) versions, each shown with the full video, its AJ over time, and a detailed score.)
Comparing different methods for detecting temporal distortions on the UCF-101 dataset, using the corruptions introduced in [1], in terms of (left) Fréchet distances in latent space and (right) per-video ordinal scores. Temporal sensitivity is measured as the ratio of the distance under spatiotemporal (ST) corruptions to the distance under spatial (S) corruptions. TRAJAN is particularly sensitive to temporal distortions, while motion histograms and appearance-based methods perform worse.
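For reference, the Fréchet distance between Gaussians fit to two sets of latent features (the construction behind FID/FVD-style metrics) can be sketched as follows; feature extraction is assumed to have happened already, and the variable names are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two feature sets [N, dim]."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

# Temporal sensitivity as used above: the distance under ST corruptions
# divided by the distance under S corruptions (both against the originals), e.g.
# sensitivity = frechet_distance(real, st) / frechet_distance(real, s)
```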