Direct Motion Models for Assessing Generated Videos

1Google DeepMind, 2Google Research
(*: equal contribution)
(Teaser figure: a well reconstructed video and a badly reconstructed video, with color coding of reconstruction quality.)

TRAJAN is an autoencoder trained to reconstruct point tracks. It enables the automated evaluation of temporal consistency in generated and corrupted videos. Reconstruction scores are calculated via the Average Jaccard (AJ) metric; higher AJ means better point track reconstruction.
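As a concrete reference, the sketch below shows how Average Jaccard is typically computed in the TAP-Vid point tracking benchmark, which this reconstruction score builds on. The function name, array shapes, and pixel thresholds are illustrative assumptions, not the exact evaluation code.

```python
import numpy as np

def average_jaccard(gt_tracks, gt_visible, pred_tracks, pred_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Average Jaccard over pixel thresholds, in the style of TAP-Vid.

    gt_tracks, pred_tracks: [num_points, num_frames, 2] (x, y) positions.
    gt_visible, pred_visible: [num_points, num_frames] boolean visibility.
    """
    dist = np.linalg.norm(gt_tracks - pred_tracks, axis=-1)
    jaccards = []
    for thr in thresholds:
        close = dist < thr
        tp = np.sum(gt_visible & pred_visible & close)     # visible and within threshold
        fp = np.sum(pred_visible & ~(gt_visible & close))  # predicted visible but wrong
        fn = np.sum(gt_visible & ~(pred_visible & close))  # visible points that were missed
        jaccards.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(jaccards))
```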

Abstract

A current limitation of generative video models is that they produce plausible-looking frames but poor motion, an issue that is not well captured by FVD and other popular methods for evaluating generated videos. Here we go beyond FVD by developing a metric that better measures plausible object interactions and motion. Our novel approach is based on auto-encoding point tracks and yields motion features that can be used to compare distributions of videos (as few as one generated and one ground truth, or as many as two datasets), and reconstruction errors for evaluating the motion of single videos. We show that using point tracks instead of pixel reconstruction or action recognition features results in a metric which is markedly more sensitive to temporal distortions in synthetic data, and which predicts human evaluations of temporal consistency and realism in videos generated by open-source models better than a wide range of alternatives.

Architecture

TRAJAN Architecture

The trajectory encoder encodes a (variable-sized) set of point trajectories into a compressed, fixed-size motion latent using a Perceiver-style transformer architecture. An occlusion flag is used in the attention mask, making the representation invariant to occluded points. Given this latent and a query point, the decoder predicts the point track that passes through the query point at all other times, along with its occlusion flags. By training the autoencoder on different input and query points, the model learns to represent a dense motion field.
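A minimal sketch of this encoder pattern follows, assuming PyTorch. The class name, the one-token-per-track design, the latent count, and the choice to zero out occluded coordinates rather than mask them in attention are all simplifying assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PerceiverTrackEncoder(nn.Module):
    """Hypothetical sketch: a fixed set of learned latents cross-attends to a
    variable-sized set of point-track tokens, yielding a fixed-size motion latent."""

    def __init__(self, num_frames, dim=256, num_latents=64, num_heads=8):
        super().__init__()
        # One token per track: (x, y, visible) at every frame, flattened.
        self.track_proj = nn.Linear(num_frames * 3, dim)
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tracks, visible):
        # tracks: [B, N, T, 2] point coordinates; visible: [B, N, T] flags.
        vis = visible.float().unsqueeze(-1)
        # Zero out occluded coordinates so the encoding ignores where a point
        # "goes" while occluded (a simplification of the paper's attention mask).
        feats = torch.cat([tracks * vis, vis], dim=-1)          # [B, N, T, 3]
        tokens = self.track_proj(feats.flatten(2))              # [B, N, dim]
        queries = self.latents.expand(tracks.shape[0], -1, -1)  # [B, L, dim]
        latent, _ = self.cross_attn(queries, tokens, tokens)
        return latent                                           # fixed-size motion latent
```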

TRAJAN captures human evaluations of temporal consistency and realism in generated videos

Example videos from the EvalCrafter [3] and VideoPhy [4] datasets, covering 15 different generative video models. TRAJAN captures human judgements of motion consistency, appearance consistency, overall realism, and object interaction realism for 100 videos sampled from each of these 15 models better than all tested alternatives.

Results
Spearman rank correlation coefficients between human ratings and automated metrics for a subset of videos from EvalCrafter and VideoPhy (higher is better). Inter-rater sigma is the standard deviation of human responses (lower is better).
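For reference, a correlation of this kind can be computed with scipy. The ratings and scores below are made-up placeholder numbers, not values from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder per-video values: human ratings of motion consistency and the
# corresponding automated metric scores (e.g., TRAJAN reconstruction AJ).
human_ratings = np.array([4.0, 2.5, 3.0, 1.0, 4.5])
metric_scores = np.array([0.81, 0.55, 0.60, 0.20, 0.90])

rho, p = spearmanr(human_ratings, metric_scores)
print(f"Spearman rank correlation: {rho:.3f} (p = {p:.3f})")
```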

Localizing generated video errors in space and time

The Average Jaccard reconstruction score from TRAJAN (higher AJ means higher reconstruction quality) can be used to detect inconsistencies in generated videos at specific points in space and at specific moments in time. Here it detects the moment when the glove's appearance morphs, and spatially localizes the glove as the problematic region of the video.

From left to right: full generated video, Average Jaccard (reconstruction quality) over time, Average Jaccard shown for each point track around the most poorly reconstructed moment in time. Red indicates worse reconstruction, blue indicates better reconstruction.
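A minimal sketch of this localization step is given below, assuming a hypothetical per-point, per-frame reconstruction score as input (the paper reports Average Jaccard aggregates of such scores; the window size is an arbitrary choice).

```python
import numpy as np

def localize_reconstruction_errors(point_scores, window=5):
    """point_scores: [num_points, num_frames] per-point reconstruction quality
    in [0, 1] (hypothetical intermediate; higher is better).

    Returns the per-frame score curve, the most poorly reconstructed frame,
    and a per-track score averaged around that frame (for a spatial map)."""
    per_frame = point_scores.mean(axis=0)             # reconstruction quality over time
    worst_t = int(per_frame.argmin())                 # most poorly reconstructed moment
    lo, hi = max(0, worst_t - window), worst_t + window + 1
    per_track = point_scores[:, lo:hi].mean(axis=1)   # per-point quality near that moment
    return per_frame, worst_t, per_track
```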

Comparing motion similarities between videos with TRAJAN

A comparison of the motion in a generated video to its real counterpart. Videos are generated with WALT [2].

Motion similar: the distance between the real and generated videos of the crowd in the TRAJAN embedding space is low, since they have similar motion despite differences in object appearance.
Motion different: the distance between the real and generated videos of the man in the TRAJAN embedding space is high, since they have different motion despite similarities in overall appearance.
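A minimal sketch of such a comparison, assuming each video has already been encoded into a fixed-size TRAJAN motion latent; the Euclidean distance here is an illustrative choice, not necessarily the distance used in the paper.

```python
import numpy as np

def motion_distance(latent_real, latent_gen):
    """Distance between two videos in the motion embedding space.

    latent_real, latent_gen: fixed-size motion latents, e.g. [num_latents, dim]."""
    return float(np.linalg.norm(latent_real.ravel() - latent_gen.ravel()))

# Low distance -> similar motion (e.g., the crowd pair above);
# high distance -> different motion (e.g., the man pair above).
```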

TRAJAN also identifies synthetic spatiotemporal corruptions better than several alternatives

(Figure: an original UCF-101 video alongside spatially corrupted (S) and spatiotemporally corrupted (ST) versions, each shown with the full video, its AJ over time, and a detailed score.)
Comparing different methods for detecting temporal distortions on the UCF-101 dataset, using the corruptions introduced in [1], in terms of (left) Fréchet distances in latent space and (right) per-video ordinal scores. Temporal sensitivity is measured as the ratio of the distance under spatiotemporal (ST) corruptions to the distance under spatial (S) corruptions. TRAJAN is particularly sensitive to temporal distortions, while motion histograms and appearance-based methods perform worse.
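For reference, the Fréchet distance between Gaussians fit to two sets of latent features (the construction behind FID/FVD-style metrics) can be sketched as follows; feature extraction is assumed to have happened already, and the variable names are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fit to two feature sets [N, dim]."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a + cov_b - 2.0 * covmean))

# Temporal sensitivity as used above: the distance under ST corruptions
# divided by the distance under S corruptions (both against the originals), e.g.
# sensitivity = frechet_distance(real, st) / frechet_distance(real, s)
```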