GANs trained on a single video can perform a variety of video generation and manipulation tasks. However, these single-video GANs require an unreasonable amount of training time per video, rendering them almost impractical. In this paper we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. Inspired by Granot et al. (2021), we revive classical space-time patch nearest-neighbor approaches and adapt them to a scalable, unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is disproportionately faster (runtime reduced from several days to seconds). Beyond diverse video generation, we demonstrate other applications of the same framework, including video analogies and spatio-temporal retargeting. Our approach scales easily to Full-HD videos. These observations show that classical approaches, if adapted correctly, significantly outperform heavy deep-learning machinery for these tasks. This sets a new baseline for single-video generation and manipulation tasks, and, no less important, makes diverse generation from a single video practically possible for the first time.
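At its core, the generation process repeatedly replaces space-time patches of a candidate output with their nearest neighbors from the input video, so that every local space-time patch of the result also appears somewhere in the input. Below is a minimal, single-scale NumPy sketch of that projection idea; the actual method works coarse-to-fine over a spatio-temporal pyramid with a much faster (approximate) nearest-neighbor search, and the names `extract_patches` and `patch_nn_project` are illustrative, not the released API.

```python
# Minimal single-scale sketch of space-time patch nearest-neighbor generation.
# Brute-force and slow on purpose (for illustration only); the real method runs
# coarse-to-fine with an efficient approximate nearest-neighbor search.
import numpy as np

def extract_patches(video, p=5, t=3):
    """Collect all overlapping t x p x p space-time patches as flat vectors."""
    T, H, W, C = video.shape
    patches = []
    for ti in range(T - t + 1):
        for yi in range(H - p + 1):
            for xi in range(W - p + 1):
                patches.append(video[ti:ti + t, yi:yi + p, xi:xi + p].ravel())
    return np.stack(patches)

def patch_nn_project(query_video, key_video, p=5, t=3):
    """Replace each patch of `query_video` by its nearest (L2) patch from
    `key_video`, then average the overlapping contributions."""
    keys = extract_patches(key_video, p, t)
    T, H, W, C = query_video.shape
    out = np.zeros(query_video.shape, dtype=np.float64)
    weight = np.zeros((T, H, W, 1))
    for ti in range(T - t + 1):
        for yi in range(H - p + 1):
            for xi in range(W - p + 1):
                q = query_video[ti:ti + t, yi:yi + p, xi:xi + p].ravel()
                nn = keys[np.argmin(((keys - q) ** 2).sum(axis=1))]
                out[ti:ti + t, yi:yi + p, xi:xi + p] += nn.reshape(t, p, p, C)
                weight[ti:ti + t, yi:yi + p, xi:xi + p] += 1.0
    return out / weight

# Diverse generation: perturb the input with noise (conceptually, at the coarsest
# pyramid level), then repeatedly project it back onto the input's own patch set.
rng = np.random.default_rng(0)
video = rng.random((6, 24, 24, 3))                 # stand-in for a (downscaled) input video
sample = video + 0.75 * rng.standard_normal(video.shape)
for _ in range(3):
    sample = patch_nn_project(sample, video)
```

Since no parameters are trained, the whole procedure reduces to nearest-neighbor search and averaging, which is what enables the runtimes reported in the comparisons below.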
Analogies between all pairs of the 4 input videos, which appear on the diagonal (marked in red).
Each generated video is the combination of an input video from its row (layout) and an input video from its column (appearance).
When the spatio-temporal layout is taken from a sketch we call it Sketch-to-Video:
Below are results for the same "style" video (appearance & dynamics) and different spatio-temporal layouts (morphed MNIST digits):
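The analogies and sketch-to-video results above follow the same recipe; only the source of the query and key patches changes. Here is a hedged sketch, reusing `patch_nn_project` from the snippet above (again illustrative, not the released code; `layout_video` and `appearance_video` are stand-in arrays):

```python
# Video analogies (illustrative): keep the spatio-temporal layout of one video while
# drawing every output patch from another. Reuses patch_nn_project from the sketch above.
import numpy as np
rng = np.random.default_rng(1)
layout_video = rng.random((6, 24, 24, 3))       # stand-in: video supplying the layout
appearance_video = rng.random((6, 24, 24, 3))   # stand-in: video supplying appearance/dynamics

analogy = patch_nn_project(layout_video, appearance_video)   # initialize from the layout
for _ in range(2):                                           # a few refinement passes
    analogy = patch_nn_project(analogy, appearance_video)    # patches come only from "appearance"
```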
Please click here for Ablations and comparisons
Please watch in full-screen by pressing the icon (bottom-right in each video).
Note the differences in quality (both resolution and artifacts) and in the time it took to generate each sample.
(All videos used in Table 1 are here)
Input Video (1280x1920) | HP-VAE-GAN (144x256), 8 days training | Ours (1280x1920), 9 mins per video | Ours (144x256), 18 secs per video
Input Video (1280x1920) | HP-VAE-GAN (144x256), 8 days training | Ours (1280x1920), 9 mins per video | Ours (144x256), 18 secs per video
Input Video (720x1280) | HP-VAE-GAN (144x256), 8 days training | Ours (720x1280), 6 mins per video | Ours (144x256), 18 secs per video
Input Video (1280x1920) | HP-VAE-GAN (144x256), 8 days training | Ours (1280x1920), 9 mins per video | Ours (144x256), 18 secs per video
The input video is at the top-left.
The rest are retargeted to different spatial shapes.
Top: Input video (4 examples)
Bottom: A summary of the input video (note that the video is shorter)
Watch the top video first, then the bottom. Note, for example, how in the top video the trainer and the dog rotate sequentially, whereas in the bottom they rotate simultaneously.
Top: Input video (3 examples)
Bottom: An extension of the input video (note that the video is longer)
Watch the top video first, then the bottom. Note, for example, how in the ballet video the choreography is longer while the pace of the motions remains the same.
5 examples total
Top: Original video
Middle: Input video (algorithm only sees this)
Bottom: Output
Note how a blue cue is replaced by a player from Barcelona and a white cue by a player from Real Madrid
Top-left: Input video. The rest are randomly generated.
Inputs with significant depth variation and large camera motion suffer from non-rigid deformations.
@inproceedings{haim2022diverse,
title={Diverse generation from a single video made possible},
author={Haim, Niv and Feinstein, Ben and Granot, Niv and Shocher, Assaf and Bagon, Shai and Dekel, Tali and Irani, Michal},
booktitle={European Conference on Computer Vision},
pages={491--509},
year={2022},
organization={Springer}
}