VGPNN: Diverse Generation from a Single Video Made Possible

Weizmann Institute of Science, Rehovot, Israel
* Equal contribution
(This page contains many videos; please give it a minute to load.)

Diverse Generation from a Single Video

Input video (top-left) marked in red. All others are generated.

Abstract

GANs can perform generation and manipulation tasks when trained on a single video. However, these single-video GANs require an unreasonable amount of training time per video, rendering them almost impractical. In this paper we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. Inspired by Granot et al. (2021), we revive classical space-time patch-nearest-neighbor approaches and adapt them to a scalable unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is disproportionately faster (runtime reduced from several days to seconds). Beyond diverse video generation, we demonstrate other applications within the same framework, including video analogies and spatio-temporal retargeting. Our proposed approach easily scales to Full-HD videos. These observations show that classical approaches, if adapted correctly, significantly outperform heavy deep-learning machinery for these tasks. This sets a new baseline for single-video generation and manipulation tasks, and, no less important, makes diverse generation from a single video practically possible for the first time.
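
To make the idea concrete, below is a minimal sketch (not the authors' code) of a single space-time patch-nearest-neighbor step in NumPy, assuming videos are float arrays of shape (T, H, W, C): every 3D patch of a query video is replaced by its most similar patch from the input video, and overlapping patches are averaged back into a video. The function names, patch sizes, and strides are illustrative; the coarse-to-fine pyramid, noise injection, and the normalized patch-distance used in the paper are omitted for brevity.

```python
import numpy as np


def extract_patches(video, ps=(3, 7, 7), stride=(1, 2, 2)):
    """Flatten all space-time patches of a (T, H, W, C) float video into rows."""
    T, H, W, _ = video.shape
    pt, ph, pw = ps
    st, sh, sw = stride
    patches, positions = [], []
    for t in range(0, T - pt + 1, st):
        for y in range(0, H - ph + 1, sh):
            for x in range(0, W - pw + 1, sw):
                patches.append(video[t:t + pt, y:y + ph, x:x + pw].ravel())
                positions.append((t, y, x))
    return np.stack(patches), positions


def pnn_step(query_video, input_video, ps=(3, 7, 7), stride=(1, 2, 2)):
    """Replace each patch of `query_video` with its nearest neighbor from
    `input_video`, then fold the matches back, averaging overlapping voxels."""
    q, q_pos = extract_patches(query_video, ps, stride)
    k, _ = extract_patches(input_video, ps, stride)

    # Brute-force squared-L2 nearest neighbors (chunked/approximate in practice).
    dists = (q ** 2).sum(1, keepdims=True) - 2.0 * q @ k.T + (k ** 2).sum(1)
    nn = dists.argmin(1)

    # Aggregate matched patches; border voxels not covered by any patch stay zero.
    out = np.zeros_like(query_video, dtype=np.float64)
    weight = np.zeros(query_video.shape[:3] + (1,))
    pt, ph, pw = ps
    for i, (t, y, x) in enumerate(q_pos):
        out[t:t + pt, y:y + ph, x:x + pw] += k[nn[i]].reshape(pt, ph, pw, -1)
        weight[t:t + pt, y:y + ph, x:x + pw] += 1.0
    return out / np.maximum(weight, 1.0)
```

For unconditional generation, a step like this would typically be applied to a noise-perturbed version of the input at the coarsest pyramid level and then repeated at progressively finer resolutions, each time using the upsampled result as the query; this is only a simplified reading of the patch-nearest-neighbor idea the abstract refers to.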


Video Analogies

Analogies between all pairs of the 4 Input Videos on the diagonal (marked red)
Each generated video is the combination of an input video from its row (layout) and an input video from its column (appearance).


Sketch to Video

When the spatio-temporal layout is taken from a sketch, we call it Sketch-to-Video:

Below are results for the same "style" video (appearance & dynamics) and different spatio-temporal layouts (morphed MNIST digits):



Video Analogies: Ablations and Comparisons

Please click here for ablations and comparisons.


Qualitative Comparison

Please watch in full-screen by pressing the icon (bottom-right in each video).
Note the differences in quality (both resolution and artifacts) and the time it took to generate each sample.
(All videos used in Table 1 are here.)



Spatial Retargeting

The input video is in the top-left.
The rest are retargeted to different spatial shapes.


Temporal Retargeting - Video Shortened (“Summarization”)

Top: Input video (4 examples)
Bottom: A summary of the input video (note that it is shorter)
Watch the top video first, then the bottom. Note, for example, how the trainer and the dog rotate sequentially in the top video but simultaneously in the bottom one.


Temporal Retargeting - Video Extended

Top: Input video (3 examples)
Bottom: An extension of the input video (note that the video is longer)
Watch the top video first, then the bottom. Note, for example, how in the ballet video the choreography is longer but the pace of the motions remains the same.


Conditional Inpainting

5 examples total
Top: Original video
Middle: Input video (algorithm only sees this)
Bottom: Output
Note how a blue cue is replaced by a player from Barcelona and a white cue by a player from Real Madrid

Limitations

Top-Left: Input Video. Others are randomly generated.
Inputs with significant depth variation and large camera motion suffer from non-rigid deformations.


BibTeX

@inproceedings{haim2022diverse,
  title={Diverse generation from a single video made possible},
  author={Haim, Niv and Feinstein, Ben and Granot, Niv and Shocher, Assaf and Bagon, Shai and Dekel, Tali and Irani, Michal},
  booktitle={European Conference on Computer Vision},
  pages={491--509},
  year={2022},
  organization={Springer}
}