VGPNN: Diverse Generation from a Single Video Made Possible

Weizmann Institute of Science, Rehovot, Israel
* Equal contribution

Generated Video Samples from a Single Video

Original video (top-left); all others are generated.


GANs can perform generation and manipulation tasks when trained on a single video. However, these single-video GANs require an unreasonable amount of training time per video, rendering them almost impractical. In this paper we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. Inspired by Granot et al. (2021), we revive classical space-time patch nearest-neighbor approaches and adapt them into a scalable unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is dramatically faster (runtime reduced from several days to seconds). Beyond diverse video generation, we demonstrate other applications of the same framework, including video analogies and spatio-temporal retargeting. Our approach scales easily to Full-HD videos. These observations show that classical approaches, if adapted correctly, significantly outperform heavy deep-learning machinery on these tasks. This sets a new baseline for single-video generation and manipulation tasks, and -- no less important -- makes diverse generation from a single video practically possible for the first time.
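The core operation behind a patch nearest-neighbor generator can be sketched as follows. This is a minimal, single-scale illustration with brute-force exact L2 search and overlap averaging; the function names and aggregation scheme are our own simplifications, not the paper's implementation, which works coarse-to-fine over a spatio-temporal pyramid with far faster search.

```python
import numpy as np

def extract_patches(video, p=3):
    """Extract all overlapping p x p x p space-time patches.

    video: float array of shape (T, H, W, C).
    Returns one flattened patch per valid position, in scan order.
    """
    T, H, W, C = video.shape
    patches = []
    for t in range(T - p + 1):
        for y in range(H - p + 1):
            for x in range(W - p + 1):
                patches.append(video[t:t+p, y:y+p, x:x+p].ravel())
    return np.stack(patches)

def nearest_neighbor_replace(query_video, key_video, p=3):
    """Replace every patch of `query_video` with its nearest patch
    from `key_video` (exact L2 search), averaging overlapping votes."""
    keys = extract_patches(key_video, p)
    queries = extract_patches(query_video, p)
    # Pairwise squared distances; argmin gives each query's nearest key.
    d2 = ((queries**2).sum(1)[:, None]
          - 2.0 * queries @ keys.T
          + (keys**2).sum(1)[None])
    nn = d2.argmin(1)

    # Paste the chosen patches back and average the overlaps.
    T, H, W, C = query_video.shape
    out = np.zeros((T, H, W, C), dtype=np.float64)
    w = np.zeros((T, H, W, 1))
    i = 0
    for t in range(T - p + 1):
        for y in range(H - p + 1):
            for x in range(W - p + 1):
                out[t:t+p, y:y+p, x:x+p] += keys[nn[i]].reshape(p, p, p, C)
                w[t:t+p, y:y+p, x:x+p] += 1
                i += 1
    return out / w
```

Diversity in the actual method comes from perturbing the coarse-scale query (e.g. with noise) before this replace-and-aggregate step refines it against the original video's patches at each scale.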

Video Analogies

Analogies between all pairs of the four input videos shown on the diagonal.
Each generated video is the combination of an input video from its row (spatio-temporal layout) and an input video from its column (appearance).

Sketch to Video

When the spatio-temporal layout is taken from a sketch, we call it Sketch-to-Video:

Here are the results of our video analogies using the same style video as above, with different content videos (smooth transitions of MNIST digits from one to the next):

Retargeting over Spatial Dimension

The input video is at the top-left. The rest are retargeted to different spatial shapes.

Retargeting over Temporal Dimension

Top: original video.
Bottom: temporally extended or summarized versions.

Conditional Inpainting

We can add or remove parts of the video by marking the relevant region with a color; that color guides our method to replace the region with similarly colored content from the video.
Original video on top; the middle video is the input to the algorithm.
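One way such color guidance can work (a toy, single-scale reading of the idea, not the paper's exact procedure): the colored marking is simply left in the input video, and plain L2 patch matching against unmarked patches then naturally favors similarly colored source content. All names below are our own illustrative choices.

```python
import numpy as np

def color_guided_fill(video, mask, p=3):
    """Toy sketch of color-guided conditional inpainting.

    `video` (T, H, W, C) already contains the colored marking;
    `mask` (T, H, W) is True at marked pixels. Each patch touching
    the mask is replaced by its nearest unmarked patch under L2,
    so the marking's color steers which content gets copied in.
    """
    T, H, W, C = video.shape
    coords, patches, marked = [], [], []
    for t in range(T - p + 1):
        for y in range(H - p + 1):
            for x in range(W - p + 1):
                coords.append((t, y, x))
                patches.append(video[t:t+p, y:y+p, x:x+p].ravel())
                marked.append(mask[t:t+p, y:y+p, x:x+p].any())
    patches = np.stack(patches)
    marked = np.array(marked)
    keys = patches[~marked]  # clean source patches only

    out = video.astype(np.float64).copy()
    acc = np.zeros((T, H, W, C))
    w = np.zeros((T, H, W, 1))
    for (t, y, x), q, m in zip(coords, patches, marked):
        if not m:
            continue
        d2 = ((keys - q) ** 2).sum(axis=1)
        best = keys[d2.argmin()].reshape(p, p, p, C)
        acc[t:t+p, y:y+p, x:x+p] += best
        w[t:t+p, y:y+p, x:x+p] += 1
    # Average overlapping replacements; keep untouched pixels as-is.
    filled = w[..., 0] > 0
    out[filled] = (acc / np.maximum(w, 1e-9))[filled]
    return out
```

In the full method this matching runs within the same coarse-to-fine pipeline as generation, so the filled region stays temporally coherent rather than being patched frame by frame.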


BibTeX

@article{haim2021diverse,
  author  = {Haim, Niv and Feinstein, Ben and Granot, Niv and Shocher, Assaf and Bagon, Shai and Dekel, Tali and Irani, Michal},
  title   = {Diverse Generation from a Single Video Made Possible},
  journal = {arXiv preprint arXiv:2109.08591},
  year    = {2021},
}