Alibaba’s research team has unveiled AtomoVideo, a high-fidelity framework for image-to-video generation. The team has released a paper and image-to-video examples of AtomoVideo alongside comparison samples from Runway’s Gen-2 and Pika 1.0.

Simpler but less artifact-ridden video output

Keeping in mind that AtomoVideo is a first-generation product, the provided samples look promising, though they are still far from realistic. Surprisingly, a comparison with Runway’s second-generation model (Gen-1 was released in February 2023) shows that the just-unveiled model does a better job of avoiding jarring transitions between frames.

For example, in a comparison sample of an astronaut in space, the reflective visor simply vanishes from Gen-2’s output as the astronaut moves around. AtomoVideo keeps the motion comparatively simple, but it avoids that artifact. In another sample, Gen-2 depicts people vanishing while skiing on the snow, while Pika 1.0 produces movement on the slope that is hard to reconcile with physics. AtomoVideo again keeps things relatively simple yet manages to avoid such mistakes. Nonetheless, these comparison samples are most likely curated rather than randomly generated.

Key features of Alibaba’s AtomoVideo

AtomoVideo’s strengths include maintaining high fidelity to the input image, ensuring smooth motion transitions, and supporting the prediction of subsequent video frames. The framework is also compatible with various existing T2I (text-to-image) models and offers high semantic controllability, letting users customize video content to their specific preferences.

AtomoVideo achieves its remarkable performance by leveraging pre-trained T2I models as a foundation and enhancing them with one-dimensional spatiotemporal convolution and attention modules. These additional layers enable the framework to capture intricate details and styles while ensuring temporal consistency throughout the generated videos. By incorporating advanced image semantics through Cross-Attention mechanisms, AtomoVideo further enhances its ability to produce videos with precise semantic control.
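The general pattern the paragraph describes, inserting temporal layers into a frozen, pre-trained spatial (T2I) backbone, can be sketched as a small PyTorch module. This is a minimal illustration of the idea, not AtomoVideo’s actual code: the class name, shapes, and layer choices are assumptions, and the real framework also wires image semantics in through Cross-Attention, which is omitted here.

```python
import torch
import torch.nn as nn

class TemporalBlock(nn.Module):
    """Hypothetical adapter inserted after a frozen spatial T2I layer:
    a 1D convolution and self-attention that act only along the time axis,
    so the pre-trained spatial weights stay untouched."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # 1D convolution mixes information across neighboring frames.
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        # Self-attention over frames helps distant frames stay consistent.
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so both ops run per-pixel over time.
        y = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal_conv(y)          # convolve along the time axis
        y = self.norm(y.permute(0, 2, 1))  # -> (b*h*w, frames, channels)
        attn_out, _ = self.temporal_attn(y, y, y)
        y = (y + attn_out).permute(0, 2, 1).reshape(b, h, w, c, t)
        y = y.permute(0, 4, 3, 1, 2)       # back to (batch, frames, channels, h, w)
        # Residual connection: the frozen spatial features pass through unchanged.
        return x + y

# Usage: a 5-frame, 8-channel feature map keeps its shape through the block.
block = TemporalBlock(channels=8)
video_features = torch.randn(2, 5, 8, 4, 4)
out = block(video_features)
```

The residual connection is the key design choice: because the temporal path is additive, the pre-trained T2I backbone can remain frozen while only the new layers learn motion, which is consistent with the article’s claim that the framework stays compatible with various existing T2I models.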

Despite the impressive capabilities demonstrated by AtomoVideo, the research team has yet to provide an online platform for users to experience the technology firsthand. Nonetheless, Alibaba’s AtomoVideo framework represents a significant addition to the field of image-to-video synthesis.
