Google VideoPoet: A New AI Tool to Create Videos from Text

The generative AI space has exploded over the past couple of years. OpenAI’s ChatGPT, a text-to-text LLM (Large Language Model), brought the technology to a mass audience, while tools like DALL-E and Midjourney let us turn text into images.

If you thought that was impressive, you would certainly be amazed to hear about this new tool called Google VideoPoet, an AI-powered tool that is set to transform content creation further.

Although it is currently unavailable to the public, the preview Google unveiled in December 2023 suggests that VideoPoet will open new doors for storytelling.

Let’s take a deep dive into Google VideoPoet’s features and capabilities. But first, we need to understand how the tool works.

What Exactly is Google VideoPoet?

VideoPoet is a simple modeling method that aims to generate videos zero-shot from text prompts. The two main components of the tool are a pre-trained MAGVIT V2 video tokenizer and a SoundStream audio tokenizer.

The MAGVIT V2 video tokenizer takes all the different bits of a video clip and turns them into a sequence of discrete codes (tokens) that is compatible with text-based language models. The SoundStream audio tokenizer performs the same task for audio clips.

Another essential component of the tool is the autoregressive language model. This is like the brain of VideoPoet. It learns from all the videos, pictures, and sounds passed on by the video and audio tokenizers, and it creates videos based on the prompt.
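
To make the idea concrete, here is a toy sketch of the tokenize-then-predict pattern. Everything in it is hypothetical and is not VideoPoet's code: the two tokenizers are stand-ins that simply map numbers into separate ID ranges of one shared vocabulary, and the "model" is a bigram counter rather than a neural network. It only illustrates how video and audio can become one token sequence that a single autoregressive model continues token by token.

```python
from collections import Counter, defaultdict

# Hypothetical vocabulary layout: each modality gets its own ID range.
VIDEO_OFFSET = 0    # video tokens occupy IDs 0..255
AUDIO_OFFSET = 256  # audio tokens occupy IDs 256..511


def tokenize_video(pixels):
    """Stand-in for a video tokenizer: map values to video token IDs."""
    return [VIDEO_OFFSET + (p % 256) for p in pixels]


def tokenize_audio(samples):
    """Stand-in for an audio tokenizer: map values to audio token IDs."""
    return [AUDIO_OFFSET + (s % 256) for s in samples]


class ToyAutoregressiveModel:
    """Bigram counter that predicts the most frequent next token."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sequence):
        # Count which token follows which in the training sequence.
        for current, following in zip(sequence, sequence[1:]):
            self.next_counts[current][following] += 1

    def generate(self, prompt, steps):
        # Append one predicted token at a time, each based on the last token.
        tokens = list(prompt)
        for _ in range(steps):
            candidates = self.next_counts.get(tokens[-1])
            if not candidates:
                break  # no continuation was seen during training
            tokens.append(candidates.most_common(1)[0][0])
        return tokens


video_tokens = tokenize_video([10, 20, 30])
audio_tokens = tokenize_audio([1, 2])
training_sequence = video_tokens + audio_tokens  # one mixed-modality sequence

model = ToyAutoregressiveModel()
model.train(training_sequence)
generated = model.generate(prompt=[10], steps=4)
print(generated)
```

A real system would replace the stand-in tokenizers with learned ones and the bigram counter with a large transformer, but the flow stays the same: tokenize, predict the next tokens, then decode them back into media.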

Such an architecture enables VideoPoet to integrate a mixture of video generation capabilities within a single LLM framework. This is what makes Google’s VideoPoet unique. Other AI video generation tools rely on separately trained components that specialize in each task.

The ingenious design allows VideoPoet to offer versatile features:

Text-to-Video
Image-to-Video
Video Stylization
Inpainting/Outpainting
Video-to-Audio

Google VideoPoet also supports generating videos in a square orientation or portrait mode, ensuring the tool can also be useful for creating short-form content.

Features & Capabilities of Google VideoPoet

1. Text-to-Video

This feature works similarly to other text-to-video tools. You just have to type what you want to see on the screen, and the tool will generate it.

For example, if you give the prompt, “Robot DJ playing the turntable in heavy rain, cyberpunk, neon lights, reflective surfaces”, the tool will generate a video that closely resembles this description.

2. Image-to-Video

Another mind-blowing feature of VideoPoet is its ability to convert a still image into a dynamic video. All you have to do is feed it an image along with a text prompt, and it will generate a video that animates the image to match the prompt.

Suppose the input image is the painting of the Mona Lisa and the prompt is “A woman yawning”. VideoPoet would animate the painting so that the woman appears to yawn.

3. Video Editing

Video editing is yet another unique feature of VideoPoet that allows you to change prompts over time to craft visual narratives.

Let’s understand this with an example. Suppose you start with a short input video.

You can add a prompt to make the story more exciting, and VideoPoet will follow your command to edit the video accordingly.

The prompt is: “Two raccoons on motorbikes. A meteor shower falls behind the raccoons. The meteors impact the earth and explode”.

By default, VideoPoet generates 2-second videos. However, the tool is perfectly capable of generating longer videos.

If you feed VideoPoet a 1-second video, it will analyze it, predict what happens next, and create another 1-second scene. Then it repeats the process, conditioning on the last second of video to predict the next one. This chain reaction continues, adding 1 second at a time, until the video is as long as you want.
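
The chaining idea can be sketched in a few lines. This is a minimal illustration, not VideoPoet's implementation: `predict_next_second` is a hypothetical stand-in that just continues a frame counter, and the frame rate and context length (`FRAMES_PER_SECOND`, `CONTEXT_SECONDS`) are assumptions chosen for the example.

```python
FRAMES_PER_SECOND = 8  # assumed frame rate for this toy example
CONTEXT_SECONDS = 1    # assumed: how much recent video the model sees


def predict_next_second(context_frames):
    """Hypothetical stand-in for the model: continue a frame counter."""
    last = context_frames[-1]
    return [last + i for i in range(1, FRAMES_PER_SECOND + 1)]


def extend_video(seed_frames, target_seconds):
    """Grow a clip one second at a time until it reaches target_seconds."""
    frames = list(seed_frames)
    context_len = CONTEXT_SECONDS * FRAMES_PER_SECOND
    while len(frames) < target_seconds * FRAMES_PER_SECOND:
        context = frames[-context_len:]  # only the most recent frames
        frames.extend(predict_next_second(context))
    return frames


seed = list(range(FRAMES_PER_SECOND))  # a 1-second seed clip: frames 0..7
clip = extend_video(seed, target_seconds=5)
print(len(clip) // FRAMES_PER_SECOND, "seconds")
```

Each pass through the loop sees only a short window of recent frames, which is why keeping objects consistent over many iterations is the hard part for a real model.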

One noteworthy thing about VideoPoet is that despite the short input context, the model retains the identity and characteristics of the objects and the overall scene throughout the extended video. A lot of the other video-generation tools struggle to achieve this level of consistency in their output.

Interactive Video Editing

With its interactive video editing feature, VideoPoet puts you in the director’s chair, allowing you to fine-tune and personalize your videos. It presents you with several candidate outputs that show different motions and actions within the extended scene.

You have the option to further edit the video you choose from the list of candidates.

Controllable Video Editing

VideoPoet goes beyond basic edits and allows you to control the motion of objects and characters. Simply describe the motion you want in a text prompt, and the tool will generate a clip that follows it.

4. Stylization

Apart from giving your characters fancy moves, VideoPoet also enables you to stylize your video in a unique, creative way. You simply describe the artistic style you desire through a text prompt. It can be anything; perhaps you want your video to look like a classic oil painting. Or perhaps a vibrant anime scene.

VideoPoet analyzes your words and applies them to your video, infusing it with the chosen style’s characteristics, colors, and textures.

This feature can also be applied to text-to-video generation. All you have to do is start with a basic prompt and then append the desired style to it.

5. Inpainting and Outpainting

If you feel like your video is missing a piece, or contains something it shouldn’t, VideoPoet lets you erase the unwanted elements and seamlessly fill in the empty spaces. Mask out the area you want to repair, and VideoPoet generates realistic, consistent content that blends with the surrounding video. Outpainting works in the other direction: instead of filling a hole inside the frame, it extends the picture beyond its original borders.
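
The role of the mask can be shown with a tiny sketch. This is a generic illustration of mask-based compositing on a single 3x3 "frame", not a description of VideoPoet's internals, which work on tokens rather than raw pixels. The `generated` values are hard-coded stand-ins for what a model might produce.

```python
def composite_inpaint(original, generated, mask):
    """Keep original pixels where mask is 0; use generated pixels where mask is 1."""
    return [
        [g if m else o for o, g, m in zip(o_row, g_row, m_row)]
        for o_row, g_row, m_row in zip(original, generated, mask)
    ]


original = [
    [1, 1, 1],
    [1, 9, 1],  # 9 marks the unwanted element
    [1, 1, 1],
]
mask = [
    [0, 0, 0],
    [0, 1, 0],  # 1 marks the region to repair
    [0, 0, 0],
]
generated = [   # hypothetical model output for the whole frame
    [5, 5, 5],
    [5, 1, 5],
    [5, 5, 5],
]

result = composite_inpaint(original, generated, mask)
print(result)
```

Only the masked pixel is taken from the generated frame, so everything outside the mask stays exactly as it was in the original.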

Conclusion

Google VideoPoet represents a remarkable leap forward in AI-driven content creation. With its diverse video generation capabilities, VideoPoet empowers creators to bring their visions to life with ease and flexibility.

As we embrace the endless possibilities of this innovative tool, we embark on a journey into a new era of boundless creativity and storytelling potential.
