Microsoft VASA-1 AI can make a single image sing or talk

AI models are evolving fast. From generating images, videos, and audio to making audio and video clips from a single image, there has been tremendous improvement. Microsoft Research has introduced VASA-1, an AI model that can make an image sing or talk. It turns an image into a video clip with audio and facial expressions that suit the audio.

VASA is a new AI model from Microsoft that can generate hyper-realistic talking faces from a single image. You only need to input an image and a single audio clip to get a realistic video clip. The VASA model not only lip-syncs the audio but also generates facial nuances and natural head movements to suit the audio and create a realistic effect.

Microsoft just dropped VASA-1.

This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba

10 wild examples:

1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD

— Min Choi (@minchoi) April 18, 2024

The VASA model can deliver high-quality video output and significantly outperforms other models capable of generating videos. It can also generate 512×512 videos online at up to 40 FPS with negligible latency. This makes the model useful for creating lifelike avatars that emulate human conversational behaviors.
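To put the 512×512 at 40 FPS figure in perspective, here is a minimal sketch of the arithmetic behind it. The function names are hypothetical and used only for illustration; this is not part of VASA-1 or any Microsoft API, and it says nothing about how the model itself works.

```python
# Illustrative arithmetic only: what "512x512 at 40 FPS" implies for a real-time generator.
# The helper names below are made up for this example.

def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame, in milliseconds."""
    return 1000.0 / fps

def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Number of output pixels produced every second."""
    return width * height * fps

budget = frame_budget_ms(40)                  # 25.0 ms per frame
throughput = pixels_per_second(512, 512, 40)  # 10,485,760 pixels per second

print(f"Per-frame budget: {budget} ms")
print(f"Pixels per second: {throughput:,}")
```

In other words, keeping up with 40 FPS means each 512×512 frame has to be ready in about 25 milliseconds.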

With VASA, users can control the video generation by setting conditions for eye gaze, head distance, and emotional offsets. The model can handle artistic photos like the Mona Lisa, singing audio, and non-English speech to generate hyper-realistic videos.

Microsoft says in its research paper that the work is focused on generating visual affective skills for virtual AI avatars intended for positive use cases. Using the model to create content that is meant to mislead or deceive is against its policies. Microsoft has acknowledged that, like other such models, this one could also be misused to impersonate humans. The company maintains that there is still a gap between what the model can currently achieve and authentic real videos.

Microsoft has no plans to release an online demo, API, additional implementation details, or any other related offerings to the public until it is confident that the tools will be used responsibly and in accordance with proper regulations.

Guru@TWCN