Third generation: Generalizing with Veo
Our latest breakthrough builds on Veo, Google’s state-of-the-art video generation model. A key strength of Veo is its ability to generate videos that capture complex interactions between light, material, texture, and geometry. Its powerful diffusion-based architecture, and its ability to be fine-tuned on a variety of multi-modal tasks, enable it to excel at novel view synthesis.
To fine-tune Veo to transform product images into a consistent 360° video, we first curated a dataset of millions of high-quality, synthetic 3D assets. We then rendered the 3D assets from various camera angles and lighting conditions. Finally, we created a dataset of paired images and videos, and trained Veo to generate 360° spins conditioned on one or more input images.
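As a rough illustration of the rendering step, the camera trajectory for a 360° spin can be sampled as evenly spaced azimuth angles orbiting the asset, with each rendered sequence later paired with one or more of its own frames as the conditioning images. This is a minimal, hypothetical sketch (all function names and parameters are ours, not the production pipeline, which also varied lighting and used a full renderer):

```python
import math

def spin_camera_positions(radius=2.0, elevation_deg=20.0, num_frames=24):
    """Camera positions evenly spaced in azimuth for a 360° orbit
    around an asset centered at the origin (hypothetical sampling)."""
    elev = math.radians(elevation_deg)
    positions = []
    for i in range(num_frames):
        azim = 2.0 * math.pi * i / num_frames
        x = radius * math.cos(elev) * math.cos(azim)
        y = radius * math.cos(elev) * math.sin(azim)
        z = radius * math.sin(elev)
        positions.append((x, y, z))
    return positions

def make_training_pair(frames, num_conditioning=1):
    """Pair a few rendered frames (the 'product images') with the
    full spin video they came from, as a supervised example."""
    step = max(1, len(frames) // num_conditioning)
    conditioning = frames[::step][:num_conditioning]
    return {"conditioning_images": conditioning, "target_video": frames}
```

Each rendered orbit then yields one supervised example: the model sees the conditioning images and is trained to reproduce the full spin.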
We discovered that this approach generalized effectively across a diverse set of product categories, including furniture, apparel, electronics, and more. Veo was not only able to generate novel views that adhered to the available product images, but it was also able to capture complex lighting and material interactions (e.g., shiny surfaces), something that was challenging for the first- and second-generation approaches.