Multimodal Input Fusion
Combine a text prompt, reference images, audio direction, and a source clip in a single request. Gemini Omni interprets all of these inputs together rather than handling each format as a separate step, which helps the output stay close to the look and motion you intended.
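
To make the "single request" idea concrete, here is a minimal sketch of how the four inputs might be bundled into one payload. Everything in it is an assumption for illustration: the `FusionRequest` class, the `parts` layout, and the field names are hypothetical and are not taken from any documented Gemini Omni schema. The sketch also assumes that audio direction is supplied as text, and it only builds the request body without sending it anywhere.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List, Optional


def _b64(raw: bytes) -> str:
    """Encode binary media as ASCII text so it can travel inside JSON."""
    return base64.b64encode(raw).decode("ascii")


@dataclass
class FusionRequest:
    """Hypothetical container for one multimodal request (not a real schema)."""

    prompt: str
    reference_images: List[bytes] = field(default_factory=list)
    audio_direction: Optional[str] = None  # assumed to be a text description
    source_clip: Optional[bytes] = None

    def to_payload(self) -> dict:
        """Place every input in one ordered list so they are submitted together."""
        parts = [{"type": "text", "text": self.prompt}]
        for image in self.reference_images:
            parts.append({"type": "reference_image", "data": _b64(image)})
        if self.audio_direction is not None:
            parts.append({"type": "audio_direction", "text": self.audio_direction})
        if self.source_clip is not None:
            parts.append({"type": "source_clip", "data": _b64(self.source_clip)})
        return {"parts": parts}


# Placeholder bytes stand in for real image and video files.
request = FusionRequest(
    prompt="A slow dolly shot through a rain-soaked neon street",
    reference_images=[b"<image bytes>"],
    audio_direction="Low ambient synth with distant thunder",
    source_clip=b"<video bytes>",
)
payload = request.to_payload()
body = json.dumps(payload)  # one serialized body carrying all four inputs
```

The point of the design is that all inputs share one `parts` list in one body, rather than being uploaded and processed in separate calls. The actual request format, if one is published, may look quite different.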









