Physics + sound
Marbles in a wooden maze: every collision with its own sound
One neural net takes any combination of inputs (text, up to 7 photos, audio, a source video) and renders a finished clip with sound in one pass. There is no need to chain several models together: a single model handles picture, sound and motion at once.
Hero from the references walks out of a dark corridor onto a rooftop under a dawn sky, slow camera push-in, soft backlight, dust particles in the air, cinematic colour grade.
Four steps — from idea to a finished clip. No external editors, no separate voice-over models, no monthly subscriptions.
Only the text prompt is required. The rest is optional: up to 7 photos (product, hero face, palette), a source video to re-render (up to 30s — the model takes a clip of up to 10s from it), saved characters or saved voices.
Write a prompt (up to 10,000 characters): what's on screen, how the camera moves, lighting, mood, pacing. The model honours your references and keeps the hero in focus.
Pick a duration (4 / 6 / 8 / 10 sec) and a resolution (720p / 1080p / 4K). The model generates the video and a matching audio track in a single task — no separate voice-over, no extra editing.
Result not quite right? Switch the mode, swap references, refine the prompt. Experiments run on the same credit balance — no extra subscriptions needed.
Google's canonical example: a fern video + a fireflies image + a harp audio track + a text prompt — the model fuses everything into a single scene. That's «one model for every input type» in practice.
Demo videos: Google DeepMind, Gemini Omni keynote, May 2026. Mirrored on the Clipia CDN for faster loading.
Gemini Omni picks the mode by itself, based on what you upload on the «Create video» page. Upload a photo and the mode becomes «image-to-video»; add a video and it becomes «re-render»; upload nothing and it stays «text only». No separate model per scenario.
Describe the scene in words — the model handles framing, lighting and camera motion. Sound (physics, ambient, voices) is generated alongside the picture by the same neural net. Great for fast sketches, idea checks and first drafts.
Collect key frames, the product, the style and the palette — the model stitches them into one video. Multi-angle shots of the same hero or smooth style transitions become a matter of uploading the right pictures.
Upload a source up to 30 seconds — the model takes a clip of up to 10 seconds from it and re-renders it in a new style. Change lighting, season, time of day, art style or swap objects by description. Composition, timing and camera motion are preserved.
Attach up to three saved characters — face, body and styling stay identical across generations. Useful for ad series, avatars and personal brands.
Attach up to three saved voices and the model generates video with voice-over or an on-camera presenter. Lip movements are synced to the speech during generation, with no extra editing.
Footage from the Google keynote shows the full range of capabilities: physics and audio, realistic hands and reflections, art styles, object morphing, kinetic typography. All clips served from the Clipia CDN.
Gemini Omni's signature UX: after generating, refine the clip with chat-style messages. The model remembers the conversation and updates only what you asked for — faces, timing and scene logic stay consistent. It's not Photoshop or a «magic wand»; it's a step-by-step dialogue with the model.
Upload an existing video and describe the style — the model re-renders the scene while preserving motion and timing. Object swaps and aesthetic shifts work from a text description, with no masks or manual outlining.
Four key differences from classic text-to-video models like Veo 3.1, Sora 2, Kling 3 and Seedance 2.0.
Combine key frames, palette, hero face and style in a single request. Shared budget: 7 «slots» across images, video and characters (each image = 1 slot, video = 2, saved character = 1).
Up to three locked characters. Face and body stay the same between generations, so series, avatars and franchises become feasible without training extra models.
Sound — physics, ambient, voices — is generated by the same neural net as the picture, so it stays in sync with on-screen action. No separate voice-over model: everything in one pass.
720p for fast drafts, 1080p for social, 4K for final delivery. Lengths 4 / 6 / 8 / 10 seconds for any format.
Six scenarios where having different input types in one model gives a real speed boost.
Three practical reasons — no marketing fluff.
No switching between «from text», «from image» and «from video» across different UIs. Any input — one window, one model, one price.
Google's site requires AI Plus at $20/mo. On Clipia — credits from your plan, any card, support included.
Sora 2, Veo 3.1, Seedance 2.0, Kling 3, Wan 2.7 — right next to it. Compare a single prompt across models in minutes.
Quick answers on modes, limits and pricing.
Google's new model, announced at Google I/O 2026. Takes text, up to 7 images, video and audio — outputs a 4–10 second clip in up to 4K with sound, in one pass. The key difference from Veo 3.1, Sora 2 and Kling 3 is that it's not a chain of several models — it's one neural net that works across all inputs at once.
Five: «text only», «text + up to 7 photos», «text + a source video up to 30 seconds» (the model takes a clip of up to 10 sec from it), «text + up to 3 saved characters» (to keep the hero consistent across clips) and «text + up to 3 saved voices». The mode is picked automatically — based on which fields you fill in on the «Create video» page.
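The automatic mode selection described above can be sketched as a small function. This is only an illustration: the name `detect_modes`, its parameters, and the choice to return every matching label when several input types are filled in are assumptions, not the actual logic used by Clipia or Google.

```python
from typing import List


def detect_modes(images: int = 0, has_video: bool = False,
                 characters: int = 0, voices: int = 0) -> List[str]:
    """Return the mode labels implied by which inputs are filled in.

    Hypothetical helper: the labels mirror the five modes listed in the text.
    """
    modes = []
    if images:
        modes.append("text + photos")
    if has_video:
        modes.append("text + source video")
    if characters:
        modes.append("text + saved characters")
    if voices:
        modes.append("text + saved voices")
    # With nothing attached, only the prompt is used.
    return modes or ["text only"]
```

For example, `detect_modes()` yields `["text only"]`, while `detect_modes(images=2)` yields `["text + photos"]`.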
The model has a shared budget of 7 «slots» for your inputs: each image = 1 slot, a source video = 2 slots, a saved character = 1 slot. Saved voices don't count. Example: 5 images + 1 character = 6 slots, fine. 6 images + 1 video = 8, over the limit — the request is rejected before sending.
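The slot arithmetic above is easy to check in a few lines. The sketch below uses the slot costs and the limit of 7 stated in the text; the names `SLOT_LIMIT`, `SLOT_COST`, `slots_used` and `fits_budget` are illustrative and not part of any published API.

```python
# Slot costs and the limit come from the text above; all names are illustrative.
SLOT_LIMIT = 7
SLOT_COST = {"image": 1, "video": 2, "character": 1, "voice": 0}


def slots_used(images: int = 0, videos: int = 0,
               characters: int = 0, voices: int = 0) -> int:
    """Total slots consumed by the attached inputs (voices cost nothing)."""
    return (images * SLOT_COST["image"]
            + videos * SLOT_COST["video"]
            + characters * SLOT_COST["character"]
            + voices * SLOT_COST["voice"])


def fits_budget(**inputs: int) -> bool:
    """True if the request stays within the shared slot budget."""
    return slots_used(**inputs) <= SLOT_LIMIT


print(slots_used(images=5, characters=1), fits_budget(images=5, characters=1))  # 6 True
print(slots_used(images=6, videos=1), fits_budget(images=6, videos=1))          # 8 False
```

The two calls reproduce the examples from the text: 6 slots fits, 8 slots is over the limit.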
Four fixed values: 4, 6, 8 and 10 seconds. In «video-to-video» the duration is set by the slice of the source video (start–end), capped at 10 seconds per run. Google's site caps at 10 seconds too.
720p (default), 1080p and 4K. Aspect ratio is 16:9 (landscape) or 9:16 (vertical, for Shorts/Reels). Square 1:1 is not supported at launch. 4K roughly doubles the cost and generation time relative to 1080p.
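The allowed output settings can be written down as a simple pre-flight check. This is a minimal sketch covering the fixed-duration modes only (in «video-to-video» the duration follows the source slice). The function `validate_output` and the constant names are hypothetical; only the allowed values are taken from the text.

```python
from typing import List

ALLOWED_DURATIONS = (4, 6, 8, 10)               # seconds
ALLOWED_RESOLUTIONS = ("720p", "1080p", "4K")   # 720p is the default
ALLOWED_ASPECTS = ("16:9", "9:16")              # 1:1 is not supported at launch


def validate_output(duration: int, resolution: str = "720p",
                    aspect: str = "16:9") -> List[str]:
    """Return a list of problems; an empty list means the settings are valid."""
    errors = []
    if duration not in ALLOWED_DURATIONS:
        errors.append(f"duration must be one of {ALLOWED_DURATIONS}, got {duration}")
    if resolution not in ALLOWED_RESOLUTIONS:
        errors.append(f"resolution must be one of {ALLOWED_RESOLUTIONS}, got {resolution}")
    if aspect not in ALLOWED_ASPECTS:
        errors.append(f"aspect ratio must be one of {ALLOWED_ASPECTS}, got {aspect}")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a form show every invalid field at once.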
Price depends on resolution and duration. From 40 credits for 720p × 4 seconds up to 145 credits for 4K × 10 seconds. Intermediate combos: 1080p × 8s = 90 credits, 4K × 6s = 115 credits. 4K costs roughly 2× more than 1080p. Credits come from your plan's balance.
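The price points quoted above can be kept in a small lookup table. The sketch below contains only the four combinations stated in the text and deliberately returns `None` for everything else instead of guessing. The names `KNOWN_PRICES` and `price_in_credits` are illustrative.

```python
from typing import Optional

# Only the (resolution, seconds) combinations quoted in the text; others are unknown here.
KNOWN_PRICES = {
    ("720p", 4): 40,
    ("1080p", 8): 90,
    ("4K", 6): 115,
    ("4K", 10): 145,
}


def price_in_credits(resolution: str, duration: int) -> Optional[int]:
    """Return the quoted price in credits, or None if the text does not state it."""
    return KNOWN_PRICES.get((resolution, duration))
```

For example, `price_in_credits("4K", 10)` returns `145`, while `price_in_credits("1080p", 4)` returns `None` because that combination is not quoted.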
A saved character identifier — the face and body the model keeps consistent across different clips. Useful for series with one hero: ad campaigns, avatar sets, episodes. For now, saved characters are created via support; self-service upload is planned later.
Saved voices used for voice-over or an on-screen presenter. The model adjusts lip movement and intonation to the text. Up to 3 saved voices per generation. Voice setup goes through support for now; a public catalogue will come later.
Yes. Generations on any paid Clipia plan (Basic / Standard / Pro / Ultima) can be used in ads, content and products, and sold to clients. Google also adds SynthID, an invisible watermark in the video that flags it as AI-generated. It's the new industry standard for transparency.
Sora 2: cinematic physics and long clips up to 25 seconds. Veo 3.1: fixed 8 seconds with sound. Kling 3: 3–15s clips and multi-shot scenes. Seedance 2.0: up to 9 references and fast generation. Gemini Omni stands out on input breadth: it's the only model where text, up to 7 photos, video, a saved character and audio all combine in a single request, handled by one neural net. All four alternatives are also available on Clipia, so pick the one that fits the task.