Gemini Omni · Google DeepMind

Video from anything. Think scene, not model

One neural net takes any combination of inputs — text, up to 7 photos, audio, a source video — and renders a finished clip with sound in one pass. You don't have to stitch it from several models: one model understands every input at once — picture, sound and motion.


5 modes in one model

up to 4K video quality

up to 7 references at once

Prompt + hero photo

Hero from the references walks out of a dark corridor onto a rooftop under a dawn sky, slow camera push-in, soft backlight, dust particles in the air, cinematic colour grade.

→

Single pass

→

Video + sound

How it works

Four steps — from idea to a finished clip. No external editors, no separate voice-over models, no monthly subscriptions.

Pick the inputs

Only the text prompt is required. The rest is optional: up to 7 photos (product, hero face, palette), a source video to re-render (up to 30s — the model takes a clip of up to 10s from it), saved characters or saved voices.

Describe the scene

Write a prompt (up to 10,000 characters): what's on screen, how the camera moves, lighting, mood, pacing. The model honours your references and keeps the hero in focus.

One pass — video+sound

Pick a duration (4 / 6 / 8 / 10 sec) and a resolution (720p / 1080p / 4K). The model generates the video and a matching audio track in a single task — no separate voice-over, no extra editing.

Iterate to perfection

Result not quite right? Switch the mode, swap references, refine the prompt. Experiments run on the same credit balance — no extra subscriptions needed.

Five modes — one model

Gemini Omni picks the mode by itself, based on what you upload on the «Create video» page. Upload a photo and it becomes «image-to-video», add a video and it becomes «re-render», upload nothing and it's «text only». No separate model per scenario.

Text to video

Prompt only

Describe the scene in words — the model handles framing, lighting and camera motion. Sound (physics, ambient, voices) is generated alongside the picture by the same neural net. Great for fast sketches, idea checks and first drafts.

Image to video

Up to 7 photo references

Collect key frames, the product, the style and the palette — the model stitches them into one video. Multi-angle shots of the same hero or smooth style transitions become a matter of uploading the right pictures.

Video to video

Re-render a source video

Upload a source up to 30 seconds — the model takes a clip of up to 10 seconds from it and re-renders it in a new style. Change lighting, season, time of day, art style or swap objects by description. Composition, timing and camera motion are preserved.

Character to video

Same hero in every clip

Attach up to three saved characters — face, body and styling stay identical across generations. Useful for ad series, avatars and personal brands.

Audio to video

Voices and music

Attach up to three saved voices — the model generates video with voice-over or an on-camera presenter. Lip movements sync to the speech right at generation, no extra editing.

10 demos from Google I/O 2026

Footage from the Google keynote shows the full range of capabilities: physics and audio, realistic hands and reflections, art styles, object morphing, kinetic typography. All clips are served from the Clipia CDN.

Physics + sound: Marbles in a wooden maze, every collision with its own sound

Realistic hands: Complex hand animation with floating orbs, no artefacts

Reflections: Chrome and mirrors, accurate environment reflections

Object morphing: Fish turns into a whale, motion and atmosphere preserved

Action replace: Swap the action in an existing clip, same scene, new actors

Style transfer: Art styles applied to people without losing faces

Hologram: Real footage rendered as a hologram, futurist aesthetic

Text + action: Kinetic typography perfectly matching on-screen action

Clay + explainer: Stop-motion claymation, a clay explainer clip

Narrated science: Science explainer with voice-over synced to the action

4 inputs → one video

Google's canonical example: a fern video + a fireflies image + a harp audio track + a text prompt — the model fuses everything into a single scene. That's «one model for every input type» in practice.

Input 1 · Video: Fern in the wind. Real footage; the model picks up the leaf motion and atmosphere.

Input 2 · Image: Fireflies on black. An image adds a visual motif; the model distributes the fireflies across the scene.

Input 3 · Audio: Harp solo. The audio sets the rhythm; fireflies pulse to the beat.

Input 4 · Prompt: «Make fireflies circle the fern in time with the music». The text binds the previous three inputs into a specific scene direction.

Output · Video

Refine in chat — tweak video with text

Gemini Omni's signature UX: after generating, refine the clip with chat-style messages. The model remembers the conversation and updates only what you asked for — faces, timing and scene logic stay consistent. It's not Photoshop or a «magic wand»; it's a step-by-step dialogue with the model.

«Generate a scene of a DJ in a studio»

«Remove the DJ console from the table but keep everything else»

«Show the same scene from above»

«Move the DJ to an open field at sunset»

Video stylization — from sketch to hologram

Upload an existing video and describe the style — the model re-renders the scene while preserving motion and timing. Object swaps and aesthetic shifts work from a text description, with no masks or manual outlining.

Pencil sketch

Puppet animation

Voxel art

Liquid metal

Hologram

What Gemini Omni does

Four key differences from classic text-to-video models like Veo 3.1, Sora 2, Kling 3 and Seedance 2.0.

Up to 7 references at once

Combine key frames, palette, hero face and style in a single request. Shared budget: 7 «slots» across images, video and characters (each image = 1 slot, video = 2, saved character = 1).

Same hero in every clip

Up to three locked characters. Face and body stay the same between generations, so series, avatars and franchises become feasible without training extra models.

Sound generated with the video

Sound — physics, ambient, voices — is generated by the same neural net as the picture, so it stays in sync with on-screen action. No separate voice-over model: everything in one pass.

Up to 4K and 10 seconds

720p for fast drafts, 1080p for social, 4K for final delivery. Lengths 4 / 6 / 8 / 10 seconds for any format.

Where it shines

Six scenarios where having different input types in one model gives a real speed boost.

Ads: Ads and performance creatives

Short film: Short stories and teasers

Avatars: Digital avatars and presenters

Product: Product demos and unboxings

Music video: Music videos and lip-sync

Explainer: Explainer videos and science

Why on Clipia

Three practical reasons — no marketing fluff.

One process for everything

No switching between «from text», «from image» and «from video» across different UIs. Any input — one window, one model, one price.

No Google AI Plus subscription

Google's site requires AI Plus at $20/mo. On Clipia — credits from your plan, any card, support included.

Dozens of models alongside

Sora 2, Veo 3.1, Seedance 2.0, Kling 3, Wan 2.7 — right next to it. Compare a single prompt across models in minutes.

What is Gemini Omni?

Google's new model, announced at Google I/O 2026. Takes text, up to 7 images, video and audio — outputs a 4–10 second clip in up to 4K with sound, in one pass. The key difference from Veo 3.1, Sora 2 and Kling 3 is that it's not a chain of several models — it's one neural net that works across all inputs at once.

Which modes are supported?

Five: «text only», «text + up to 7 photos», «text + a source video up to 30 seconds» (the model takes a clip of up to 10 sec from it), «text + up to 3 saved characters» (to keep the hero consistent across clips) and «text + up to 3 saved voices». The mode is picked automatically — based on which fields you fill in on the «Create video» page.
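The automatic mode selection can be pictured as a simple rule over which fields are filled in. The sketch below is illustrative only: the `Inputs` class, its field names, the `detect_modes` function and the choice to return several labels when input types are combined are all assumptions, not Clipia's or Google's actual logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Inputs:
    """Hypothetical form state for the «Create video» page."""
    prompt: str
    photos: int = 0                 # uploaded photo references
    has_source_video: bool = False  # a source video to re-render
    characters: int = 0             # saved characters attached
    voices: int = 0                 # saved voices attached


def detect_modes(inputs: Inputs) -> List[str]:
    """Return the mode labels implied by the filled-in fields."""
    if not inputs.prompt.strip():
        raise ValueError("the text prompt is the only required input")
    modes = []
    if inputs.photos:
        modes.append("image-to-video")
    if inputs.has_source_video:
        modes.append("video-to-video")
    if inputs.characters:
        modes.append("character-to-video")
    if inputs.voices:
        modes.append("audio-to-video")
    # Nothing attached: fall back to the prompt-only mode.
    return modes or ["text-to-video"]
```

Under these assumptions, `detect_modes(Inputs(prompt="a fox in snow", photos=2))` yields `["image-to-video"]`, and a prompt with no attachments yields `["text-to-video"]`.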

What's the «reference quota»?

The model has a shared budget of 7 «slots» for your inputs: each image = 1 slot, a source video = 2 slots, a saved character = 1 slot. Saved voices don't count. Example: 5 images + 1 character = 6 slots, fine. 6 images + 1 video = 8, over the limit — the request is rejected before sending.
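The slot arithmetic above can be checked with a few lines of code. This is a minimal sketch: the slot costs and the budget of 7 come from the text, while the function and constant names are hypothetical, and per-type caps (such as three characters) are not modelled here.

```python
SLOT_BUDGET = 7

# Slot cost per input type, as stated in the text. Saved voices are free.
SLOT_COST = {"image": 1, "video": 2, "character": 1, "voice": 0}


def slots_used(images: int = 0, videos: int = 0,
               characters: int = 0, voices: int = 0) -> int:
    """Total slots consumed by one request."""
    return (images * SLOT_COST["image"]
            + videos * SLOT_COST["video"]
            + characters * SLOT_COST["character"]
            + voices * SLOT_COST["voice"])


def fits_quota(**counts: int) -> bool:
    """True when the request stays within the shared 7-slot budget."""
    return slots_used(**counts) <= SLOT_BUDGET


print(slots_used(images=5, characters=1), fits_quota(images=5, characters=1))
print(slots_used(images=6, videos=1), fits_quota(images=6, videos=1))
```

The two `print` calls reproduce the examples from the answer: 6 slots (accepted) and 8 slots (rejected).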

What durations are available?

Four fixed values: 4, 6, 8 and 10 seconds. In «video-to-video» the duration is set by the slice of the source video (start–end), capped at 10 seconds per run. Google's site caps at 10 seconds too.
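As a rough illustration of these duration rules, the sketch below validates a fixed duration and a video-to-video slice. The limits (4/6/8/10 seconds, 30-second source, 10-second slice) are taken from the text; the function names and the exact error behaviour are assumptions.

```python
ALLOWED_DURATIONS = (4, 6, 8, 10)  # seconds, for non-video-to-video modes
MAX_SOURCE_SECONDS = 30            # longest source video accepted
MAX_SLICE_SECONDS = 10             # longest slice re-rendered per run


def validate_duration(seconds: int) -> int:
    """Accept only the four fixed durations."""
    if seconds not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {ALLOWED_DURATIONS}")
    return seconds


def slice_duration(source_length: float, start: float, end: float) -> float:
    """Return the length of a start-end slice of the source video."""
    if source_length > MAX_SOURCE_SECONDS:
        raise ValueError("source video is longer than 30 seconds")
    if not 0 <= start < end <= source_length:
        raise ValueError("slice must lie inside the source video")
    length = end - start
    if length > MAX_SLICE_SECONDS:
        raise ValueError("slice is longer than 10 seconds")
    return length
```

For example, `slice_duration(30.0, 12.0, 20.0)` gives an 8-second clip, while a 12-second slice would be refused.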

What resolutions and formats?

720p (default), 1080p and 4K. Aspect ratio is 16:9 or 9:16. Square 1:1 is not supported at launch (landscape and vertical only, for Shorts/Reels). 4K roughly doubles the cost and generation time relative to 1080p.

How much does a generation cost?

Price depends on resolution and duration. From 40 credits for 720p × 4 seconds up to 145 credits for 4K × 10 seconds. Intermediate combos: 1080p × 8s = 90 credits, 4K × 6s = 115 credits. 4K costs roughly 2× more than 1080p. Credits come from your plan's balance.
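To budget a batch of experiments, the quoted price points can be kept in a small lookup. The sketch below contains only the four combinations named in the text and deliberately does not guess the others; the names `KNOWN_PRICES`, `quoted_price` and `runs_affordable` are hypothetical.

```python
from typing import Optional

# Only the price points quoted in the text, in credits.
KNOWN_PRICES = {
    ("720p", 4): 40,
    ("1080p", 8): 90,
    ("4K", 6): 115,
    ("4K", 10): 145,
}


def quoted_price(resolution: str, seconds: int) -> Optional[int]:
    """Credits for a combination, or None if the text does not quote it."""
    return KNOWN_PRICES.get((resolution, seconds))


def runs_affordable(balance: int, resolution: str, seconds: int) -> Optional[int]:
    """How many generations a credit balance covers, if the price is known."""
    price = quoted_price(resolution, seconds)
    if price is None:
        return None
    return balance // price
```

With a balance of 300 credits, `runs_affordable(300, "1080p", 8)` returns 3, since each run costs 90 credits.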

What is a «saved character»?

A saved character is a stored identity: the face and body that the model keeps consistent across different clips. Useful for series with one hero: ad campaigns, avatar sets, episodes. For now, saved characters are created via support; self-service upload is planned later.

What is a «saved voice»?

A saved voice is a stored voice used for voice-over or an on-screen presenter. The model adjusts lip movement and intonation to the text. Up to 3 saved voices per generation. Voice setup goes through support for now; a public catalogue will come later.

Can I use the videos commercially?

Yes. Generations on any paid Clipia plan (Basic / Standard / Pro / Ultima) can be used in ads, content, products and sold to clients. Google also adds SynthID — an invisible watermark in the video that flags it as AI-generated. It's the new industry standard for transparency.

How is it different from Sora 2, Veo 3.1 and Seedance 2.0?

Sora 2: cinematic physics and long clips up to 25 seconds. Veo 3.1: fixed 8 seconds with sound. Kling 3: 3–15s clips and multi-shot scenes. Seedance 2.0: up to 9 references and fast generation. Gemini Omni stands out on input breadth: it's the only one of these where text, up to 7 photos, video, a same-hero character and audio all combine in a single request, all in one neural net. All of these models are available on Clipia, so pick the one that fits the task.

Demo videos: Google DeepMind, Gemini Omni keynote, May 2026. Mirrored on the Clipia CDN for faster loading. · Original →
