Gemini Omni · Google DeepMind

Video from anything. Think scene, not model

One neural net takes any combination of inputs — text, up to 7 photos, audio, a source video — and renders a finished clip with sound in one pass. You don't have to stitch it from several models: one model understands every input at once — picture, sound and motion.

5 modes in one model
up to 4K video quality
up to 7 references at once
Prompt + hero photo → single pass (AI) → video + sound

Example prompt: «Hero from the references walks out of a dark corridor onto a rooftop under a dawn sky, slow camera push-in, soft backlight, dust particles in the air, cinematic colour grade.»

How it works

Four steps — from idea to a finished clip. No external editors, no separate voice-over models, no monthly subscriptions.

01

Pick the inputs

Only the text prompt is required. The rest is optional: up to 7 photos (product, hero face, palette), a source video to re-render (up to 30s — the model takes a clip of up to 10s from it), saved characters or saved voices.

02

Describe the scene

Write a prompt (up to 10,000 characters): what's on screen, how the camera moves, lighting, mood, pacing. The model honours your references and keeps the hero in focus.

03

One pass — video+sound

Pick a duration (4 / 6 / 8 / 10 sec) and a resolution (720p / 1080p / 4K). The model generates the video and a matching audio track in a single task — no separate voice-over, no extra editing.

04

Iterate to perfection

Result not quite right? Switch the mode, swap references, refine the prompt. Experiments run on the same credit balance — no extra subscriptions needed.
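
The limits quoted in the four steps can be checked before a request is sent. The following is a minimal sketch in Python, assuming only the numbers stated above (a prompt of up to 10,000 characters, durations of 4 / 6 / 8 / 10 seconds, resolutions of 720p / 1080p / 4K). The names `GenerationSettings` and `validate_settings` are hypothetical and are not part of any Clipia or Google API.

```python
from dataclasses import dataclass

# Limits taken from the steps above. All names here are illustrative only.
MAX_PROMPT_CHARS = 10_000
ALLOWED_DURATIONS = (4, 6, 8, 10)            # seconds
ALLOWED_RESOLUTIONS = ("720p", "1080p", "4K")


@dataclass
class GenerationSettings:
    prompt: str
    duration_s: int
    resolution: str


def validate_settings(settings: GenerationSettings) -> list:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if not settings.prompt.strip():
        problems.append("a text prompt is required")
    if len(settings.prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")
    if settings.duration_s not in ALLOWED_DURATIONS:
        problems.append(f"duration must be one of {ALLOWED_DURATIONS}")
    if settings.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
    return problems


if __name__ == "__main__":
    ok = GenerationSettings("A fern sways in the wind", 8, "1080p")
    print(validate_settings(ok))  # an empty list: nothing to fix
```

Returning a list of problems rather than raising on the first one lets a form show every issue at once.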

4 inputs → one video

Google's canonical example: a fern video + a fireflies image + a harp audio track + a text prompt — the model fuses everything into a single scene. That's «one model for every input type» in practice.

Input 1 · Video: Fern in the wind. Real footage; the model picks up the leaf motion and atmosphere.
Input 2 · Image: Fireflies on black. An image adds a visual motif; the model distributes the fireflies across the scene.
Input 3 · Audio: Harp solo. The audio sets the rhythm; fireflies pulse to the beat.
Input 4 · Prompt: «Make fireflies circle the fern in time with the music». The text binds the previous three inputs into a specific scene direction.
Output · Video

Demo videos: Google DeepMind, Gemini Omni keynote, May 2026. Mirrored on the Clipia CDN for faster loading.

Five modes — one model

Gemini Omni picks the mode by itself, based on what you upload on the «Create video» page. Upload a photo and it becomes «image-to-video», add a video and it becomes «re-render», upload nothing and it's «text only». No separate model per scenario.
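
The automatic mode selection described above can be pictured as a simple mapping from uploaded inputs to mode names. This is only an illustrative sketch: the function `active_modes` and its parameters are hypothetical, and the page does not say how the model treats several input types combined, so the sketch just lists every mode whose input is present.

```python
def active_modes(photos: int = 0, source_video: bool = False,
                 characters: int = 0, voices: int = 0) -> list:
    """List the mode names implied by the inputs (illustrative assumption)."""
    modes = []
    if photos > 0:
        modes.append("image to video")
    if source_video:
        modes.append("video to video")
    if characters > 0:
        modes.append("character to video")
    if voices > 0:
        modes.append("audio to video")
    # With no uploads at all, only the text prompt drives the generation.
    return modes or ["text to video"]


if __name__ == "__main__":
    print(active_modes())                     # ['text to video']
    print(active_modes(photos=3))             # ['image to video']
    print(active_modes(source_video=True))    # ['video to video']
```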

Text to video

Prompt only

Describe the scene in words — the model handles framing, lighting and camera motion. Sound (physics, ambient, voices) is generated alongside the picture by the same neural net. Great for fast sketches, idea checks and first drafts.

Image to video

Up to 7 photo references

Collect key frames, the product, the style and the palette — the model stitches them into one video. Multi-angle shots of the same hero or smooth style transitions become a matter of uploading the right pictures.

Video to video

Re-render a source video

Upload a source up to 30 seconds — the model takes a clip of up to 10 seconds from it and re-renders it in a new style. Change lighting, season, time of day, art style or swap objects by description. Composition, timing and camera motion are preserved.
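
The two limits for this mode (a source of up to 30 seconds, a clip of up to 10 seconds taken from it) can be sketched as a small check on the chosen start and end points. The function `check_slice` is a hypothetical helper written for illustration, not a documented Clipia call.

```python
MAX_SOURCE_S = 30.0   # longest source video accepted, per the text above
MAX_CLIP_S = 10.0     # longest slice re-rendered in one run


def check_slice(source_len_s: float, start_s: float, end_s: float) -> float:
    """Validate a start/end slice and return its length in seconds."""
    if source_len_s > MAX_SOURCE_S:
        raise ValueError("source video is longer than 30 seconds")
    if not (0 <= start_s < end_s <= source_len_s):
        raise ValueError("slice must lie inside the source video")
    length = end_s - start_s
    if length > MAX_CLIP_S:
        raise ValueError("slice is longer than 10 seconds")
    return length


if __name__ == "__main__":
    print(check_slice(30, 5, 15))  # 10
```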

Character to video

Same hero in every clip

Attach up to three saved characters — face, body and styling stay identical across generations. Useful for ad series, avatars and personal brands.

Audio to video

Voices and music

Attach up to three saved voices — the model generates video with voice-over or an on-camera presenter. Lip movements sync to the speech right at generation, no extra editing.

10 demos from Google I/O 2026

Footage from the Google keynote shows the full range of capabilities: physics and audio, realistic hands and reflections, art styles, object morphing, kinetic typography. All clips served from the Clipia CDN.

01/10 Physics + sound: Marbles in a wooden maze, every collision with its own sound
02/10 Realistic hands: Complex hand animation with floating orbs, no artefacts
03/10 Reflections: Chrome and mirrors, accurate environment reflections
04/10 Object morphing: Fish turns into a whale, motion and atmosphere preserved
05/10 Action replace: Swap the action in an existing clip; same scene, new actors
06/10 Style transfer: Art styles applied to people without losing faces
07/10 Hologram: Real footage rendered as a hologram, futurist aesthetic
08/10 Text + action: Kinetic typography perfectly matching on-screen action
09/10 Clay + explainer: Stop-motion claymation, a clay explainer clip
10/10 Narrated science: Science explainer with voice-over synced to the action

Refine in chat — tweak video with text

Gemini Omni's signature UX: after generating, refine the clip with chat-style messages. The model remembers the conversation and updates only what you asked for — faces, timing and scene logic stay consistent. It's not Photoshop or a «magic wand»; it's a step-by-step dialogue with the model.

1. «Generate a scene of a DJ in a studio»
2. «Remove the DJ console from the table but keep everything else»
3. «Show the same scene from above»
4. «Move the DJ to an open field at sunset»

Video stylization — from sketch to hologram

Upload an existing video and describe the style — the model re-renders the scene while preserving motion and timing. Object swaps and aesthetic shifts work from a text description, with no masks or manual outlining.

Pencil sketch
Puppet animation
Voxel art
Liquid metal
Hologram

What Gemini Omni does

Four key differences from classic text-to-video models like Veo 3.1, Sora 2, Kling 3 and Seedance 2.0.

Up to 7 references at once

Combine key frames, palette, hero face and style in a single request. Shared budget: 7 «slots» across images, video and characters (each image = 1 slot, video = 2, saved character = 1).

Same hero in every clip

Up to three locked characters. Face and body stay the same between generations — series, avatars and franchises become realistic without training extra models.

Sound generated with the video

Sound — physics, ambient, voices — is generated by the same neural net as the picture, so it stays in sync with on-screen action. No separate voice-over model: everything in one pass.

Up to 4K and 10 seconds

720p for fast drafts, 1080p for social, 4K for final delivery. Lengths 4 / 6 / 8 / 10 seconds for any format.

Where it shines

Six scenarios where having different input types in one model gives a real speed boost.

Ads: Ads and performance creatives
Short film: Short stories and teasers
Avatars: Digital avatars and presenters
Product: Product demos and unboxings
Music video: Music videos and lip-sync
Explainer: Explainer videos and science

Why on Clipia

Three practical reasons — no marketing fluff.

One process for everything

No switching between «from text», «from image» and «from video» across different UIs. Any input — one window, one model, one price.

No Google AI Plus subscription

Google's site requires AI Plus at $20/mo. On Clipia — credits from your plan, any card, support included.

Dozens of models alongside

Sora 2, Veo 3.1, Seedance 2.0, Kling 3, Wan 2.7 — right next to it. Compare a single prompt across models in minutes.

FAQ

Quick answers on modes, limits and pricing.

Gemini Omni is Google's new model, announced at Google I/O 2026. It takes text, up to 7 images, video and audio, and outputs a 4–10 second clip in up to 4K with sound, in one pass. The key difference from Veo 3.1, Sora 2 and Kling 3 is that it is not a chain of several models: it is one neural net that works across all inputs at once.

There are five modes: «text only», «text + up to 7 photos», «text + a source video up to 30 seconds» (the model takes a clip of up to 10 sec from it), «text + up to 3 saved characters» (to keep the hero consistent across clips) and «text + up to 3 saved voices». The mode is picked automatically, based on which fields you fill in on the «Create video» page.

The model has a shared budget of 7 «slots» for your inputs: each image = 1 slot, a source video = 2 slots, a saved character = 1 slot. Saved voices don't count. Example: 5 images + 1 character = 6 slots, fine. 6 images + 1 video = 8, over the limit — the request is rejected before sending.
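
The slot arithmetic above is easy to reproduce. Here is a minimal sketch using only the costs quoted in the text (image = 1, source video = 2, saved character = 1, saved voices free, budget of 7). The names `SLOT_COST`, `slots_used` and `fits_budget` are hypothetical, and the sketch does not check the separate per-type caps such as three characters or three voices.

```python
# Slot costs as stated in the text; saved voices do not count.
SLOT_COST = {"image": 1, "video": 2, "character": 1, "voice": 0}
SLOT_BUDGET = 7


def slots_used(images: int = 0, videos: int = 0,
               characters: int = 0, voices: int = 0) -> int:
    """Total slots consumed by a set of inputs."""
    return (images * SLOT_COST["image"]
            + videos * SLOT_COST["video"]
            + characters * SLOT_COST["character"]
            + voices * SLOT_COST["voice"])


def fits_budget(**inputs: int) -> bool:
    """True if the inputs stay within the shared 7-slot budget."""
    return slots_used(**inputs) <= SLOT_BUDGET


if __name__ == "__main__":
    print(slots_used(images=5, characters=1))   # 6, within the budget
    print(slots_used(images=6, videos=1))       # 8, over the limit
    print(fits_budget(images=6, videos=1))      # False
```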

Duration takes one of four fixed values: 4, 6, 8 or 10 seconds. In «video-to-video» the duration is set by the slice of the source video (start–end), capped at 10 seconds per run. Google's site caps at 10 seconds too.

Available resolutions are 720p (default), 1080p and 4K. Aspect ratio is 16:9 or 9:16; square 1:1 is not supported at launch (only landscape, and vertical for Shorts/Reels). 4K roughly doubles the cost and generation time relative to 1080p.

Price depends on resolution and duration. It ranges from 40 credits for 720p × 4 seconds up to 145 credits for 4K × 10 seconds. Intermediate combinations: 1080p × 8s = 90 credits, 4K × 6s = 115 credits. 4K costs roughly 2× more than 1080p. Credits come from your plan's balance.
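
For quick budgeting, the quoted price points can be kept in a small lookup table. This sketch contains only the four combinations stated above; it deliberately returns `None` for anything else rather than guessing a formula, since the full price grid is not given here. `KNOWN_PRICES` and `quoted_price` are hypothetical names.

```python
from typing import Optional

# Only the (resolution, seconds) -> credits pairs quoted in the text.
KNOWN_PRICES = {
    ("720p", 4): 40,
    ("1080p", 8): 90,
    ("4K", 6): 115,
    ("4K", 10): 145,
}


def quoted_price(resolution: str, duration_s: int) -> Optional[int]:
    """Return the quoted credit price, or None if the text does not list it."""
    return KNOWN_PRICES.get((resolution, duration_s))


if __name__ == "__main__":
    print(quoted_price("4K", 10))     # 145
    print(quoted_price("1080p", 4))   # None: not listed in the text
```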

A saved character is an identifier for the face and body that the model keeps consistent across different clips. It is useful for series with one hero: ad campaigns, avatar sets, episodes. For now, saved characters are created via support; self-service upload is planned later.

Saved voices are used for voice-over or an on-screen presenter. The model adjusts lip movement and intonation to the text. Up to 3 saved voices can be attached per generation. Voice setup goes through support for now; a public catalogue will come later.

Yes, commercial use is allowed. Generations on any paid Clipia plan (Basic / Standard / Pro / Ultima) can be used in ads, content and products, and can be sold to clients. Google also adds SynthID, an invisible watermark in the video that flags it as AI-generated. It's the new industry standard for transparency.

Sora 2: cinematic physics and long clips up to 25 seconds. Veo 3.1: fixed 8 seconds with sound. Kling 3: 3–15s clips and multi-shot scenes. Seedance 2.0: up to 9 references and fast generation. Gemini Omni stands out on input breadth: it's the only model where text, up to 7 photos, video, a same-hero character and audio all combine in a single request, all in one neural net. The other four are also available on Clipia, so pick the one that fits the task.

Try Gemini Omni

Assemble a prompt, upload references — get a video with sound in a couple of minutes.

No separate Google AI Plus subscription. Pay with credits from your Clipia plan.

Gemini Omni — Google's multimodal AI video model | Clipia.ai