HappyHorse-1.0 by Alibaba: World #1 Video Model Now on Clipia
Overview, real demos across eight scenarios, full spec, and the recipe for your first clip

On April 27, 2026, Alibaba officially released HappyHorse-1.0 — the model that climbed to #1 on the Artificial Analysis Video Arena anonymously back in early April and has held the top spot in both tracks ever since: T2V Elo 1,332 (no audio) and I2V Elo 1,391. On Clipia the model is now live in four modes — Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit.
This post covers the stealth-launch story, real demo clips from the public showcase, the full technical spec, and how to ship your first HappyHorse-1.0 clip in a few minutes.
Open HappyHorse-1.0 on Clipia →
From stealth release to public model: the HappyHorse-1.0 story
On April 7, 2026, an entry titled "HappyHorse-1.0" appeared on Artificial Analysis Video Arena — no developer logo, no whitepaper. Within 24 hours it claimed #1 in Text-to-Video and Image-to-Video with a record 74-Elo lead over the previous holder, Seedance 2.0. Caixin Global reported the model accumulated over 12,000 paired blind comparisons in the first 48 hours.
On April 10, Bloomberg and CNBC simultaneously confirmed the author: Alibaba, with development inside DAMO Academy. Seventeen days later, on April 27, 2026, the public release went live. Since that day, third-party platforms have been able to integrate the model, and it shipped in four use-case variants right out of the gate.
Why the anonymous launch? Alibaba was direct: to collect clean user votes without brand bias. A clip labelled "Alibaba" would have skewed votes both ways — some boosted, some downvoted by default. Anonymous testing yielded the most honest benchmark.
What Elo on Video Arena tells us
Artificial Analysis Video Arena is the industry standard for evaluating video models. The format is borrowed from LM Arena for language models: users see two clips generated by different models from the same prompt and vote "left wins / right wins / tie". An Elo score is computed across many comparisons — chess-style.
HappyHorse-1.0 standings as of April 28, 2026:
- Text-to-Video (no audio) — #1, Elo 1,332. 59-point lead over Seedance 2.0.
- Text-to-Video (with audio) — #2, Elo 1,204. Top spot belongs to Veo 3.1 Quality (native synchronized audio).
- Image-to-Video (no audio) — #1, Elo 1,391. 66-point lead over Seedance 2.0.
- Image-to-Video (with audio) — #2, Elo 1,159. Veo 3.1 leads again.
A 100-point Elo gap means the stronger model wins 64% of comparisons. HappyHorse's 59-66-point lead translates to roughly 58-59% blind-test wins. For an industry where top models usually sit 10-25 Elo apart, that is a substantial gap.
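The 64% figure follows from the standard Elo expected-score formula, E = 1 / (1 + 10^(-gap/400)). A quick sanity check in Python (the formula is the standard one; the helper name is ours):

```python
def elo_win_probability(gap: float) -> float:
    """Expected win rate of the higher-rated model in a blind comparison."""
    return 1 / (1 + 10 ** (-gap / 400))

print(f"{elo_win_probability(100):.0%}")  # 64%: the canonical 100-point gap
print(f"{elo_win_probability(59):.0%}")   # 58%: HappyHorse's T2V lead over Seedance 2.0
print(f"{elo_win_probability(66):.0%}")   # 59%: HappyHorse's I2V lead
```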
What HappyHorse-1.0 can do: real generations
All eight clips below are real HappyHorse-1.0 generations from the public showcase. They show where the model actually shines — and why it overtook Seedance 2.0 on Arena.
1. Liquid physics
HappyHorse's most visible advantage is correct physics for fluids and soft bodies, the area where older models drift fast: milk turns to mush, coffee freezes mid-air, water loses shape on impact.
2. Precise motion and impact
In scenes with fast, accurate actions, what matters is not just animation but preserving object geometry between frames. HappyHorse holds the shape of the club, ball, and shoes without blur even at the moment of contact.
3. Complex string and joint physics
A marionette on strings is a classic stress test for video models. The model has to simultaneously hold the puppet's form, the joint reactions, and the motion of the strings in the puppeteer's hands.
4. Long takes with high detail
HappyHorse delivers up to 15 seconds of coherent video without the "stitched" feeling. That unlocks long tracking shots and complex interior scenes that older models could only fake with explicit cuts.
5. Atmosphere, wind, location portraits
In outdoor scenes HappyHorse pulls noticeably ahead on liveliness — fabrics, hair, ambient light. What matters most: the portrait stays stable across the entire duration without facial drift.
6. Natural child motion
Child kinetics is a separate difficulty class: unpredictable movements, small proportions, frequent occlusions. The model has to keep an object in the character's hands while not "losing" the character through fast pose changes.
7. Humor and surreal scenes
HappyHorse handles realistic and lightly surreal scenes with equal confidence. The model does not break on unusual inputs — it carefully constructs physics inside an imagined situation.
8. Image-to-Video: emotional portrait
I2V is where HappyHorse shines hardest. Feed it a single photo, and the model brings it to life while preserving facial features, lighting, and composition. On portraits like this, competitors drift by the fourth second; HappyHorse holds identity for the full 10-15 seconds.
Animate your photo with HappyHorse I2V →
Technical specs and parameters
All four modes share a single-stream Transformer architecture and the same set of resolutions and durations. The differences are in required inputs and maximum prompt length.
| Parameter | T2V | I2V | R2V | Video Edit |
|---|---|---|---|---|
| Required input | prompt | image (1) | prompt + 1-9 ref images | video + prompt |
| Duration | 3-15 sec (default 5) | 3-15 sec | 3-15 sec | output up to 15 sec, input 3-60 sec |
| Resolution | 720p / 1080p | 720p / 1080p | 720p / 1080p | 720p / 1080p |
| Aspect ratio | 16:9, 9:16, 1:1, 4:3, 3:4 | inherited from image | 16:9, 9:16, 1:1, 4:3, 3:4 | inherited from video |
| Prompt length | up to 5,000 chars | up to 5,000 chars | up to 5,000 chars | up to 5,000 chars |
| Extra assets | — | 1 image (≥300px, 1:2.5-2.5:1) | 1-9 images (≥400px) | 0-5 ref images, audio_setting auto/origin |
| Seed | 0 - 2³¹ | 0 - 2³¹ | 0 - 2³¹ | 0 - 2³¹ |
| Launch | T2V → | I2V → | R2V → | Edit → |
What to keep in mind when prepping inputs:
- I2V does not accept aspect ratio as a parameter — it is inherited from the input image. Need vertical video for Reels or TikTok? Feed a 9:16 image.
- Minimum 300 pixels on the short side for I2V, 400 pixels for R2V. Below that the provider rejects the input; a quick preflight check (see the sketch after this list) saves a wasted upload.
- Reference-to-Video with 9 images is the multi-character mode. Each subject in the prompt is addressed via @character1, @character2, and so on.
- Video Edit accepts up to 5 reference images, useful for restyling a scene while keeping the lead character's identity locked.
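Since rejected uploads are the most common failure mode here, it can pay to validate images before sending. A minimal preflight sketch in Python, assuming Pillow; the numeric limits are the documented ones above, the helper itself is ours:

```python
from PIL import Image  # pip install Pillow

# Documented input limits from the spec table above
I2V_MIN_SHORT_SIDE = 300           # px, Image-to-Video
R2V_MIN_SHORT_SIDE = 400           # px, Reference-to-Video
I2V_ASPECT_RANGE = (1 / 2.5, 2.5)  # width/height must stay within 1:2.5-2.5:1

def check_i2v_image(path: str) -> None:
    """Raise ValueError if an image would be rejected by I2V."""
    width, height = Image.open(path).size
    if min(width, height) < I2V_MIN_SHORT_SIDE:
        raise ValueError(f"short side {min(width, height)}px < {I2V_MIN_SHORT_SIDE}px")
    aspect = width / height
    low, high = I2V_ASPECT_RANGE
    if not low <= aspect <= high:
        raise ValueError(f"aspect ratio {aspect:.2f} outside {low:.2f}-{high:.2f}")

check_i2v_image("portrait_9x16.jpg")  # hypothetical file; e.g. 1080x1920 passes
```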
HappyHorse vs Wan 2.7: same company, different teams
The most common misconception in comments: "HappyHorse is just Wan 2.7 rebranded." Alibaba officially refuted this in its comment to Bloomberg. Several video-generation teams work in parallel inside DAMO Academy, and HappyHorse-1.0 is a standalone project, not a descendant of the Wan family.
| Feature | Wan 2.7 | HappyHorse-1.0 |
|---|---|---|
| Architecture | Dual-stream Transformer with thinking mode | Single-stream Transformer |
| Strength | Long textual descriptions, multi-shot | Photorealism, physics, I2V consistency |
| Prompt length | up to 10,000 chars | up to 5,000 chars |
| Duration | up to 15 seconds | up to 15 seconds |
| Native audio | no | yes (T2V and I2V) |
| Reference-to-Video | up to 3 images | up to 9 images |
| On Clipia since | March 2026 | April 2026 |
In practice: Wan 2.7 is better for long narrative prompts and "describe everything in detail, the model fills in the rest" scenarios. HappyHorse wins on short prompts focused on visuals: frame, lighting, material, physics. Closer to a photographer's workflow than a screenwriter's.
Comparison with top competitors: Seedance 2.0, Kling 3.0, Veo 3.1, Grok Imagine
Current top of the video-model leaderboard per Artificial Analysis Video Arena, April 28, 2026:
| Model | Elo T2V | Elo I2V | Native audio | Duration | Launch |
|---|---|---|---|---|---|
| HappyHorse-1.0 | 1,332 (#1) | 1,391 (#1) | yes | up to 15 sec | Open → |
| Seedance 2.0 | 1,273 (#2) | 1,325 (#2) | no | up to 12 sec | Open → |
| Wan 2.7 | 1,298 | — | no | up to 15 sec | Open → |
| Kling 3.0 | 1,254 | 1,298 | no | up to 10 sec (Multi-Shot up to 30) | Open → |
| Veo 3.1 Quality | 1,241 | 1,277 | yes (synchronized) | 8 sec | Open → |
| Grok Imagine Video | 1,195 | 1,218 | no | up to 10 sec | Open → |
When to pick what:
- HappyHorse-1.0 — photorealism, physics, long takes, audio as a bonus. The default top pick for most jobs, especially when you need I2V consistency.
- Seedance 2.0 — best quality-to-cost ratio on the market. The Fast variant on Clipia (since April 16, 2026) is 2-3× cheaper than the base version with comparable output.
- Wan 2.7 — long narrative prompts in English or Chinese, accurate handling of many objects in a scene.
- Kling 3.0 Multi-Shot — short films from 3-6 stitched prompts with narrative logic, up to 30 seconds of continuous storytelling.
- Veo 3.1 Quality — the only top-5 model with synchronized audio (dialogue, lines, frame-accurate effects). The pick for short ads and TikTok clips with speech.
- Grok Imagine Video — niche choice for surreal and art styles where mainstream models feel generic.
How to generate a HappyHorse-1.0 video on Clipia
The model lives on a dedicated HappyHorse-1.0 page or via the general create-video catalog. Each of the four modes opens directly — pick the tile that matches your task:
Text-to-Video: first clip in a minute
Open the Text-to-Video page, set duration to 5-10 seconds and an aspect ratio for your platform (16:9 for YouTube, 9:16 for Reels and Telegram round videos). HappyHorse responds well to short visual prompts; more isn't better. The sweet spot is style + key scene + camera move + atmosphere.
Cinematic long take: a lone figure walks across a deserted
beach at sunset, warm golden light, drone shot with slow rise,
gentle wind moving hair and clothing, soft surf on the soundtrack,
photorealistic, 1080p
Sound is enabled automatically in T2V mode; to shape it, just describe it in words: "soft footsteps on sand", "distant ocean waves", "cinematic ambient music".
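If you drive generation through the API instead of the web UI, a request body assembled from the documented parameters might look like the sketch below. The endpoint path and field names are illustrative assumptions, not a confirmed schema; only the parameter ranges come from the spec table:

```python
import requests  # pip install requests

payload = {
    "model": "happy-horse-t2v",
    "prompt": (
        "Cinematic long take: a lone figure walks across a deserted beach "
        "at sunset, warm golden light, drone shot with slow rise, "
        "gentle wind moving hair and clothing, photorealistic"
    ),
    "duration": 10,          # 3-15 seconds, default 5
    "resolution": "1080p",   # 720p or 1080p
    "aspect_ratio": "16:9",  # 16:9, 9:16, 1:1, 4:3, 3:4
    "seed": 42,              # 0 to 2**31, fix it for reproducible runs
}
# Hypothetical endpoint shown for illustration only
response = requests.post("https://clipia.ai/api/generate", json=payload)
print(response.json())
```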
Image-to-Video: bring a still image to life
Open the Image-to-Video page and upload an image (minimum 300 pixels on the short side, JPEG/PNG/WebP). The prompt is optional — the model will invent natural motion on its own. But if you want control, describe the type of movement and intensity.
The scene comes alive with smooth cinematic motion, gentle
camera push-in, soft lighting shifts revealing new details,
atmospheric particles drifting through the frame, natural
micro-movements
For the input image we recommend our own Nano Banana Pro — it holds composition and text better than anything else in the catalog, which translates directly into cleaner animation downstream.
Reference-to-Video: scenes with multiple characters
Open the Reference-to-Video page. The mode takes 1 to 9 reference images. Each character is addressed in the prompt via @character1, @character2, and so on. This is the most controllable mode for complex scenes with dialogue and interaction.
@character1 walks toward @character2 on a sun-drenched city
street, both smiling, exchanging a handshake in slow motion,
light rays and atmospheric dust creating a warm golden glow,
cinematic medium shot
Video Edit: rewrite an existing clip
Open the Video Edit page, upload a finished video (3-60 seconds, up to 2,160 pixels on the long side) and describe what to change — style, lighting, time of day, effects. You can attach up to 5 reference images to lock the visual direction. Setting audio_setting: origin keeps the original audio track; auto regenerates it to match the new visuals.
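A matching sketch for Video Edit, again with illustrative field names; only the audio_setting values and asset limits follow the documented spec:

```python
# Illustrative Video Edit payload; parameter names/ranges mirror the spec above
edit_payload = {
    "model": "happy-horse-video-edit",
    "video_url": "https://example.com/source_clip.mp4",  # input 3-60 s, up to 2,160 px long side
    "prompt": "Regrade to a rainy night: neon reflections, wet asphalt, keep the lead actor unchanged",
    "reference_images": ["style_ref.png"],  # up to 5 refs to lock the visual direction
    "audio_setting": "origin",  # "origin" keeps the source track, "auto" regenerates it
}
```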
FAQ
When did HappyHorse-1.0 land on Clipia?
The model was added to the Clipia catalog on April 28, 2026, the day after Alibaba's public release. All four modes are available: Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit.
How many credits does one generation cost?
Cost depends on duration and resolution. Current prices are listed in the model catalog and the pricing matrix: https://clipia.ai/api/models/happy-horse-t2v/pricing-matrix. On Clipia 1 credit ≈ $0.06 (Basic plan: 240 credits for $15).
How is HappyHorse-1.0 different from Wan 2.7? They are both Alibaba.
Both models come from DAMO Academy, but from different teams with different architectures. Wan 2.7 is a dual-stream Transformer with thinking mode, tuned for long textual prompts and multi-shot output. HappyHorse is single-stream, focused on photorealism, physics, and I2V consistency. Plus HappyHorse generates native audio; Wan 2.7 does not.
Why did HappyHorse beat Seedance 2.0 on Arena?
The main advantages: improved fluid and soft-body physics, long uninterrupted shots without style decay, and significantly better subject consistency in Image-to-Video. On the I2V track HappyHorse leads Seedance 2.0 by 66 Elo, which translates to roughly 59% blind-test wins.
What is Elo rating for video models?
Elo is a chess-derived rating system adapted for AI models. On Artificial Analysis Video Arena a user sees two clips from different models for the same prompt and picks the better one. Across hundreds of thousands of comparisons a numerical rating is computed. A 100-point gap means the stronger model wins 64% of the time.
Does it accept non-English prompts?
Yes, HappyHorse-1.0 supports multilingual prompts including English, Chinese, Russian, Spanish, and others. In practice, mixing works best: describe the scene and story in your native language, and add cinematography terms in English ("slow dolly push-in", "shallow depth of field", "backlit golden hour"). For long narrative prompts in Chinese, Wan 2.7 still has the edge.
What platforms and formats does the model output?
HappyHorse outputs MP4 in 720p or 1080p at 24 fps. Aspect ratio is set per generation: 16:9 for YouTube and desktop, 9:16 for Reels, Shorts, and Telegram round videos, 1:1 for square Instagram posts, 4:3 and 3:4 for artistic and vintage compositions. Duration ranges from 3 to 15 seconds per generation.