If a picture is worth a thousand words, a video is worth a million. With Veo 3.1 — Google's state-of-the-art video generation model — you can now direct AI-generated video with professional-grade creative controls, multiple aspect ratios, and rich synchronized audio.
This guide offers a framework for directing Veo 3.1, which marks a shift from simple generation to true creative control. Whether you're making product demos, short films, or social media content, these techniques will help you get more consistent, more controllable results on HeyMarmot.
What You'll Learn
- Understand Veo 3.1's full range of capabilities
- Implement a formula to direct scenes with consistent characters and styles
- Direct video and sound using professional cinematic techniques
- Execute complex ideas by combining Veo with Nano Banana in advanced workflows
Veo 3.1 Model Capabilities
Before diving into prompting techniques, let's understand what Veo 3.1 can do.
Core Generation Features
| Feature | Details |
|---|---|
| Resolution | 720p or 1080p |
| Aspect Ratio | 16:9 or 9:16 |
| Clip Length | 4, 6, or 8 seconds |
| Audio | Rich synchronized sound, dialogue, and SFX |
| Scene Comprehension | Deep understanding of narrative structure and cinematic styles |
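If you plan clips in a script before generating them, it can help to check your settings against the table above first. The following is a minimal sketch: `ClipSettings` and its field names are hypothetical and not part of any official SDK, and the allowed values are simply copied from the table.

```python
from dataclasses import dataclass

# Allowed values copied from the capabilities table above.
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16"}
CLIP_LENGTHS_SECONDS = {4, 6, 8}


@dataclass(frozen=True)
class ClipSettings:
    """Hypothetical settings record used only for local planning."""

    resolution: str = "720p"
    aspect_ratio: str = "16:9"
    seconds: int = 8

    def problems(self) -> list[str]:
        """Return one message per value that the table does not list."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"unsupported resolution: {self.resolution}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"unsupported aspect ratio: {self.aspect_ratio}")
        if self.seconds not in CLIP_LENGTHS_SECONDS:
            found.append(f"unsupported clip length: {self.seconds}s")
        return found


# Usage: a vertical 1080p clip passes, a 10-second 4k clip does not.
print(ClipSettings(resolution="1080p", aspect_ratio="9:16", seconds=6).problems())
print(ClipSettings(resolution="4k", seconds=10).problems())
```

An empty list means every value appears in the table; anything else tells you which setting to change before you spend a generation on it.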
Advanced Creative Controls
- Image-to-Video: Animate a source image with strong prompt adherence and enhanced audio-visual quality
- Ingredients to Video: Provide reference images of a scene, character, object, or style to maintain a consistent aesthetic across multiple shots — now with audio generation
- First and Last Frame: Generate natural video transitions between a start image and end image, complete with audio
- Add/Remove Object: Introduce new objects or remove existing ones while preserving the scene's original composition
- Digital Watermarking: All generated videos are marked with SynthID
A Formula for Effective Prompts
A structured prompt yields consistent, high-quality results. Use this five-part formula:
[Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance]
| Element | Purpose | Examples |
|---|---|---|
| Cinematography | Camera work and shot composition | "Medium shot," "Crane shot," "Dolly in" |
| Subject | Main character or focal point | "A tired corporate worker," "A young explorer" |
| Action | What the subject is doing | "Rubbing his temples," "Pushing aside a vine" |
| Context | Environment and background | "In a cluttered office late at night" |
| Style & Ambiance | Aesthetic, mood, and lighting | "Retro aesthetic, slightly grainy, warm tones" |
Example prompt:
Medium shot, a tired corporate worker, rubbing his temples in exhaustion, in front of a bulky 1980s computer in a cluttered office late at night. The scene is lit by the harsh fluorescent overhead lights and the green glow of the monochrome monitor. Retro aesthetic, shot as if on 1980s color film, slightly grainy.
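If you write many prompts, you may want to keep the five elements as separate fields and assemble them at the end. This is one possible sketch; the `ShotPrompt` class and its `render` method are illustrative names, and the joining rules (commas between the first four parts, style as a closing sentence) are an assumption modeled on the example prompt above.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """Hypothetical container for the five formula elements."""

    cinematography: str
    subject: str
    action: str
    context: str
    style: str

    def render(self) -> str:
        """Join the elements into a single prompt string."""
        missing = [name for name, value in vars(self).items() if not value.strip()]
        if missing:
            raise ValueError(f"missing elements: {', '.join(missing)}")
        # The first four parts form one scene sentence, separated by commas.
        scene_parts = [self.cinematography, self.subject, self.action, self.context]
        scene = ", ".join(part.strip().rstrip(".") for part in scene_parts)
        # Style and ambiance close the prompt as their own sentence.
        return f"{scene}. {self.style.strip().rstrip('.')}."


prompt = ShotPrompt(
    cinematography="Medium shot",
    subject="a tired corporate worker",
    action="rubbing his temples in exhaustion",
    context="in a cluttered office late at night",
    style="Retro aesthetic, slightly grainy, warm tones",
).render()
print(prompt)
```

Keeping the parts separate makes it easy to swap one element, such as the camera work, while holding the rest of the scene constant.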
Essential Prompting Techniques
The Language of Cinematography
The Cinematography element is your most powerful tool for conveying tone and emotion.
Camera movement vocabulary:
- Dolly shot — Camera moves toward or away from the subject on a track
- Tracking shot — Camera follows the subject horizontally
- Crane shot — Camera rises or descends vertically, often revealing scale
- Aerial view — High overhead perspective
- Slow pan — Gradual horizontal camera rotation
- POV shot — First-person perspective from a character's eyes
Example — Crane shot:
Crane shot starting low on a lone hiker and ascending high above, revealing they are standing on the edge of a colossal, mist-filled canyon at sunrise, epic fantasy style, awe-inspiring, soft morning light.
Composition keywords: Wide shot, Close-up, Extreme close-up, Low angle, Two-shot, Over-the-shoulder
Lens & focus keywords: Shallow depth of field, Wide-angle lens, Soft focus, Macro lens, Deep focus
Example — Shallow depth of field:
Close-up with very shallow depth of field, a young woman's face, looking out a bus window at the passing city lights with her reflection faintly visible on the glass, inside a bus at night during a rainstorm, melancholic mood with cool blue tones, moody, cinematic.
Directing the Soundstage
Veo 3.1 can generate a complete soundtrack from your text instructions. Use these three techniques:
Dialogue — Use quotation marks for specific speech:
A woman says, "We have to leave now."
Sound effects (SFX) — Describe sounds with clarity:
SFX: thunder cracks in the distance
Ambient noise — Define the background soundscape:
Ambient noise: the quiet hum of a starship bridge
Combining all three in a single prompt gives Veo the information it needs to create a fully realized audio-visual scene.
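As a rough illustration of combining the three audio techniques, the helper below appends dialogue, SFX, and ambient cues to a visual description using the same labels shown above. The function name `add_audio` and its parameters are hypothetical; treat it as a sketch of the text format, not a required syntax.

```python
def _sentence(text: str) -> str:
    """Strip whitespace and make sure the text ends with punctuation."""
    text = text.strip()
    return text if text.endswith((".", "!", "?")) else text + "."


def add_audio(scene, dialogue=None, sfx=(), ambient=None):
    """Append audio cues to a scene description.

    dialogue: optional (speaker, line) pair, rendered in quotation marks.
    sfx: iterable of sound effect descriptions, each prefixed with "SFX:".
    ambient: optional background soundscape, prefixed with "Ambient noise:".
    """
    parts = [_sentence(scene)]
    if dialogue:
        speaker, line = dialogue
        parts.append(f'{speaker} says, "{_sentence(line)}"')
    for effect in sfx:
        parts.append(_sentence(f"SFX: {effect}"))
    if ambient:
        parts.append(_sentence(f"Ambient noise: {ambient}"))
    return " ".join(parts)


result = add_audio(
    "Wide shot of a starship bridge",
    dialogue=("A woman", "We have to leave now."),
    sfx=["thunder cracks in the distance"],
    ambient="the quiet hum of a starship bridge",
)
print(result)
```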
Mastering Negative Prompts
To refine your output, describe what you wish to exclude as part of the scene rather than as a bare command. Instead of writing only "no buildings," write "a desolate landscape with no buildings or roads." Folding exclusions into the scene description tends to give better results.
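One way to apply this consistently is to keep your exclusions in a list and fold them into the scene sentence automatically. The sketch below assumes a simple "with no ..." phrasing; `describe_without` is a hypothetical helper name.

```python
def describe_without(scene: str, excluded: list[str]) -> str:
    """Fold exclusions into the scene description as a single sentence."""
    scene = scene.strip().rstrip(".")
    items = [item.strip() for item in excluded if item.strip()]
    if not items:
        return scene + "."
    if len(items) == 1:
        listing = items[0]
    else:
        # "buildings or roads", "buildings, roads or power lines"
        listing = ", ".join(items[:-1]) + " or " + items[-1]
    return f"{scene} with no {listing}."


print(describe_without("A desolate landscape", ["buildings", "roads"]))
# prints: A desolate landscape with no buildings or roads.
```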
Prompt Enhancement with Gemini
If you're struggling to add enough detail, use Gemini (or any text AI) to analyze and enrich a simple prompt with more descriptive and cinematic language before feeding it to Veo.
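If you use a text model this way often, a reusable request template saves retyping. The sketch below only builds the instruction text; it does not call Gemini or any other service, and the wording of the request is an assumption you should adapt. The checklist reuses the five elements from the prompt formula earlier in this guide.

```python
FORMULA_PARTS = (
    "Cinematography",
    "Subject",
    "Action",
    "Context",
    "Style & Ambiance",
)


def enrichment_request(simple_prompt: str) -> str:
    """Build an instruction asking a text model to enrich a short video idea."""
    idea = simple_prompt.strip()
    if not idea:
        raise ValueError("simple_prompt must not be empty")
    checklist = "\n".join(f"- {part}" for part in FORMULA_PARTS)
    return (
        "Rewrite the video idea below as one detailed, cinematic prompt "
        "for a video generation model.\n"
        "Cover each of these elements:\n"
        f"{checklist}\n"
        "Return only the rewritten prompt.\n\n"
        f"Idea: {idea}"
    )


request = enrichment_request("a cat on a skateboard")
print(request)
```

Paste the resulting text into the text model of your choice, then review the enriched prompt before sending it to Veo.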
Advanced Creative Workflows
While a single detailed prompt is powerful, multi-step workflows offer unparalleled control. Here's how to combine Veo 3.1 with Nano Banana for complex creative visions.
Workflow 1: Dynamic Transitions with "First and Last Frame"
Create controlled camera movements or transformations between two distinct points of view.
Step 1 — Create the starting frame using Nano Banana:
Medium shot of a female pop star singing passionately into a vintage microphone. She is on a dark stage, lit by a single, dramatic spotlight from the front. She has her eyes closed, capturing an emotional moment. Photorealistic, cinematic.
Step 2 — Create the ending frame with a different angle:
POV shot from behind the singer on stage, looking out at a large, cheering crowd. The stage lights are bright, creating lens flare. You can see the back of the singer's head and shoulders in the foreground. The audience is a sea of lights and silhouettes. Energetic atmosphere.
Step 3 — Animate with Veo using the First and Last Frame feature:
The camera performs a smooth 180-degree arc shot, starting with the front-facing view of the singer and circling around her to seamlessly end on the POV shot from behind her on stage. The singer sings "when you look me in the eyes, I can see a million stars."
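To keep the three prompts of this workflow together, you might store them as one record and read them back in run order. This is a planning sketch only: `FrameTransition` is a hypothetical name, nothing here calls Nano Banana or Veo, and the shortened prompts stand in for the full ones above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameTransition:
    """Hypothetical record holding the prompts for one transition."""

    first_frame_prompt: str
    last_frame_prompt: str
    motion_prompt: str

    def steps(self) -> list[tuple[str, str]]:
        """Return (tool, prompt) pairs in the order they should run."""
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")
        return [
            ("Nano Banana", self.first_frame_prompt),
            ("Nano Banana", self.last_frame_prompt),
            ("Veo 3.1: First and Last Frame", self.motion_prompt),
        ]


plan = FrameTransition(
    first_frame_prompt="Medium shot of a female pop star singing on a dark stage.",
    last_frame_prompt="POV shot from behind the singer, looking out at the crowd.",
    motion_prompt="The camera performs a smooth 180-degree arc shot around her.",
)
for number, (tool, prompt) in enumerate(plan.steps(), start=1):
    print(f"Step {number} [{tool}]: {prompt}")
```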
Workflow 2: Dialogue Scenes with "Ingredients to Video"
Build multi-shot scenes with consistent characters engaged in conversation.
Step 1 — Generate your "ingredients": Create reference images using Nano Banana for your characters and settings.
Step 2 — Compose each shot using the Ingredients to Video feature with relevant reference images:
Shot 1:
Using the provided images for the detective, the woman, and the office setting, create a medium shot of the detective behind his desk. He looks up at the woman and says in a weary voice, "Of all the offices in this town, you had to walk into mine."
Shot 2:
Using the provided images for the detective, the woman, and the office setting, create a shot focusing on the woman. A slight, mysterious smile plays on her lips as she replies, "You were highly recommended."
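Both shots above open with the same reference preface, so a small helper can generate it and keep the wording identical across a scene. The function `ingredient_shots` below is a hypothetical sketch that mirrors the phrasing of Shot 1 and Shot 2; it only builds text and assumes you attach the reference images yourself.

```python
def _join_names(names: list[str]) -> str:
    """Join names as 'a', 'a and b', or 'a, b, and c'."""
    if len(names) <= 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def ingredient_shots(ingredients: list[str], directions: list[str]) -> list[str]:
    """Prefix every shot direction with the same reference-image preface."""
    if not ingredients:
        raise ValueError("at least one ingredient is required")
    preface = f"Using the provided images for {_join_names(ingredients)}, "
    return [preface + direction.strip() for direction in directions]


shots = ingredient_shots(
    ["the detective", "the woman", "the office setting"],
    [
        "create a medium shot of the detective behind his desk.",
        "create a shot focusing on the woman.",
    ],
)
for number, shot in enumerate(shots, start=1):
    print(f"Shot {number}: {shot}")
```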
Workflow 3: Timestamp Prompting
Direct a complete multi-shot sequence with precise cinematic pacing — all within a single generation. Assign actions to timed segments for efficient scene creation with visual consistency.
Example prompt:
[00:00-00:02] Medium shot from behind a young female explorer
with a leather satchel and messy brown hair in a ponytail, as
she pushes aside a large jungle vine to reveal a hidden path.
[00:02-00:04] Reverse shot of the explorer's freckled face, her
expression filled with awe as she gazes upon ancient, moss-covered
ruins in the background. SFX: The rustle of dense leaves, distant
exotic bird calls.
[00:04-00:06] Tracking shot following the explorer as she steps
into the clearing and runs her hand over the intricate carvings
on a crumbling stone wall. Emotion: Wonder and reverence.
[00:06-00:08] Wide, high-angle crane shot, revealing the lone
explorer standing small in the center of the vast, forgotten
temple complex, half-swallowed by the jungle. SFX: A swelling,
gentle orchestral score begins to play.
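Writing the bracketed time ranges by hand is error-prone, so you could generate them from segment durations instead. The sketch below assumes the `[MM:SS-MM:SS]` format used in the example and caps the total at 8 seconds, the longest clip length listed in the capabilities table; `timestamp_prompt` is a hypothetical helper name.

```python
def _mmss(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def timestamp_prompt(segments: list[tuple[int, str]], max_seconds: int = 8) -> str:
    """Build a timestamped prompt from (duration_seconds, description) pairs."""
    if any(duration <= 0 for duration, _ in segments):
        raise ValueError("every segment needs a positive duration")
    total = sum(duration for duration, _ in segments)
    if total > max_seconds:
        raise ValueError(f"segments total {total}s, limit is {max_seconds}s")
    lines = []
    start = 0
    for duration, description in segments:
        end = start + duration
        lines.append(f"[{_mmss(start)}-{_mmss(end)}] {description.strip()}")
        start = end  # the next segment begins where this one ends
    return "\n".join(lines)


print(timestamp_prompt([
    (4, "Close-up of hands opening a vintage leather journal."),
    (4, "Wide shot pulling back to reveal a candlelit wooden desk."),
]))
```

Because each segment starts where the previous one ends, the ranges cannot overlap or leave gaps.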
Quick Reference: Prompt Cheat Sheet
| Category | Keywords to Try |
|---|---|
| Camera Movement | Dolly shot, Tracking shot, Crane shot, Aerial view, Slow pan, POV shot |
| Composition | Wide shot, Close-up, Extreme close-up, Low angle, Two-shot, Over-the-shoulder |
| Lens & Focus | Shallow depth of field, Wide-angle lens, Macro lens, Deep focus, Soft focus |
| Audio - Dialogue | A man says, "...", She whispers, "..." |
| Audio - SFX | SFX: thunder cracks, SFX: glass shattering |
| Audio - Ambient | Ambient noise: rain on a tin roof, Ambient noise: busy cafe chatter |
| Style | Cinematic, Documentary, Retro film grain, Anime, Noir |
| Lighting | Golden hour, Chiaroscuro, Neon-lit, Soft diffused, Harsh spotlight |
Get Started
The best way to master these techniques is to apply them. Head to HeyMarmot and start experimenting with Veo 3.1. Pick one of the formulas above, craft a prompt, and iterate from there.
A few starter prompts to try:
Prompt 1:
Dolly shot pushing into a cozy Japanese coffee shop at dawn.
A barista carefully pours steamed milk into a ceramic cup.
Warm amber light streams through paper screens. SFX: gentle
jazz piano, the soft hiss of the espresso machine. Cinematic,
warm color palette.
Prompt 2:
Crane shot ascending from street level, a lone figure walks
through a neon-lit Tokyo alley at midnight. Rain glistens on
the pavement. Ambient noise: distant city hum, raindrops on
metal awnings. Cyberpunk aesthetic, moody blue and pink tones.
Prompt 3:
[00:00-00:04] Close-up of hands opening a vintage leather
journal, revealing hand-drawn maps and notes. SFX: pages
rustling. [00:04-00:08] Wide shot pulling back to reveal the
person sitting at a candlelit wooden desk in an old library.
Ambient noise: crackling fireplace, distant thunder. Warm, nostalgic
lighting.
Happy directing!
