The Ultimate Multimodal Prompt Collection

The generative AI landscape has evolved far beyond single-modality image generation. With the Gemini 3 family of models, Google DeepMind has delivered a unified creative stack that spans every medium: Nano Banana for hyper-realistic image generation and editing, Lyria for high-fidelity music and audio composition, and Veo 3.1 for cinematic video with synchronous audio.

This collection is the definitive prompting guide for that entire stack. Sourced from Google's official documentation, the NanoPrompts.org community library, and Chase Jarvis's professional workflows, it covers every modality with Before/After prompt examples that demonstrate the difference between amateur and expert prompting.

Introduction: The Multimodal Ecosystem

Nano Banana models are advanced image generation and editing systems built on the Gemini 3 family. They apply deep reasoning capabilities to fully understand prompts before generating output, delivering precise, rich visual results.

Nano Banana 2 (Gemini 3.1 Flash Image) brings real-time web search integration, fast generation, and Pro-level features like text rendering and upscaling to 2K/4K. Nano Banana Pro (Gemini 3 Pro Image) delivers the highest quality, although its context window (65K tokens) is smaller than Nano Banana 2's (131K tokens).

Lyria generates high-fidelity music and audio with control over genre, tempo, instrumentation, dynamics, and vocals. It supports text-to-music and image-to-music prompting.

Veo 3.1 is the latest evolution in video generation, featuring professional-grade creative controls, multiple aspect ratios, rich synchronous audio, and cinematic camera movement — all driven by structured prompting.

[!TIP] The models are designed to work together. Generate a keyframe with Nano Banana, animate it with Veo 3.1, and score it with Lyria — all in a single production pipeline.

Model Comparison

Feature	Nano Banana 2	Nano Banana Pro	Lyria 3	Veo 3.1
Output Type	Text + Image	Text + Image	Audio	Video + Audio
Resolution	512px - 4K	1K - 4K	N/A	720p - 1080p
Context Window	131K tokens	65K tokens	N/A	N/A
Max Reference Images	14	14	1	4 (ingredients)
Aspect Ratios	10+	10+	N/A	16:9, 9:16
Clip Length	N/A	N/A	N/A	4s, 6s, 8s
Audio Output	No	No	Yes	Yes (synchronous)
Text Rendering	Excellent	Excellent	N/A	N/A
Web Search Integration	Yes	Yes	N/A	N/A
C2PA / SynthID	Yes	Yes	Yes	Yes

Part 1: Text & Image — Nano Banana Pro

Nano Banana Pro excels at photorealistic rendering, character consistency, structured JSON prompting, and professional commercial output. This section covers the frameworks, techniques, and curated prompts that unlock its full potential.

1.1 The 5-Part Prompting Formula

The most reliable structure for Nano Banana generation follows a five-part formula. Start with a strong verb that tells the model the primary operation, then layer in specifics.

Formula: [Subject] + [Action] + [Location/Context] + [Composition] + [Style]

[!EXAMPLE] Before: "A fashion photo"

After: "A striking fashion model wearing a tailored brown dress, sleek boots, and holding a structured handbag. Posing with a confident, statuesque stance, slightly turned. On a seamless, deep cherry red studio backdrop. Medium-full shot, center-framed. Fashion magazine style editorial, shot on medium-format analog film, pronounced grain, high saturation, cinematic lighting effect."

Source: Google Cloud Blog

Figure: Nano Banana Pro fashion editorial generated with the 5-part formula
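
The formula can be treated as five named slots joined in a fixed order. The following is a minimal Python sketch under that assumption; `FivePartPrompt` and `render` are hypothetical names, not part of any Google SDK, and the code only assembles the text you would paste into the model.

```python
from dataclasses import dataclass


@dataclass
class FivePartPrompt:
    """Hypothetical container for the five slots of the formula."""
    subject: str
    action: str
    location: str
    composition: str
    style: str

    def render(self) -> str:
        # Join the slots in formula order, normalising trailing periods.
        parts = [self.subject, self.action, self.location,
                 self.composition, self.style]
        return " ".join(p.strip().rstrip(".") + "." for p in parts)


prompt = FivePartPrompt(
    subject="A striking fashion model wearing a tailored brown dress",
    action="Posing with a confident, statuesque stance, slightly turned",
    location="On a seamless, deep cherry red studio backdrop",
    composition="Medium-full shot, center-framed",
    style="Fashion magazine style editorial, shot on medium-format analog film",
)
print(prompt.render())
```

Keeping each slot as its own field makes it easy to spot which part of a weak prompt is still empty or vague.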

1.2 Pseudo-Code Prompting (Chase Jarvis Technique)

Most people prompt like they're describing a dream — wandering sentences that lead to drift. The professional alternative is Pseudo-Code Prompting: define variables as distinct assets, then instruct the model how to combine them.

Why it works: You separate what from how. When iterating, you only change one variable — the model understands everything else must remain constant.

The Structure:

[VARIABLES]
SUBJECT_A = "Professional female model, mid-30s, sharp features,
  wearing a structured oversized beige blazer, silk texture."
LOCATION_B = "Brutalist architecture interior, concrete walls,
  sharp geometric shadows."
LIGHTING_C = "High-contrast rim lighting, cool blue fill from
  the left, warm key light from the right."
CAM_SETTINGS = "Phase One XF, 80mm lens, f/2.8, ISO 100,
  sharp focus on eyes."

[EXECUTION]
Render SUBJECT_A standing in LOCATION_B. Apply LIGHTING_C to
emphasize the texture of the blazer. Use CAM_SETTINGS for
a hyper-realistic commercial fashion look.

[!TIP] Use the "Thinking" or "Reasoning" mode for complex physics-based lighting. Adding [REASONING: Calculate true light paths based on light source position] asks the model to check the lighting for physical plausibility before it renders.
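
The variable-and-execution split maps naturally onto a small templating helper. This is a sketch only: `render_pseudo_code_prompt` is a hypothetical function, and the shortened variable values are illustrative. It shows the iteration pattern described above, where one variable changes and every other line of the prompt stays identical.

```python
def render_pseudo_code_prompt(variables: dict[str, str], execution: str) -> str:
    """Render a [VARIABLES] / [EXECUTION] prompt (hypothetical helper)."""
    # Fail early if the execution text never uses a declared variable.
    unused = [name for name in variables if name not in execution]
    if unused:
        raise ValueError(f"variables never referenced: {unused}")
    lines = ["[VARIABLES]"]
    lines += [f'{name} = "{value}"' for name, value in variables.items()]
    lines += ["", "[EXECUTION]", execution]
    return "\n".join(lines)


variables = {
    "SUBJECT_A": "Professional female model, mid-30s, structured beige blazer.",
    "LOCATION_B": "Brutalist architecture interior, concrete walls.",
    "LIGHTING_C": "High-contrast rim lighting, warm key light from the right.",
}
execution = "Render SUBJECT_A standing in LOCATION_B. Apply LIGHTING_C."

first = render_pseudo_code_prompt(variables, execution)
# Iterate by swapping exactly one variable; everything else stays identical.
second = render_pseudo_code_prompt(
    {**variables, "LOCATION_B": "Sunlit greenhouse, glass walls."}, execution
)
print(second)
```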

1.3 JSON Structured Prompting

For complex compositions where you need precise control over multiple elements, use structured JSON. Nano Banana's reasoning engine recognizes this logic and applies it consistently.

[!EXAMPLE] Before: "A young woman taking a mirror selfie, 2000s aesthetic"

After:

{
  "subject": {
    "description": "A young woman taking a mirror selfie with very long voluminous dark waves and soft wispy bangs",
    "age": "young adult",
    "expression": "confident and slightly playful",
    "hair": {
      "color": "dark",
      "style": "very long, voluminous waves with soft wispy bangs"
    },
    "clothing": {
      "top": {
        "type": "fitted cropped t-shirt",
        "color": "cream white",
        "details": "features a large cute anime-style cat face graphic"
      }
    }
  },
  "photography": {
    "camera_style": "early-2000s digital camera aesthetic",
    "lighting": "harsh super-flash with bright blown-out highlights",
    "angle": "mirror selfie",
    "texture": "subtle grain, retro highlights, V6 realism"
  },
  "background": {
    "setting": "nostalgic early-2000s bedroom",
    "elements": ["chunky wooden dresser", "CD player",
      "hanging beaded door curtain"]
  }
}

Source: @ZaraIrahh
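
Hand-written JSON is easy to break with a stray quote or a missing brace. One way to avoid that, sketched here with Python's standard `json` module, is to build the prompt as a dictionary and serialize it. The field names follow the example above in abbreviated form; nothing here is a required schema.

```python
import json

prompt_spec = {
    "subject": {
        "description": "A young woman taking a mirror selfie",
        "expression": "confident and slightly playful",
    },
    "photography": {
        "camera_style": "early-2000s digital camera aesthetic",
        "lighting": "harsh super-flash with bright blown-out highlights",
    },
    "background": {
        "setting": "nostalgic early-2000s bedroom",
        "elements": ["chunky wooden dresser", "CD player"],
    },
}

# json.dumps guarantees balanced braces and correct quoting and escaping.
prompt_text = json.dumps(prompt_spec, indent=2, ensure_ascii=False)

# Round-trip check: the text parses back to the same structure.
assert json.loads(prompt_text) == prompt_spec
print(prompt_text)
```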

1.4 Positive Framing & Negative Prompting

Nano Banana understands what you want better when you describe the positive outcome rather than the negative.

[!EXAMPLE] Before: "A street with no people, no cars, no modern buildings"

After: "A desolate cobblestone street at dawn, bathed in warm golden light. The storefronts are shuttered and the pavement is empty and still. Atmospheric fog lingers near the ground."

Before: "Remove the red car"

After: "Replace the red car with a gray van that matches the lighting and perspective of the street. The van should appear parked, stationary, blending seamlessly with the ambient shadows."

Source: Google Cloud Blog
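
A quick way to catch negative phrasing before sending a prompt is a simple word scan. The sketch below is a rough heuristic, not a feature of the model: the cue list in `NEGATION_PATTERN` is an assumption and will miss or over-flag some phrasings.

```python
import re

# Assumed, non-exhaustive list of negation cues; tune it for your own prompts.
NEGATION_PATTERN = re.compile(
    r"\b(no|not|without|never|don't|remove)\b", re.IGNORECASE
)


def find_negations(prompt: str) -> list[str]:
    """Return negation cues found in the prompt, lowercased, in order."""
    return [m.group(0).lower() for m in NEGATION_PATTERN.finditer(prompt)]


print(find_negations("A street with no people, no cars, no modern buildings"))
# ['no', 'no', 'no']
print(find_negations("Replace the red car with a gray van."))
# []
```

A non-empty result is a prompt to rephrase, describing what should fill the frame instead of what should be absent.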

1.5 Lighting & Camera Controls

Nano Banana speaks the language of lenses and light. Use specific photographic and cinematic terminology to control depth, distortion, perspective, and mood.

1.5.1 Photography Terminology

Control Type	Example Prompt
Wide Angle	"Shot on Leica SL2 with a 24mm lens. Exaggerate foreground features. Vertical distortion on architectural elements."
Portrait	"Shot on Canon R5 with an 85mm f/1.2 lens. Extremely shallow depth of field. Bokeh should be creamy and circular."
Macro	"100mm Macro lens. 1:1 magnification. Focus stacking simulation for edge-to-edge sharpness on the product texture."
Low Angle	"Low angle shot, looking up at the subject against an overcast sky. Wide-angle perspective to emphasize height and drama."

[!EXAMPLE] Before: "Close-up portrait"

After: "Close-up portrait of a weathered sailor, shot on Fujifilm GFX 100 with a 110mm f/2 lens. The lens creates natural background compression. Skin texture must be visible — pores, fine lines, sun damage. Catchlights present in the eyes. Dramatic chiaroscuro lighting from a single window on the left."

Source: Chase Jarvis

1.5.2 Lighting Ratios

Nano Banana's reasoning engine calculates light bounces with surprising accuracy. Define the behavior of light explicitly.

Lighting Style	Prompt Description
Rembrandt	"Classic Rembrandt lighting. Key light at 45 degrees elevation, 45 degrees horizontal. Triangle of light on the shadowed cheek. Deep shadow density ratio of 3:1."
Commercial	"High-key commercial lighting. Large softbox source overhead. White bounce cards filling shadows. Even, flattering illumination. Catchlights in both eyes."
Chiaroscuro	"Dramatic chiaroscuro. Single directional source from above-right. Harsh shadows, high contrast. Key-to-fill ratio of 8:1."
Three-Point	"Three-point lighting: key light at 45 degrees, fill at 90 degrees at 50% power, rim light behind subject at full power creating edge separation."

[!EXAMPLE] Before: "Well-lit portrait"

After: "Three-point studio lighting. Key light: 90cm octabank, camera left, 70% power, creating soft shadows on the right cheek. Fill light: 60cm softbox, camera right, 35% power, lifting shadows to a 2:1 ratio. Rim light: narrow strip light behind subject, full power, separating hair from background. Deep charcoal gray seamless backdrop."

Source: Google Cloud Blog
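
Lighting ratios convert to photographic stops with a base-2 logarithm, which helps when checking that the numbers in a prompt agree with each other. The sketch below assumes the ratio is simply key level divided by fill level and that power percentages translate directly into light on the subject; real setups also depend on modifier size and distance, and some photographers define the ratio as (key + fill) : fill.

```python
import math


def ratio_to_stops(key: float, fill: float) -> float:
    """Difference between two light levels in photographic stops (log base 2)."""
    if key <= 0 or fill <= 0:
        raise ValueError("light levels must be positive")
    return math.log2(key / fill)


print(ratio_to_stops(70, 35))          # key 70%, fill 35%: 2:1 ratio -> 1.0
print(ratio_to_stops(8, 1))            # chiaroscuro 8:1 -> 3.0
print(round(ratio_to_stops(3, 1), 2))  # Rembrandt 3:1 -> 1.58
```

Under these assumptions, the 70% key and 35% fill in the example above are consistent with the stated 2:1 ratio, a one-stop difference.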

1.5.3 Film Stock & Color Grading

Specify film stock, color science, or grading styles to control the emotional texture of the output.

[!EXAMPLE] Before: "A nostalgic photo"

After: "Kodak Portra 400 film aesthetic. Warm, nostalgic color science with slightly lifted blacks. The green channel pushed toward yellow. Skin tones with a peachy warmth. Subtle film grain, halation around highlights, soft contrast curve."

Before: "Cinematic look"

After: "Cinematic color grading in the classic teal-orange style: muted teal in the shadows, warm orange in the highlights. Crushed, desaturated blacks in the background. Slight vignette."

Source: Google Cloud Blog

1.6 Reference Stack Workflows

The 14-reference-image limit is Nano Banana's most powerful professional feature. Use it strategically for campaign-scale consistency.

[!NOTE] Nano Banana supports up to 14 reference images in a single prompt. Recommended slot allocation: Slots 1-3 for character turnarounds, Slots 4-5 for brand assets (logo, color palette), Slots 6-10 for lighting, style, and mood references, and Slots 11-14 for environment and prop references.

1.6.1 The 14-Slot Reference Stack

[SLOT 1-3] Character Turnaround: front view, 3/4 view, side view
[SLOT 4]  Brand Logo: transparent PNG, primary color palette
[SLOT 5]  Color Swatches: exact hex values for campaign colors
[SLOT 6-8] Lighting References: mood board images defining the light quality
[SLOT 9-10] Style References: photography style, texture direction
[SLOT 11-14] Environment/Prop References: setting details, key props

Prompt:
"Using the character defined in Slots 1-3, place them into the
location described in Slot 11. Apply the lighting quality from
Slot 6. Ensure the brand logo from Slot 4 is visible on the
clothing. Maintain facial structure from Slot 1 exactly. Use
the color grading from Slot 9."
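
When a campaign involves many reference files, it helps to assign slots programmatically so the prompt's slot numbers always match the upload order. The sketch below encodes the layout above; `SLOT_LAYOUT`, `assign_slots`, the role keys, and the file names are all hypothetical, and how images are actually attached depends on the tool you use.

```python
MAX_REFERENCE_IMAGES = 14  # limit stated in this guide

# Slot ranges follow the 14-slot layout above; the role keys are hypothetical.
SLOT_LAYOUT = {
    "character": range(1, 4),
    "logo": range(4, 5),
    "swatches": range(5, 6),
    "lighting": range(6, 9),
    "style": range(9, 11),
    "environment": range(11, 15),
}


def assign_slots(files_by_role: dict[str, list[str]]) -> dict[int, str]:
    """Map image files to slot numbers, refusing overflow within a role."""
    slots: dict[int, str] = {}
    for role, files in files_by_role.items():
        allowed = SLOT_LAYOUT[role]
        if len(files) > len(allowed):
            raise ValueError(f"{role}: {len(files)} files for {len(allowed)} slots")
        for slot, path in zip(allowed, files):
            slots[slot] = path
    return slots


stack = assign_slots({
    "character": ["front.png", "three_quarter.png", "side.png"],
    "logo": ["logo.png"],
    "lighting": ["moody_window.jpg"],
    "environment": ["rooftop.jpg"],
})
print(stack)
```

Unused slots are simply left out, so a prompt can still refer to "Slot 11" even when slots 7 to 10 are empty.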

1.6.2 Weavy Pose-Change Technique

Using the Weavy node interface, you can decouple a subject's identity from their pose — transferring the geometry from one image to another.

[!TIP] Run the generation 3-4 times. If anatomy breaks (fingers, knees), swap the pose reference for a clearer image.

Inputs:

Top Node (Pose): A reference image with the desired geometry — stock photo, sketch, or 3D block-out
Bottom Node (Subject): Your target subject with the appearance you need to preserve

The Pose-Transfer Prompt:

[@img1 is the pose reference]
[@img2 is the character reference]

First, examine img1 and extract the subject's pose, including
the position of all limbs, torso angle, and head orientation.

The objective is to transfer the subject's pose from img1 to img2.

Create an image of the content as shown in img2, but with the
main character of img2 posed in the same way as the character
of img1.

Keep everything else about img2 the same — medium, color,
saturation, lighting quality, background.

Don't change the background or contents of img2. Only transfer
the pose from img1's subject to img2's subject.

Output: The subject from img2 rendered in the exact pose from img1.

[!NOTE] This technique works with sketches and 3D block-outs as the pose reference — not just photographs. Concept artists use this to turn napkin sketches into photorealistic assets.

Figure: Weavy pose-transfer workflow, transferring pose geometry to a character reference

1.6.3 Weavy Style-Cloning Technique

Apply the complete aesthetic of one image to the subject of another — without blending the content.

[!NOTE] The model doesn't just slap a filter on. It re-renders the subject from the ground up using the physics of the style reference.

Inputs:

Style Reference (img1): The "vibe" — defines lighting, texture, color palette, rendering technique
Content Reference (img2): The "subject" — the person, product, or scene you need

The Style-Clone Prompt:

Create an image of the content as shown in [@img2] but with the
same medium, color palette, mood, rendering technique, saturation
level, textures, and overall style of [@img1].

Extract ONLY the aesthetic qualities from img1 — do not include
any objects, subjects, or compositions from img1.

Apply the extracted style to img2's subject while preserving
img2's subject identity and composition.

Output: The subject from img2 rendered with the lighting, texture, and color science of img1.

[!EXAMPLE] Style Reference: A glowing, subsurface-scattering orange illustration of cats on a blue background. Content Reference: A standard photograph of a king cobra. Result: The cobra re-rendered with the translucent orange glow and lighting physics of the cat illustration — without becoming a cat.

Figure: Weavy style-cloning, transferring lighting and texture from one image to another

1.7 Text Rendering & Typography

Nano Banana renders sharp, legible text reliably on posters, packaging, and product mockups. It supports multilingual text in 10+ languages.

[!TIP] Text-first hack: When generating text-heavy images, first converse with the model to generate the text concepts, then request the final image with that text embedded. This ensures the model gets the text right before worrying about composition.

Rules for text rendering:

Rule	Example
Use quotes around text	`"CREATIVE FUTURE"` in bold white sans-serif (Helvetica style)
Describe the font explicitly	"Century Gothic 12px font" or "flowing Brush Script"
Specify placement	"Title text at top, subtitle below, 2/3 text area"
Define layering	"Text acts as a cut-out window over the subject"

[!EXAMPLE] Before: "A poster that says Creative Future"

After: "A typographic poster with a solid black background. The words 'CREATIVE FUTURE' in bold white Helvetica Neue font, filling the center of the frame. The text acts as a cut-out window. A photograph of a misty mountain landscape is visible ONLY inside the letterforms, with soft bokeh in the background."

Source: Google Cloud Blog
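
The first three rules (quote the literal text, name the font, state the placement) can be applied mechanically. The helper below is a hypothetical sketch of that idea; the sentence template is an assumption, not a required syntax.

```python
def text_element(text: str, font: str, placement: str) -> str:
    """Describe one piece of rendered text: quoted string, font, placement."""
    if '"' in text:
        raise ValueError("literal text must not contain double quotes")
    return f'The text "{text}" in {font}, {placement}.'


print(text_element("CREATIVE FUTURE", "bold white sans-serif",
                   "filling the center of the frame"))
# The text "CREATIVE FUTURE" in bold white sans-serif, filling the center of the frame.
```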

1.8 Real-Time Web Search Integration

Nano Banana 2 is powered by real-time information from web search. Instead of describing a fictional scene, instruct the model to retrieve current data and visualize it.

The Formula: [Search/Source Request] + [Analytical Task] + [Visual Translation]

[!EXAMPLE] Before: "The weather in San Francisco today"

After:

Search for current weather conditions, date, and time in San Francisco.
Analytically, use this data to modify the scene: if it's raining,
render the city with overcast skies and wet reflective streets.
If it's sunny, render warm golden light washing over the buildings.
Visualize this as a miniature city-in-a-cup concept embedded within
a realistic, modern smartphone UI. The miniature city should reflect
the actual current weather of San Francisco.

Source: Google Cloud Blog

1.9 Expert Prompt Collection — Text & Image

The following prompts represent battle-tested techniques from the NanoPrompts.org community, Google Cloud documentation, and Chase Jarvis's professional workflows.

1.9.1 Hyperrealistic Celebrity Crowd

[!EXAMPLE]

Create a hyper-realistic, ultra-sharp, full-color large-format
image featuring a massive group of celebrities from different eras,
all standing together in a single wide cinematic frame. The image
must look like a perfectly photographed editorial cover with impeccable
lighting, lifelike skin texture, micro-details of hair, pores,
reflections, and fabric fibers.

GENERAL STYLE & MOOD: Photorealistic, 8k, shallow depth of field,
soft natural fill light + strong golden rim light. High dynamic range,
calibrated color grading. Skin tones perfectly accurate. Crisp fabric
detail with individual threads visible. Balanced composition,
slightly wide-angle lens (35mm), center-weighted.

THE ENVIRONMENT: A luxurious open-air rooftop terrace at sunset
overlooking a modern city skyline. Warm golden light wrapping around
silhouettes. Polished marble surfaces reflecting ambient light.

Source: @SebJefferies

1.9.2 Lighting Diagram Simulation

Simulate complex studio setups before renting gear: a pre-visualization tool that saves studio time.

[!EXAMPLE]

[SETUP]
Subject in center, looking at camera.
Light 1: 10ft octabank, camera left, 50% power, creating soft
  wrap-around shadows.
Light 2: Snooted kicker, camera right rear, 100% power, teal gel
  creating colored edge light on hair and shoulder.
Light 3: Ring light fill, on-axis, 25% power, lifting shadow
  density under the nose.
Background: seamless gray paper, lit evenly.

Render this as a photorealistic simulation of the above lighting
diagram. The subject is a professional male model, mid-40s, wearing
a navy wool suit.

Source: Chase Jarvis

1.9.3 Museum Art Exhibition Fusion

[!EXAMPLE]

A commercial grade photograph of [uploaded reference image] posing
inside a high-end museum exhibition space.

Behind them hangs a large, ornate framed classical oil painting.
The painting depicts the same person but rendered in a rich,
traditional oil painting style with thick, visible impasto
brushstrokes, deep textures, and rich color palettes on canvas.
Gallery spotlights hit the textured paint surface.

Masterpiece, ultra-detailed, cinematic lighting, strong contrast,
dramatic shadows, 8K UHD, highly detailed textures,
professional photography.

Source: @brad_zhang2024

1.9.4 Product Shot with Luxury Lighting

[!EXAMPLE]

Product: [BRAND] [PRODUCT NAME] - [bottle shape],
  [label description], [liquid color]

Scene: Luxury product shot floating on dark water with
  [flower type] in [colors] arranged around it.
  [Lighting style] creates reflections and ripples
  across the water.

Mood & Style: [Adjectives], high-end commercial photography,
  [camera angle], shallow depth of field with soft bokeh
  background

Source: @AmirMushich

1.9.5 Coordinate-to-Image Generation

Generate specific locations at specific times using latitude/longitude coordinates.

[!EXAMPLE] Before: "A famous location"

After: "Create an image at 35.6586 degrees N, 139.7454 degrees E (Tokyo) at 19:00. Golden hour has just passed. The Tokyo Tower is illuminated in orange against a deep blue twilight sky. Cherry blossoms are in full bloom along the walkway. Steam rises from street food vendors. Cinematic composition, wide-angle establishing shot."

Source: Google Cloud Blog (coordinates from Replicate)
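
Mapping tools usually give coordinates as signed decimals, while the prompt above uses hemisphere letters. A small conversion sketch follows; `format_coordinates` is a hypothetical helper and the four-decimal precision is simply chosen to match the example.

```python
def format_coordinates(lat: float, lon: float) -> str:
    """Format signed decimal degrees as 'X degrees N/S, Y degrees E/W'."""
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("latitude or longitude out of range")
    ns = "N" if lat >= 0 else "S"
    ew = "E" if lon >= 0 else "W"
    return f"{abs(lat):.4f} degrees {ns}, {abs(lon):.4f} degrees {ew}"


print(format_coordinates(35.6586, 139.7454))
# 35.6586 degrees N, 139.7454 degrees E
```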

Part 2: Music & Audio — Lyria

Lyria generates high-fidelity music and audio from text prompts and images. Built for creators who need production-ready tracks — not loops — Lyria gives you control over every dimension of a musical arrangement.

2.1 Prompting Architecture

Lyria prompting follows a layered structure. Each layer builds on the last: Genre establishes the foundation, Tempo sets the pace, Instruments fill the arrangement, Dynamics shape the flow, and Vocals carry the melody.

[!NOTE] Lyria supports image-to-music: upload any image and describe its mood to generate a matching soundtrack. Think about the subject, location, lighting, and atmosphere — Lyria interprets these visual cues musically.
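
The layered structure can be enforced with a helper that always emits the layers in the same order, whatever order you wrote them in. This is a sketch under the assumption that plain sentence concatenation is enough; `LAYER_ORDER` and `build_music_prompt` are hypothetical names.

```python
LAYER_ORDER = ["genre", "tempo", "instruments", "dynamics", "vocals"]


def build_music_prompt(**layers: str) -> str:
    """Join the supplied layers in the fixed order; skip any left out."""
    unknown = set(layers) - set(LAYER_ORDER)
    if unknown:
        raise ValueError(f"unknown layers: {sorted(unknown)}")
    return " ".join(layers[name] for name in LAYER_ORDER if name in layers)


print(build_music_prompt(
    vocals="Emotive male tenor lead vocal.",
    genre="1980s arena rock anthem.",
    tempo="128 BPM.",
))
# 1980s arena rock anthem. 128 BPM. Emotive male tenor lead vocal.
```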

2.1.1 Genre & Era Control

Define the primary genre and optionally blend eras or styles.

[!EXAMPLE] Before: "A rock song"

After: "1980s arena rock anthem. Heavy kick drum with double-pedal speed. Thick, gated-reverb snare cracking on beats 2 and 4. Distorted power chords in drop-D tuning. Emotive male tenor lead vocal with long sustained notes. Analog synthesizer pads in the background. Stadium reverb on the entire mix."

Source: DeepMind Lyria Prompt Guide

Genre Blending Examples:

Prompt	Result
"K-pop with a Motown edge"	Contemporary K-pop production values with classic soul vocal phrasing and brass hits
"Classical violins merged into a funk track"	Funk rhythm section with orchestral string arrangements overlaid
"Early 90s hip-hop with 808s and jazz samples"	Boom-bap drums, warm vinyl texture, jazz piano loops

2.1.2 Tempo Specification

Specify tempo directly (BPM) or indirectly (descriptive terms).

[!EXAMPLE] Before: "A fast song"

After: "170 BPM drum and bass track. Rapid-fire hi-hat pattern at 16th notes. Fast-attack synthesizers. Energetic, urgent atmosphere."

Before: "A slow song"

After: "62 BPM slow soul ballad. Spacious drums with long decays. Relaxed tempo that allows each note to breathe."

Source: DeepMind Lyria Prompt Guide
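
To get a feel for what a BPM figure implies, convert it to note durations: one beat lasts 60 / BPM seconds. The sketch below assumes the beat is a quarter note, so a 16th note is a quarter of a beat.

```python
def note_seconds(bpm: float, notes_per_beat: int = 1) -> float:
    """Duration of one note in seconds, assuming the beat is a quarter note."""
    if bpm <= 0 or notes_per_beat <= 0:
        raise ValueError("bpm and notes_per_beat must be positive")
    return 60.0 / bpm / notes_per_beat


print(round(note_seconds(170, 4), 3))  # one 16th note at 170 BPM -> 0.088
print(round(note_seconds(62), 3))      # one beat at 62 BPM -> 0.968
```

At 170 BPM the 16th-note hi-hats land roughly every 88 milliseconds, while at 62 BPM each beat lasts almost a full second, which is why the ballad has room for long decays.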

2.1.3 Instrument Selection

Add specific instruments to shape the sonic character. If you don't specify, Lyria auto-selects instruments to suit the genre.

[!EXAMPLE] Before: "A soul song"

After: "Quintessential 1970s Motown soul. Lush, orchestral R&B production. Warm bassline with melodic fills, locked into a steady drum groove with crisp snare and tambourine. Vintage organ harmonic bed. Three-piece brass section. Gritty, gospel-tinged male tenor lead vocal."

Source: DeepMind Lyria Prompt Guide

Instrument Control	Example
Add unexpected instruments	"1990s R&B with 80s synth" — adds analog synth textures to contemporary production
Specify instrument behavior	"Clean funk-style guitar rhythm, staccato chord stabs on the upbeat, warm wah-wah pedal swells, no distortion"
Layer textures	"Dense orchestral arrangement: string quartet, brass quintet, harp, tubular bells"

2.1.4 Dynamics & Arrangement

Define how music flows between sections — builds, drops, instrumental breaks, and dynamic swells.

[!EXAMPLE] Before: "A song with a loud part"

After: "Wistful and airy. Soft, breathy female vocals with intimacy. The track builds slowly from a quiet piano intro into an explosive chorus at 1:30, with full drum kit, swelling strings, and layered backing vocals. After the chorus, it returns to the quiet piano arrangement with only vocals and soft synth pads."

Before: "A song with background music"

After: "Nocturnal aesthetic with cinematic forward motion. The track opens with ambient synth pads for 8 bars, then introduces a driving 16th-note analog synthesizer bass arpeggio. Percussion anchored by a powerful snare with 1980s gated reverb. Swelling cinematic pads build throughout. Male vocalist with soaring vocal lines enters at bar 16."

Source: DeepMind Lyria Prompt Guide

2.2 Vocals & Lyrics

Lyria supports vocal generation with control over gender, range, timbre, language, and lyric content.

2.2.1 Vocal Profiles

Define the singer's characteristics explicitly.

Vocal Trait	Prompt Example
Gender + Range	"Rich female alto, commanding baritone vocals, clear and high soprano range"
Timbre	"Gravelly, soulful, breathy, bright, warm, nasally"
Language	"Singing in English, French, Korean, Japanese"
Vocal Pattern	"Fast-paced rap verses, laid-back melodic chorus, call-and-response between lead and backing vocals"

[!EXAMPLE] Before: "A song with a singer"

After: "A breathy soprano with intimate, hushed delivery. The voice sits low in the mix, almost whispering. Occasional falsetto runs. No vibrato, no ornamentation — pure, raw emotion. Like a late-night confessional."

Source: DeepMind Lyria Prompt Guide

2.2.2 Custom Lyrics Syntax

Write specific lyrics using the Lyrics: prefix. Add backing vocal echoes in parentheses.

[!EXAMPLE]

Lyrics: The city lights are bleeding through the rain,
We're dancing in the memories left behind.
Running at the speed of a whispered name,
Caught in a rhythm only we can find.

(Letra: Las luces de la ciudad sangran a través de la lluvia,
Bailamos en los recuerdos que dejamos atrás.)

Source: DeepMind Lyria Prompt Guide
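
If lyrics live in a script or spreadsheet, a small formatter keeps the prefix and the parenthesized echoes consistent. The helper below is hypothetical, and placing each echo at the end of its line is an assumption about layout rather than a documented requirement.

```python
from typing import Optional


def format_lyrics(lines: list[str],
                  echoes: Optional[dict[int, str]] = None) -> str:
    """Prefix the block with 'Lyrics:' and append echoes in parentheses."""
    echoes = echoes or {}
    out = []
    for i, line in enumerate(lines):
        if i in echoes:
            # Backing-vocal echo goes in parentheses after the lead line.
            line = f"{line} ({echoes[i]})"
        out.append(line)
    return "Lyrics: " + "\n".join(out)


print(format_lyrics(
    ["The city lights are bleeding through the rain,",
     "We're dancing in the memories left behind."],
    echoes={0: "through the rain"},
))
```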

2.2.3 Thematic Lyrics

Let Lyria generate lyrics by describing the emotional theme clearly.

[!EXAMPLE] Before: "A song"

After:

"A love song about finding yourself after a breakup — bittersweet, hopeful, anthemic"

"A new happy birthday song for your best friend — playful, warm, acoustic guitar-driven"

"An instrumental track evoking the quiet atmosphere of a Rio beach at sunset — gentle bossa nova rhythm, warm nylon guitar, soft wave sounds"

Source: DeepMind Lyria Prompt Guide

2.3 Image-to-Music Prompts

Upload any image to generate music that matches its mood. Think about three dimensions:

Image Dimension	What to Describe	Musical Translation
Subject	Who or what is the focus?	Genre, vocal gender, energy level
Location	Indoor/outdoor, city/nature, setting	Instrumentation, ambient sounds, tempo
Atmosphere	Happy, sad, tense, calm	Key, dynamics, tempo, chord progression

[!EXAMPLE] Image: A proud-looking ginger cat sitting on a blanket draped over a cozy armchair. Soft light streams through a window, illuminating a coffee table with a cup and several stacked books. The cat's eyes are semi-closed — relaxed and sleepy.

Prompt: "A lazy Sunday afternoon. Relaxed acoustic guitar strumming a fingerpicked pattern. Soft jazz piano chords in the background. The sound of a gentle rain on glass. Warm, nostalgic, peaceful. No vocals — pure instrumental atmosphere."

Source: DeepMind Lyria Prompt Guide

2.4 Expert Prompt Collection — Music & Audio

2.4.1 Cinematic Rock Anthem

[!EXAMPLE]

This is a massive, anthemic Alternative Rock chorus in the style
of Post-Grunge and Arena Rock. The foundation is a thunderous,
powerful drum kit: a heavy kick drum hits while a thick,
gated-reverb snare cracks on beats 2 and 4. A driving, melodic
bass line propels the harmony forward, acting as a crucial melodic
anchor.

Layered electric guitars play palm-muted power chords with
aggressive distortion. A lead guitar soars with a sustained
pentatonic solo over the final 8 bars. The mix is thick and dense,
with reverbs reaching 2-3 seconds on the drums.

Floating powerfully over this dense instrumental wall is an emotive
male tenor lead vocal, belting at full chest voice. Backing vocals
harmonize in thirds. The chorus ends with a dramatic drum fill.

Tempo: 128 BPM. Key: E minor.

Source: DeepMind Lyria Prompt Guide

2.4.2 Bossa Nova Sunset

[!EXAMPLE]

An intimate, sophisticated Brazilian Bossa Nova track evoking the
quiet atmosphere of a Rio beach at sunset. The tempo is a gentle
78 BPM. A nylon-string acoustic guitar plays the characteristic
syncopated bossa nova rhythm — staccato chords on beats 2 and 4.
An upright bass walks a relaxed melodic line.

Gentle female vocals in Portuguese, breathy and intimate,
with natural room ambience. The melody floats above the
arrangement with subtle reverb. Soft shakers and a nylon guitar
provide the rhythmic pulse. Gentle wave sounds blend into the
mix as ambient texture.

The arrangement is sparse — only guitar, bass, vocals, and subtle
percussion. Warm, romantic, nostalgic.

Source: DeepMind Lyria Prompt Guide

2.4.3 Electronic Dance Floor

[!EXAMPLE]

Driving electronic dance music at 128 BPM. Four-on-the-floor
kick drum with crisp transient attack. Layered hi-hats — closed
on the 8th notes, open on the off-beats. Sidechained synth
pads pumping in sync with the kick.

A catchy melodic hook played on a warm analog supersaw
synthesizer. Filter sweeps automate on every 8 bars. The bass
is a thick sawtooth wave with moderate compression.

Breakdown at 1:30: all elements drop except a filtered
arpeggiated synth and single kick. Build-up reintroduces
elements one by one. Full release at 2:00 with all elements
at maximum volume. No vocals — pure instrumental energy.

Source: DeepMind Lyria Prompt Guide
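
Timestamps in a prompt work best when they fall on bar boundaries. A quick check, assuming a constant tempo and 4/4 time:

```python
def bars_elapsed(seconds: float, bpm: float, beats_per_bar: int = 4) -> float:
    """Number of bars that have passed at a timestamp, for a constant tempo."""
    return seconds * bpm / 60.0 / beats_per_bar


print(bars_elapsed(90, 128))   # breakdown at 1:30 -> 48.0
print(bars_elapsed(120, 128))  # full release at 2:00 -> 64.0
```

At 128 BPM both timestamps land exactly on a bar line, and the breakdown plus build-up spans 16 bars, a conventional section length for dance music.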

2.4.4 Lo-Fi Hip-Hop Study Session

[!EXAMPLE]

A lo-fi hip-hop instrumental for studying. 85 BPM. Vinyl-warmed
sampled drums with a heavily compressed kick and snare. The hi-hat
pattern is swung slightly. A looped jazz piano sample plays a
minor-key chord progression with natural reverb decay. A warm
vinyl crackle sits at -24 dB in the background.

A double bass plays a walking line that steps through the chord
changes. The entire sample is processed through a low-pass filter
that opens slightly during the chorus section. The mix is warm
and slightly muddy — intentionally lo-fi. No vocals.

Total runtime: 3 minutes, seamless loop.

Source: DeepMind Lyria Prompt Guide

2.4.5 Celtic Folk Ballad

[!EXAMPLE]

A Celtic folk ballad. Solo acoustic guitar in DADGAD tuning,
fingerpicked in a traditional Celtic style. The melody is
played on a tin whistle with natural vibrato. Occasional
violin (fiddle) enters during the chorus, playing a mournful
melody in A minor.

A bodhran drum provides a steady pulse. Male vocalist sings
in a traditional Irish folk style — slight nasality, strong
projection, no vibrato. The lyrics tell a story of a sailor
lost at sea.

The arrangement grows organically: guitar solo intro,
whistle joins at verse 2, full ensemble by the final chorus.
Gentle room reverb. Recorded to sound like a live pub session.

Source: DeepMind Lyria Prompt Guide

2.5 Audio Parameter Quick Reference

Parameter	Values / Range	Prompt Example
Tempo	40-220 BPM	"150 BPM drum and bass", "62 BPM slow ballad"
Key	All major/minor keys	"E minor", "Bb major", "C# minor"
Dynamics	pp to ff	"Crescendo into a fortissimo release"
Genre	Any music genre	"Late 70s disco with funk bass", "Shoegaze dream pop"
Instruments	Any	"Mellotron strings, Rickenbacker 12-string guitar"
Vocals	Male/Female, any language	"Breathy female alto, singing in Japanese"
Lyrics	Custom or thematic	"Lyrics: [your lyrics]", "A love song about loss"
Mix Style	Wet/Dry, lo-fi/wide	"Wet, cavernous reverb, 1970s analog warmth"

Part 3: Video & Motion — Veo 3.1

Veo 3.1 is Google's state-of-the-art video generation model. It brings professional-grade creative controls, multiple aspect ratios, rich synchronous audio, and cinematic camera movement to a prompting-driven workflow.

3.1 Core Capabilities & Tech Specs

Feature	Specification
Resolution	720p or 1080p
Aspect Ratios	16:9, 9:16
Clip Length	4s, 6s, or 8s
Audio	Synchronous, multi-person dialogue, SFX, ambient
Image-to-Video	Stronger prompt adherence than Veo 3
Ingredients to Video	Up to 4 reference images for character/scene consistency
First/Last Frame	Seamless transition between start and end images
Add/Remove Object	Modify generated videos (Veo 2 engine, no audio)
Watermarking	SynthID on all outputs

[!WARNING] Veo 3.1 Add/Remove object currently uses the Veo 2 model internally and does not generate audio.
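
Before submitting a batch of generations, it can save time to check the requested settings against the table above. The sketch below copies the limits from that table; the constant names and `validate_request` are hypothetical and do not correspond to any official SDK.

```python
# Values copied from the spec table above; the names are hypothetical.
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_CLIP_SECONDS = {4, 6, 8}
MAX_INGREDIENT_IMAGES = 4


def validate_request(resolution: str, aspect_ratio: str,
                     clip_seconds: int, reference_images: int = 0) -> list[str]:
    """Return a list of problems; an empty list means the settings fit the table."""
    problems = []
    if resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    if aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    if clip_seconds not in ALLOWED_CLIP_SECONDS:
        problems.append(f"unsupported clip length: {clip_seconds}s")
    if not 0 <= reference_images <= MAX_INGREDIENT_IMAGES:
        problems.append(f"too many reference images: {reference_images}")
    return problems


print(validate_request("1080p", "16:9", 8, reference_images=3))  # []
print(validate_request("4K", "1:1", 10, reference_images=5))
```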

3.2 The 5-Part Cinematic Formula

The Veo 3.1 prompting formula mirrors the image formula but leads with cinematography and appends an audio element, so the five visual parts are followed by a sixth part for sound.

Formula: [Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance] + [Audio]

[!EXAMPLE] Before: "A person working in an office"

After: "Medium shot, a tired corporate worker rubbing his temples in exhaustion, in front of a bulky 1980s computer in a cluttered office late at night. The scene is lit by harsh fluorescent overhead lights and the green glow of the monochrome monitor. Retro aesthetic, shot as if on 1980s color film, slightly grainy. Ambient: the hum of old CRT monitors and the click of a mechanical keyboard."

Source: Google Cloud Blog
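
The formula can be assembled from named elements so that none is forgotten. The sketch below is illustrative only: `VEO_ELEMENTS` and `build_video_prompt` are hypothetical names, and joining the visual elements with commas and introducing the sound with "Ambient:" is one layout that follows the example above, not a required syntax.

```python
VEO_ELEMENTS = ["cinematography", "subject", "action", "context", "style", "audio"]


def build_video_prompt(elements: dict[str, str]) -> str:
    """Assemble the elements in formula order; every element is required."""
    missing = [name for name in VEO_ELEMENTS if not elements.get(name)]
    if missing:
        raise ValueError(f"missing elements: {missing}")
    # Visual elements are comma-joined; audio gets its own closing sentence.
    visual = ", ".join(elements[name] for name in VEO_ELEMENTS[:-1])
    return f"{visual}. Ambient: {elements['audio']}."


print(build_video_prompt({
    "cinematography": "Medium shot",
    "subject": "a tired corporate worker",
    "action": "rubbing his temples in exhaustion",
    "context": "in front of a bulky 1980s computer in a cluttered office late at night",
    "style": "retro aesthetic, shot as if on 1980s color film, slightly grainy",
    "audio": "the hum of old CRT monitors and the click of a mechanical keyboard",
}))
```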

3.3 Camera Movement Language

The [Cinematography] element is the most powerful tool for conveying tone and emotion. Use specific terms from professional filmmaking.

Camera Movement Reference Table

Movement	Description	Prompt Example
Dolly shot	Camera moves toward or away from subject	"Slow dolly in toward the character's face, revealing their expression"
Tracking shot	Camera follows subject horizontally	"Tracking shot following the explorer as she steps into the clearing"
Crane shot	Camera moves up/down on a crane	"Crane shot starting low, ascending high, revealing the vast canyon"
Aerial view	Drone-style overhead shot	"Aerial view from 200 meters, slowly orbiting the castle"
Slow pan	Horizontal rotation	"Slow pan left to reveal the city skyline emerging from fog"
POV shot	First-person perspective	"POV shot from behind the singer, looking out at a cheering crowd"
Dutch angle	Tilted frame for tension	"Dutch angle, tilted 15 degrees, to convey disorientation"

[!EXAMPLE] Before: "Video of a canyon"

After: "Crane shot starting low on a lone hiker and ascending high above, revealing they are standing on the edge of a colossal, mist-filled canyon at sunrise, epic fantasy style, awe-inspiring, soft morning light."

Source: Google Cloud Blog

Composition & Lens Controls

Control	Description	Prompt Example
Wide shot	Full environmental context	"Wide shot revealing the vast temple complex from above"
Close-up	Isolated detail focus	"Extreme close-up on weathered hands tracing ancient carvings"
Low angle	Looking up, empowering	"Low angle, upward tilt, emphasizing the skyscraper's height"
Shallow DOF	Blurred background	"Close-up with very shallow depth of field, soft bokeh"
Wide-angle lens	Exaggerated perspective	"GoPro-style wide-angle, immersive distorted action feel"
Deep focus	Everything in sharp focus	"Deep focus, every element sharp from foreground to background"

[!EXAMPLE] Before: "A woman on a bus"

After: "Close-up with very shallow depth of field, a young woman's face, looking out a bus window at the passing city lights with her reflection faintly visible on the glass, inside a bus at night during a rainstorm, melancholic mood with cool blue tones, moody, cinematic."

Source: Google Cloud Blog

[Image: Veo 3.1 cinematic shot with shallow depth of field]

3.4 Sound Design Controls

Veo 3.1 generates complete, synchronized soundtracks based on text instructions.

Audio Type	Syntax	Example
Dialogue	Quotation marks for speech	'A woman says, "We have to leave now."'
Sound Effects (SFX)	Describe sounds precisely	"SFX: thunder cracks in the distance, followed by heavy rain"
Ambient Noise	Define the soundscape	"Ambient: the quiet hum of a starship bridge, distant console beeps"
Music	Describe the score	"Swell to a cinematic orchestral score with rising strings"

[!EXAMPLE] Before: "Add some sound effects"

After:

[00:00-00:02] Medium shot of a woman entering a dark forest.
SFX: crunching dry leaves underfoot, wind rustling through branches.
Ambient: distant owl calls, the sound of the forest settling for night.

[00:02-00:04] Close-up of the woman's face, eyes widening in fear.
SFX: a sudden snap of a twig nearby. Heartbeat sound effect begins.

[00:04-00:06] Wide shot revealing what she sees: ancient stone ruins.
Music: a single cello note, low and foreboding, sustained for 3 seconds.

Source: Google Cloud Blog
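
The labeling conventions in the table above are easy to apply mechanically. The following is a minimal sketch, assuming you build prompts as plain strings; `dialogue` and `cue` are hypothetical helper names, not Veo functions.

```python
def dialogue(speaker: str, line: str) -> str:
    """Wrap spoken lines in quotation marks, as the table recommends."""
    return f'{speaker} says, "{line}"'


def cue(kind: str, description: str) -> str:
    """Label a non-dialogue cue as SFX, Ambient, or Music."""
    allowed = {"SFX", "Ambient", "Music"}
    if kind not in allowed:
        raise ValueError(f"kind must be one of {sorted(allowed)}")
    return f"{kind}: {description}"


audio_lines = [
    dialogue("A woman", "We have to leave now."),
    cue("SFX", "thunder cracks in the distance, followed by heavy rain"),
    cue("Ambient", "the quiet hum of a starship bridge"),
]
audio_block = "\n".join(audio_lines)
print(audio_block)
```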

3.5 Negative Prompting for Video

Refine your video output by describing exclusions with precision.

[!EXAMPLE] Before: "No bad things in the video"

After:

"A desolate landscape with no buildings, no roads, no vehicles, no modern infrastructure — only wilderness"

"A crowd scene with no blurred faces, no duplicate characters, no distorted hands"

"An underwater scene with no visible camera equipment, no bubbles from artificial sources, no anachronistic objects"

[!NOTE] For video, negative prompting is especially useful for: motion artifacts ("no jittering, no motion blur on static objects"), continuity errors ("the time of day remains consistent throughout"), and visual noise ("no flicker, no frame drops").
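
The "After" examples share one pattern: a scene description followed by a list of concrete "no X" phrases. A small sketch of that pattern follows; the function name `with_exclusions` is an assumption made for illustration.

```python
def with_exclusions(scene: str, exclusions: list[str]) -> str:
    """Append concrete 'no X' phrases to a scene description."""
    items = [e.strip() for e in exclusions if e.strip()]
    if not items:
        return scene  # nothing to exclude; leave the scene untouched
    return f"{scene} with " + ", ".join(f"no {e}" for e in items)


landscape = with_exclusions(
    "A desolate landscape",
    ["buildings", "roads", "vehicles", "modern infrastructure"],
)
print(landscape)
```

Listing exclusions as data makes it simple to reuse a standard set (for example, motion artifacts) across many prompts.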

3.6 Advanced Creative Workflows

3.6.1 Workflow: First and Last Frame Transitions

Create controlled camera movements or transformations between two distinct images using the First/Last Frame feature.

Step 1 — Generate the starting frame with Nano Banana:

Medium shot of a female pop star singing passionately into a vintage
microphone. She is on a dark stage, lit by a single, dramatic
spotlight from the front. She has her eyes closed, capturing an
emotional moment. Photorealistic, cinematic, shot on medium-format
camera, 85mm lens, shallow depth of field.

[Image: Veo 3.1 first frame, female pop star on dark stage]

Step 2 — Generate the ending frame with Nano Banana:

POV shot from behind the singer on stage, looking out at a large,
cheering crowd. The stage lights are bright, creating lens flare.
You can see the back of the singer's head and shoulders in the
foreground. The audience is a sea of lights and silhouettes.
Energetic atmosphere. Photorealistic, cinematic.

[Image: Veo 3.1 last frame, POV from stage looking at audience]

Step 3 — Animate with Veo 3.1:

The camera performs a smooth 180-degree arc shot, starting with
the front-facing view of the singer and circling around her to
seamlessly end on the POV shot from behind her on stage. The
singer sings "when you look me in the eyes, I can see a million
stars." SFX: crowd cheering, stage lights humming. Music: swelling
arena rock anthem.

[!TIP] The transition prompt should describe the camera movement and what happens between the two frames, not just repeat the images.

3.6.2 Workflow: Ingredients to Video (Character Consistency)

Generate multi-shot scenes with consistent characters using the Ingredients to Video feature with up to 4 reference images.

Step 1 — Generate your ingredients with Nano Banana:

Create reference images for each character and the setting (up to 4 total).

Step 2 — Compose the scene:

Using the provided images for the detective, the woman, and the
office setting, create a medium shot of the detective behind his
desk. He looks up at the woman and says in a weary voice,
"Of all the offices in this town, you had to walk into mine."
SFX: the creak of an old office chair, rain on glass outside.
Ambient: the muffled sound of city traffic, distant thunder.

[Image: Veo 3.1 ingredients-to-video, detective dialogue scene]

Using the provided images for the detective, the woman, and the
office setting, create a shot focusing on the woman. A slight,
mysterious smile plays on her lips as she replies, "You were
highly recommended." Camera slowly dollies toward her face.
Lighting: a desk lamp creates a pool of warm light, the rest
of the office fades into shadow.

[!NOTE] The Ingredients to Video feature now supports audio generation alongside the consistent character visuals. Each shot can have its own dialogue, SFX, and ambient audio.
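
Both shot prompts above open with the same reference preamble, which is what ties each shot back to the ingredient images. The sketch below generates that preamble and enforces the four-image limit. It is a plain string helper under stated assumptions: `ingredient_shots` and the file paths are hypothetical, and no upload or API call is performed.

```python
MAX_REFERENCE_IMAGES = 4  # limit stated in the spec table


def ingredient_shots(references: dict[str, str], shots: list[str]) -> list[str]:
    """Prefix every shot with the same reference preamble.

    `references` maps a short label ("the detective") to an image path.
    The paths are not read here; they only record which file backs each label.
    """
    if not 1 <= len(references) <= MAX_REFERENCE_IMAGES:
        raise ValueError(f"provide 1 to {MAX_REFERENCE_IMAGES} reference images")
    labels = list(references)
    if len(labels) == 1:
        listed = labels[0]
    elif len(labels) == 2:
        listed = " and ".join(labels)
    else:
        listed = ", ".join(labels[:-1]) + f", and {labels[-1]}"
    preamble = f"Using the provided images for {listed}, "
    return [preamble + shot for shot in shots]


refs = {
    "the detective": "detective.png",        # hypothetical file names
    "the woman": "woman.png",
    "the office setting": "office.png",
}
prompts = ingredient_shots(refs, [
    "create a medium shot of the detective behind his desk.",
    "create a shot focusing on the woman.",
])
print("\n\n".join(prompts))
```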

3.6.3 Workflow: Timestamp Prompting

Direct a complete multi-shot sequence with precise cinematic pacing — all within a single generation.

[!EXAMPLE] Before: Single paragraph prompt

After:

[00:00-00:02] Medium shot from behind a young female explorer
with a leather satchel and messy brown hair in a ponytail, as she
pushes aside a large jungle vine to reveal a hidden path.
Camera: slow dolly forward.

[00:02-00:04] Reverse shot of the explorer's freckled face,
her expression filled with awe as she gazes upon ancient,
moss-covered ruins in the background.
SFX: The rustle of dense leaves, distant exotic bird calls.

[00:04-00:06] Tracking shot following the explorer as she
steps into the clearing and runs her hand over the intricate
carvings on a crumbling stone wall. Emotion: Wonder and reverence.

[00:06-00:08] Wide, high-angle crane shot, revealing the
lone explorer standing small in the center of the vast,
forgotten temple complex, half-swallowed by the jungle.
Music: A swelling, gentle orchestral score begins to play.
Ambient: the sound of wind through ancient stone corridors.

Source: Google Cloud Blog
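
Timestamp blocks are easy to get wrong by hand (overlapping ranges, or a total that exceeds the clip length). A minimal sketch of a builder follows. `timestamp_prompt` is a hypothetical helper, and the 8-second default is an assumption taken from the longest clip length in the spec table.

```python
def fmt(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def timestamp_prompt(shots: list[tuple[int, str]], max_length: int = 8) -> str:
    """Build a timestamped prompt from (duration_seconds, description) pairs.

    Segments are laid end to end from 00:00, so they cannot overlap or leave gaps.
    """
    blocks, start = [], 0
    for duration, description in shots:
        if duration <= 0:
            raise ValueError("each shot needs a positive duration")
        end = start + duration
        blocks.append(f"[{fmt(start)}-{fmt(end)}] {description.strip()}")
        start = end
    if start > max_length:
        raise ValueError(f"sequence runs {start}s, longer than {max_length}s")
    return "\n\n".join(blocks)


sequence = timestamp_prompt([
    (2, "Medium shot from behind a young female explorer. Camera: slow dolly forward."),
    (2, "Reverse shot of the explorer's freckled face, filled with awe."),
    (2, "Tracking shot following the explorer into the clearing."),
    (2, "Wide, high-angle crane shot revealing the temple complex."),
])
print(sequence)
```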

3.7 Expert Prompt Collection — Video & Motion

3.7.1 Cinematic Restaurant Scene

[!EXAMPLE]

Slow dolly shot, wide angle, inside a candlelit Italian restaurant
at night. A couple sits at a corner table, engaged in quiet
conversation. Waiters move gracefully between tables carrying
plates of pasta. Warm amber light from candles and Edison bulbs
creates intimate pools of light. The background dining room
softens into bokeh.

Camera slowly tracks toward the couple as one of them reaches
across the table. Dialogue: "You know, I've never told anyone
this before..."

SFX: the gentle clink of wine glasses, soft jazz from a corner
quartet, the murmur of other diners.

Style: romantic cinema, warm color grade, lens flare from candle
light, shallow depth of field, 24fps cinematic motion.

Source: Google Cloud Blog (inspired by Veo 3.1 recipe)

3.7.2 Cyberpunk Street Chase

[!EXAMPLE]

POV shot running through rain-slicked neon-lit cyberpunk alleyways.
The camera bobs and weaves with the runner's pace — handheld
aesthetic, wide-angle lens, fast motion blur on rain drops.
Holographic advertisements flicker in multiple languages.
Steam rises from vents in the pavement.

Cut to: Low angle tracking shot, following the runner's boots
slapping through puddles of neon reflections — pink, cyan, amber.
SFX: footsteps echoing, distant sirens, rain hitting metal.
Ambient: a dense city soundscape, distant chatter in
Japanese and English, hovering vehicles humming overhead.

Style: Blade Runner 2049 meets Akira. High contrast, teal shadows,
orange/amber highlights. Slight film grain. Cinematic letterboxing.

Source: Google Cloud Blog (inspired by Veo 3.1 recipe)

3.7.3 Underwater Documentary

[!EXAMPLE]

Wide establishing shot of a coral reef at midday. Sunlight
shafts pierce the water surface from above, creating god rays.
Schools of colorful fish move in synchronized patterns.
A sea turtle glides slowly through the frame.

Camera slowly pushes in toward a coral formation, revealing
intricate detail. Macro lens simulation, deep focus.
The scene transitions to: Close-up of a tiny clownfish hiding
among an anemone's tentacles.

SFX: Bubbles rising steadily, the muffled sounds of the ocean
surface above, whale song in the distant background.
Music: a gentle orchestral documentary score with soft
strings, woodwinds, and sustained cello notes.

Style: BBC nature documentary, vibrant color saturation,
natural lighting, smooth camera movements, 30fps.

Source: Google Cloud Blog (inspired by Veo 3.1 recipe)

3.7.4 Live Concert — Three-Shot Sequence

[!EXAMPLE]

Shot 1 [00:00-00:02]: Wide shot of a packed outdoor festival
at dusk. Thousands of people raise their phones recording the
main stage. Pyrotechnics erupt behind the headline act.
Camera: static wide, crowd fills the frame.

Shot 2 [00:02-00:05]: Medium shot of the lead singer at
center stage, microphone in hand, belting into the crowd.
Stage lights in every color of the spectrum. Camera: slow
dolly toward the singer. Dialogue: the singer shouts,
"San Francisco, make some noise!"

Shot 3 [00:05-00:08]: Extreme close-up on the singer's face,
sweat on the skin, eyes closed, pure emotion. Lens flares
from stage lights. Music: the band launches into the final
chorus — thunderous drums, distorted guitars, crowd roaring.

Source: Google Cloud Blog (inspired by Veo 3.1 recipe)

3.7.5 Cozy Coffee Shop Interior

[!EXAMPLE]

Medium shot of a cozy independent coffee shop on a rainy
afternoon. A young woman sits at a window table, typing on
a laptop, a latte and open book beside her. Rain streaks
down the window. String lights hang above the counter.

Camera: static medium shot, shallow depth of field on the
woman, the background café activity soft but present.
SFX: the hiss of the espresso machine, gentle rain on glass,
soft indie folk music playing from overhead speakers.

Style: Wes Anderson meets lo-fi aesthetic. Warm amber and
teal color palette. Slightly desaturated. 24fps with gentle
motion. The entire scene feels like a warm hug.

Source: Google Cloud Blog (inspired by Veo 3.1 recipe)

3.8 Veo 3.1 Parameter Quick Reference

Parameter	Options	Prompt Syntax
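Resolution	720p, 1080p	Not set in the prompt; choose it in the generation settings (e.g., the Vertex AI console)
Aspect Ratio	16:9 (landscape), 9:16 (portrait)	Not set in the prompt; choose it in the generation settings (e.g., the Vertex AI console)
Clip Length	4s, 6s, 8s	Not set in the prompt; choose it in the generation settings (e.g., the Vertex AI console)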
Camera Movement	Dolly, tracking, crane, aerial, pan, tilt, POV	"Slow dolly in", "tracking shot following", "crane shot ascending"
Lens	Wide, portrait, macro, fisheye	"Wide-angle lens", "85mm portrait lens simulation"
Depth of Field	Shallow, deep, bokeh	"Very shallow depth of field", "everything in focus"
Frame Rate Feel	Cinematic 24fps, smooth 60fps	"Cinematic 24fps motion", "slow-motion 60fps"
Lighting	Golden hour, harsh flash, studio, ambient	"Harsh fluorescent overhead lights", "warm golden hour backlight"
Film Stock	1980s color film, noir, modern digital	"Shot as if on 1980s color film, slightly grainy"
Dialogue	Quoted speech	"She says, 'We need to go.'"
SFX	Described sounds	"SFX: thunder crack, glass breaking"
Ambient	Soundscape	"Ambient: ocean waves, distant seagulls"
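
The first three rows are generation settings rather than prompt text, so it can help to validate them before submitting a job. The sketch below checks values against the options listed in the table. `ClipSettings` is a hypothetical container for illustration, not a class from Vertex AI or any Google SDK.

```python
from dataclasses import dataclass

# Allowed values copied from the quick-reference table above.
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16"}
CLIP_LENGTHS = {4, 6, 8}  # seconds


@dataclass(frozen=True)
class ClipSettings:
    resolution: str = "1080p"
    aspect_ratio: str = "16:9"
    clip_length: int = 8

    def __post_init__(self):
        # Reject anything the table does not list.
        if self.resolution not in RESOLUTIONS:
            raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")
        if self.clip_length not in CLIP_LENGTHS:
            raise ValueError(f"clip_length must be one of {sorted(CLIP_LENGTHS)}")


portrait = ClipSettings(resolution="720p", aspect_ratio="9:16", clip_length=6)
print(portrait)
```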

The Gemini 3 family of models is designed to work as an integrated production pipeline. Here's how to connect them.

4.1 Nano Banana to Veo 3.1 — Keyframe Workflow

The most powerful video production workflow starts with Nano Banana generating a storyboard of keyframes, then Veo 3.1 animating between them.

Step 1: Generate your keyframes with Nano Banana Pro, one per shot boundary (the timestamp outline from Section 3.6.3 works well as a storyboard).

Step 2: Load the start and end frames into Veo 3.1's First/Last Frame feature.

Step 3: Write a Veo 3.1 prompt that describes the camera movement and audio between the two frames.

Step 4: Add Lyria-generated music to score the final video.
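
Steps 2 and 3 can be sketched as pairing consecutive keyframes and attaching one transition prompt to each pair. This is only a planning aid under stated assumptions: `transition_jobs`, the dictionary keys, and the file names are hypothetical, and nothing here calls Veo 3.1.

```python
def transition_jobs(keyframes: list[str], prompts: list[str]) -> list[dict]:
    """Pair consecutive keyframes into first/last-frame jobs.

    N keyframes yield N - 1 transitions, so exactly N - 1 prompts are required.
    """
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    if len(prompts) != len(keyframes) - 1:
        raise ValueError("need exactly one prompt per transition")
    return [
        {"first_frame": first, "last_frame": last, "prompt": prompt}
        for first, last, prompt in zip(keyframes, keyframes[1:], prompts)
    ]


jobs = transition_jobs(
    ["singer_front.png", "singer_pov.png", "crowd_wide.png"],  # hypothetical files
    [
        "Smooth 180-degree arc shot around the singer. SFX: crowd cheering.",
        "Crane shot rising over the crowd. Music: swelling arena rock anthem.",
    ],
)
for job in jobs:
    print(job["first_frame"], "->", job["last_frame"])
```

Because each clip ends on the frame the next one starts from, the generated clips can be concatenated without a visible jump at the cut.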

4.2 Nano Banana to Lyria — Image-to-Music

Upload any Nano Banana-generated image to Lyria and describe its mood for a matching soundtrack.

[!EXAMPLE] Nano Banana generates: A misty mountain landscape at dawn, a lone hiker silhouetted against an orange sky.

Lyria prompt: "A cinematic ambient soundtrack for a mountain landscape. Solo acoustic guitar playing a contemplative fingerpicked pattern. The sound of wind through pine trees. A single bird call in the distance. No vocals. Expansive, peaceful, meditative. 70 BPM."

Source: DeepMind Lyria Prompt Guide

4.3 Veo 3.1 + Lyria — Full Production Pipeline

For complete productions, use all three models in sequence:

Nano Banana generates concept art and keyframes
Veo 3.1 animates the keyframes with synchronous audio
Lyria generates a custom score that layers with or replaces the Veo 3.1 audio

[!TIP] Use Lyria's stem controls to layer music under Veo 3.1's dialogue and SFX. Prompt Lyria with: "Background instrumental only, designed to sit beneath dialogue and sound effects. Tempo: 95 BPM, unobtrusive acoustic arrangement."

Quick Reference: Top 20 Expert Prompts

Top 10 — Nano Banana (Text & Image)

#	Prompt Description	Category
1	"Professional studio headshot, Sony A7III 85mm f/1.4, three-point lighting, navy suit, natural skin texture, 8K"	Portrait
2	"Pseudo-code: SUBJECT_A (model in blazer) + LOCATION_B (brutalist interior) + LIGHTING_C (rim lighting) — hyper-realistic"	Technique
3	"JSON: Celebrity crowd at sunset rooftop, 35mm, ultraSharp, 8k photorealistic, rim light"	Complex
4	"[@img1=pose] [@img2=character] Transfer pose from img1 to img2, preserve lighting"	Weavy
5	"[@img1=style] [@img2=subject] Apply same medium, texture, color palette from img1 to img2"	Weavy
6	"Lightbox simulation: octabank key, teal gel kicker, ring fill, seamless gray backdrop"	Studio
7	"Text rendering: 'CREATIVE FUTURE' in bold white Helvetica, typographic poster, black background"	Typography
8	"Coordinate visualization: 40.7128 degrees N, 74.0060 degrees W, September 11 2001, 08:46"	Creative
9	"Kodak Portra 400 film aesthetic, warm nostalgic color, golden hour backlight, film grain"	Film Style
10	"Luxury product shot floating on dark water with orchids, reflection ripples, soft bokeh"	Product

Top 5 — Lyria (Music & Audio)

#	Prompt Description	Genre
1	"1980s arena rock anthem, heavy kick, gated reverb snare, emotive male tenor, 128 BPM E minor"	Rock
2	"Brazilian Bossa Nova, 78 BPM, nylon guitar, upright bass, breathy female vocals in Portuguese"	Bossa Nova
3	"Lo-fi hip-hop study track, 85 BPM, vinyl-warmed sampled drums, jazz piano loop, vinyl crackle"	Lo-Fi
4	"Cinematic orchestral score, 95 BPM, cello and violin, designed to sit beneath dialogue"	Cinematic
5	"170 BPM drum and bass, rapid-fire hi-hats, analog synth arpeggio, atmospheric pads, no vocals"	Electronic

Top 5 — Veo 3.1 (Video & Motion)

#	Prompt Description	Technique
1	"Crane shot ascending from lone hiker to reveal mist-filled canyon at sunrise"	Camera Movement
2	"[00:00-00:02] Wide jungle reveal [00:02-00:04] Reverse on face [00:04-00:06] Tracking [00:06-00:08] Crane reveal"	Timestamp
3	"First frame: singer at microphone. Last frame: POV from stage. 180-degree arc shot, crowd cheering"	First/Last Frame
4	"Medium shot, detective says 'Of all the offices in this town...' using provided ingredient images"	Ingredients to Video
5	"POV running through neon cyberpunk alleys, handheld wide-angle, rain-slicked reflections, SFX footsteps"	Cinematic

Conclusion

The Gemini 3 multimodal stack represents a new era of AI content creation — one where text, image, audio, and video are no longer separate disciplines but a unified creative language.

The key principles that run across all three modalities (image, audio, and video):

Be specific. Concrete details outperform vague descriptions in every model.
Start with a verb. Tell the model the primary operation before describing the content.
Use structured formats. Pseudo-code, JSON, timestamp notation — structure gives the model clear constraints.
Reference stacks are your power tool. Up to 14 images for Nano Banana, 4 for Veo 3.1. Use them.
Iterate. Run generations 3-4 times. The first output is rarely the best — refinement is part of the creative process.
Combine modalities. Nano Banana keyframes + Veo 3.1 animation + Lyria scoring = professional-grade productions.

Resources

Awesome Nano Banana Pro Repository — Full prompt collection
Nano Banana Images Gallery — Example outputs
NanoPrompts.org — 400+ curated prompts and workflows
Nano-consistent-150k Dataset — Identity-consistent training data
Google Cloud — Ultimate Nano Banana Prompting Guide
Google Cloud — Ultimate Veo 3.1 Prompting Guide
Google DeepMind — Lyria Prompt Guide
Chase Jarvis — Nano Banana Pro Guide
Chase Jarvis — Weavy Pose-Change
Chase Jarvis — Weavy Style Cloning

This collection is curated from Google Cloud Blog, Google DeepMind documentation, NanoPrompts.org, Chase Jarvis's professional workflow guides, and community contributions on X (Twitter), WeChat, and other platforms. All prompts retain their original sources for attribution.
