Multimodal Prompting
Working with images, audio, and video
For most of computing history, software handled one type of data at a time: text in one program, images in another, audio somewhere else. But humans don't experience the world this way. We see, hear, read, and speak simultaneously, combining all these inputs to understand our environment.
Multimodal AI removes that separation. These models can process multiple types of information together—analyzing an image while reading your question about it, or generating images from your text descriptions. This chapter teaches you how to communicate effectively with these systems.
"Multi" means many, and "modal" refers to modes or types of data. A multimodal model can work with multiple modalities: text, images, audio, video, or even code. Instead of separate tools for each type, one model understands them all together.
Why Multimodal Matters
Traditional AI required you to describe everything in words. Want to ask about an image? You'd have to describe it first. Want to analyze a document? You'd need to transcribe it manually. Multimodal models eliminate these barriers.
Upload an image and ask questions about it directly, no description needed.
Example: "What's wrong with this circuit diagram?"
Describe what you want and generate images, audio, or video.
Example: "A sunset over mountains in watercolor style"
Mix text, images, and other media in a single conversation.
Example: "Compare these two designs and tell me which is better for mobile"
Extract information from photos of documents, receipts, or screenshots.
Example: "Extract all the line items from this invoice photo"
Why Prompting Matters Even More for Multimodal
With text-only models, the AI receives exactly what you type. But with multimodal models, the AI must interpret visual or audio information—and interpretation requires guidance.
Vague multimodal prompt
What do you see in this image? [image of a complex dashboard]
Guided multimodal prompt
This is a screenshot of our analytics dashboard. Focus on:
1. The conversion rate graph in the top-right
2. Any error indicators or warnings
3. Whether the data looks normal or anomalous
[image of a complex dashboard]
Without guidance, the model might describe colors, layout, or irrelevant details. With guidance, it focuses on what actually matters to you.
When you look at an image, you instantly know what's important based on your context and goals. The AI doesn't have this context unless you provide it. A photo of a crack in a wall could be: a structural engineering concern, an artistic texture, or irrelevant background. Your prompt determines how the AI interprets it.
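In practice, that context travels alongside the media in the request itself. Most chat-style APIs accept a list of content parts per message; the sketch below follows the spirit of OpenAI's chat format, though field names vary by provider and the URL is illustrative:

```python
# A guided multimodal request, sketched as a generic chat-API payload.
# Field names follow OpenAI-style content parts; other providers differ.

def build_guided_image_request(context: str, focus_points: list[str],
                               image_url: str) -> dict:
    """Combine context, numbered focus points, and an image into one message."""
    prompt = context + "\nFocus on:\n" + "\n".join(
        f"{i}. {point}" for i, point in enumerate(focus_points, start=1)
    )
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }

message = build_guided_image_request(
    "This is a screenshot of our analytics dashboard.",
    ["The conversion rate graph in the top-right",
     "Any error indicators or warnings",
     "Whether the data looks normal or anomalous"],
    "https://example.com/dashboard.png",
)
```

The point is structural: the text part carries your goals, and the image part carries the pixels, so the model never has to guess what matters.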
The Multimodal Landscape
Different models have different capabilities. Here's what's available in 2025:
Understanding Models (Input → Analysis)
These models accept various media types and produce text analysis or responses.
GPT-4o: Text + Images + Audio → Text. OpenAI's flagship with a 128K-token context, strong creative and reasoning abilities, and reduced hallucination rates.
Claude: Text + Images → Text. Anthropic's safety-focused model with advanced reasoning, excellent for coding and complex multi-step tasks.
Gemini: Text + Images + Audio + Video → Text. Google's model with a 1M-token context, self-fact-checking, and fast processing for coding and research.
Llama 4: Text + Images + Video → Text. Meta's open-source model with a massive 10M-token context for long documents and codebases.
Grok: Text + Images → Text. xAI's model with real-time data access and social media integration for up-to-date responses.
Generation Models (Text → Media)
These models create images, audio, or video from text descriptions.
DALL·E: Text → Images. OpenAI's image generator with high fidelity to prompt descriptions.
Midjourney: Text + Images → Images. Known for artistic quality, style control, and aesthetic outputs.
Sora: Text → Video. OpenAI's video generation model for creating clips from descriptions.
Whisper: Audio → Text. OpenAI's speech-to-text with high accuracy across languages.
The multimodal landscape changes quickly. New models launch frequently, and existing models gain capabilities through updates. Always check the latest documentation for current features and limitations.
Image Understanding Prompts
The most common multimodal use case is asking AI to analyze images. The key is providing context about what you need.
Basic Image Analysis
Start with a clear request structure. Tell the model what aspects to focus on.
This prompt provides a clear framework for image analysis. The model knows exactly what information you need.
Analyze this image and describe:
1. **Main Subject**: What is the primary focus of this image?
2. **Setting**: Where does this appear to be? (indoor/outdoor, location type)
3. **Mood**: What emotional tone or atmosphere does it convey?
4. **Text Content**: Any visible text, signs, or labels?
5. **Notable Details**: What might someone miss at first glance?
6. **Technical Quality**: How is the lighting, focus, and composition?
[Paste or describe the image you want to analyze]
Image description or URL: ${imageDescription}
Structured Output for Images
When you need to process image analysis programmatically, request JSON output.
Get structured data from image analysis that's easy to parse and use in applications.
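A practical wrinkle: models often wrap the requested JSON in a markdown code fence. A minimal parsing sketch that tolerates this (the `reply` string is illustrative):

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Strip an optional markdown code fence, then parse the JSON body."""
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (with its optional "json" tag)
        # and everything after the closing fence.
        text = text.split("\n", 1)[1]
        text = text.rsplit("```", 1)[0]
    return json.loads(text)

reply = '```json\n{"summary": "A dashboard screenshot", "confidence": "high"}\n```'
data = parse_json_reply(reply)
```

Production code would also handle replies that are not valid JSON at all, for example by re-prompting the model.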
Analyze this image and return a JSON object with the following structure:
{
"summary": "One sentence description",
"objects": ["List of main objects visible"],
"people": {
"count": "number or 'none'",
"activities": ["What they're doing, if any"]
},
"text_detected": ["Any text visible in the image"],
"colors": {
"dominant": ["Top 3 colors"],
"mood": "Warm/Cool/Neutral"
},
"setting": {
"type": "indoor/outdoor/unknown",
"description": "More specific location description"
},
"technical": {
"quality": "high/medium/low",
"lighting": "Description of lighting",
"composition": "Description of framing/composition"
},
"confidence": "high/medium/low"
}
Image to analyze: ${imageDescription}
Comparative Analysis
Comparing multiple images requires clear labeling and specific comparison criteria.
Compare two or more images with specific criteria that matter to your decision.
Compare these images for ${purpose}:
**Image A**: ${imageA}
**Image B**: ${imageB}
Analyze each image on these criteria:
1. ${criterion1} (importance: high)
2. ${criterion2} (importance: medium)
3. ${criterion3} (importance: low)
Provide:
- Side-by-side comparison for each criterion
- Strengths and weaknesses of each
- Clear recommendation with reasoning
- Any concerns or caveats
Document and Screenshot Analysis
One of the most practical applications of multimodal AI is analyzing documents, screenshots, and UI elements. This saves hours of manual transcription and review.
Document Extraction
Scanned documents, photos of receipts, and PDF pages rendered as images can all be processed. The key is telling the model what type of document it is and what information you need.
Extract structured data from photos of documents, receipts, invoices, or forms.
This is a photo/scan of a ${documentType}.
Extract all information into structured JSON format:
{
"document_type": "detected type",
"date": "if present",
"key_fields": {
"field_name": "value"
},
"line_items": [
{"description": "", "amount": ""}
],
"totals": {
"subtotal": "",
"tax": "",
"total": ""
},
"handwritten_notes": ["any handwritten text"],
"unclear_sections": ["areas that were hard to read"],
"confidence": "high/medium/low"
}
IMPORTANT: If any text is unclear, note it in "unclear_sections" rather than guessing. Mark confidence as "low" if significant portions were hard to read.
Document description: ${documentDescription}
Screenshot and UI Analysis
Screenshots are goldmines for debugging, UX review, and documentation. Guide the AI to focus on what matters.
Get detailed analysis of screenshots for debugging, UX review, or documentation.
This is a screenshot of ${applicationName}.
Analyze this interface:
**Identification**
- What screen/page/state is this?
- What is the user likely trying to accomplish here?
**UI Elements**
- Key interactive elements (buttons, forms, menus)
- Current state (anything selected, filled in, or expanded?)
- Any error messages, warnings, or notifications?
**UX Assessment**
- Is the layout clear and intuitive?
- Any confusing elements or unclear labels?
- Accessibility concerns (contrast, text size, etc.)?
**Issues Detected**
- Visual bugs or misalignments?
- Truncated text or overflow issues?
- Inconsistent styling?
Screenshot description: ${screenshotDescription}
Error Message Analysis
When you encounter an error, a screenshot often contains more context than copying the error text alone.
Get plain-language explanations and fixes for error messages in screenshots.
I'm seeing this error in ${context}.
[Describe or paste the error message/screenshot]
Error details: ${errorDetails}
Please provide:
1. **Plain Language Explanation**: What does this error actually mean?
2. **Likely Causes** (ranked by probability):
- Most likely:
- Also possible:
- Less common:
3. **Step-by-Step Fix**:
- First, try...
- If that doesn't work...
- As a last resort...
4. **Prevention**: How to avoid this error in the future
5. **Red Flags**: When this error might indicate a more serious problem
Image Generation Prompts
Generating images from text descriptions is an art form. The more specific and structured your prompt, the closer the result will match your vision.
The Anatomy of an Image Prompt
Effective image generation prompts have several components:
Subject: what is the main focus of the image?
Example: A golden retriever playing in autumn leaves
Style: what artistic style or medium?
Example: Watercolor painting, digital art, photorealistic
Composition: how is the scene arranged?
Example: Close-up portrait, wide landscape, bird's eye view
Lighting: what's the light source and quality?
Example: Soft morning light, dramatic shadows, neon glow
Mood: what feeling should it evoke?
Example: Peaceful, energetic, mysterious, nostalgic
Details: specific elements to include or avoid
Example: Include: flowers. Avoid: text, watermarks
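These components can be assembled mechanically into a single prompt string. A minimal sketch, where the labels and values are illustrative and the comma separator is a stylistic choice:

```python
def build_image_prompt(components: dict[str, str]) -> str:
    """Join labeled components into a single generation prompt."""
    return ", ".join(f"{label}: {value}" for label, value in components.items())

prompt = build_image_prompt({
    "Subject": "A golden retriever playing in autumn leaves",
    "Style": "watercolor painting",
    "Composition": "close-up portrait",
    "Lighting": "soft morning light",
    "Mood": "peaceful",
    "Avoid": "text, watermarks",
})
```

Keeping the components in a structured dict makes it easy to vary one element (say, the lighting) while holding the rest of the prompt constant.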
Basic Image Generation
Use this template to create detailed, specific image generation prompts.
Create an image with these specifications:
**Subject**: ${subject}
**Style**: ${style}
**Medium**: ${medium} (e.g., oil painting, digital art, photograph)
**Composition**:
- Framing: ${framing} (close-up, medium shot, wide angle)
- Perspective: ${perspective} (eye level, low angle, overhead)
- Focus: ${focusArea}
**Lighting**:
- Source: ${lightSource}
- Quality: ${lightQuality} (soft, harsh, diffused)
- Time of day: ${timeOfDay}
**Color Palette**: ${colors}
**Mood/Atmosphere**: ${mood}
**Must Include**: ${includeElements}
**Must Avoid**: ${avoidElements}
**Technical**: ${aspectRatio} aspect ratio, high quality
Scene Building
For complex scenes, describe layers from foreground to background.
Build complex scenes by describing what appears in each layer of depth.
Generate a detailed scene:
**Setting**: ${setting}
**Foreground** (closest to viewer):
${foreground}
**Middle Ground** (main action area):
${middleGround}
**Background** (distant elements):
${background}
**Atmospheric Details**:
- Weather/Air: ${weather}
- Lighting: ${lighting}
- Time: ${timeOfDay}
**Style**: ${artisticStyle}
**Mood**: ${mood}
**Color Palette**: ${colors}
Additional details to include: ${additionalDetails}
Audio Prompting
Audio processing opens up transcription, analysis, and understanding of spoken content. The key is providing context about what the audio contains.
Enhanced Transcription
Basic transcription is just the start. With good prompts, you can get speaker identification, timestamps, and domain-specific accuracy.
Get accurate transcriptions with speaker labels, timestamps, and handling of unclear sections.
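One benefit of requesting a fixed timestamped format is that the output becomes trivially machine-readable. A hedged sketch that parses such lines into records (the sample transcript is illustrative):

```python
import re

# Matches lines like "[00:15] **Speaker 2 (Name/Role)**: Their response here."
LINE = re.compile(r"\[(\d{2}:\d{2})\] \*\*(.+?)\*\*: (.*)")

def parse_transcript(text: str) -> list[dict]:
    """Turn timestamped transcript lines into structured records."""
    records = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            records.append({"time": m.group(1),
                            "speaker": m.group(2),
                            "text": m.group(3)})
    return records

sample = ("[00:00] **Alice (Host)**: Welcome back.\n"
          "[00:15] **Bob (Guest)**: Thanks for having me.")
records = parse_transcript(sample)
```

Non-matching lines (such as `[laughter]` annotations on their own line) are simply skipped here; a fuller parser might collect them separately.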
Transcribe this audio recording.
**Context**: ${recordingType} (meeting, interview, podcast, lecture, etc.)
**Expected Speakers**: ${speakerCount} (${speakerRoles})
**Domain**: ${domain} (technical terms to expect: ${technicalTerms})
**Output Format**:
[00:00] **Speaker 1 (Name/Role)**: Transcribed text here.
[00:15] **Speaker 2 (Name/Role)**: Their response here.
**Instructions**:
- Include timestamps at natural breaks (every 30-60 seconds or at speaker changes)
- Mark unclear sections as [inaudible] or [unclear: best guess?]
- Note non-speech sounds in brackets: [laughter], [phone ringing], [long pause]
- Preserve filler words only if they're meaningful (um, uh can be removed)
- Flag any action items or decisions with → symbol
Audio description: ${audioDescription}
Audio Content Analysis
Beyond transcription, AI can analyze the content, tone, and key moments in audio.
Get a comprehensive analysis of audio content including summary, key moments, and sentiment.
Analyze this audio recording:
Audio description: ${audioDescription}
Provide:
**1. Executive Summary** (2-3 sentences)
What is this recording about? What's the main takeaway?
**2. Speakers**
- How many distinct speakers?
- Characteristics (if discernible): tone, speaking style, expertise level
**3. Content Breakdown**
- Main topics discussed (with approximate timestamps)
- Key points made
- Questions raised
**4. Emotional Analysis**
- Overall tone (formal, casual, tense, friendly)
- Notable emotional moments
- Energy level throughout
**5. Actionable Items**
- Decisions made
- Action items mentioned
- Follow-ups needed
**6. Notable Quotes**
Pull out 2-3 significant quotes with timestamps
**7. Audio Quality**
- Overall clarity
- Any issues (background noise, interruptions, technical problems)Video Prompting
Video combines visual and audio analysis over time. The challenge is guiding the AI to focus on the relevant aspects across the entire duration.
Video Understanding
Get a structured breakdown of video content including timeline, visual elements, and key moments.
Analyze this video: ${videoDescription}
Provide a comprehensive analysis:
**1. Overview** (2-3 sentences)
What is this video about? What's the main message or purpose?
**2. Timeline of Key Moments**
| Timestamp | Event | Significance |
|-----------|-------|--------------|
| 0:00 | ... | ... |
**3. Visual Analysis**
- Setting/Location: Where does this take place?
- People: Who appears? What are they doing?
- Objects: Key items or props featured
- Visual style: Quality, editing, graphics used
**4. Audio Analysis**
- Speech: Main points made (if any dialogue)
- Music: Type, mood, how it's used
- Sound effects: Notable audio elements
**5. Production Quality**
- Video quality and editing
- Pacing and structure
- Effectiveness for its purpose
**6. Target Audience**
Who is this video made for? Does it serve them well?
**7. Key Takeaways**
What should a viewer remember from this video?
Video Content Extraction
For specific information extraction from videos, be precise about what you need.
Extract specific information from videos with timestamps and structured output.
Extract specific information from this video:
Video type: ${videoType}
Video description: ${videoDescription}
**Information to Extract**:
1. ${extractItem1}
2. ${extractItem2}
3. ${extractItem3}
**Output Format**:
{
"video_summary": "Brief description",
"duration": "estimated length",
"extracted_data": [
{
"timestamp": "MM:SS",
"item": "What was found",
"details": "Additional context",
"confidence": "high/medium/low"
}
],
"items_not_found": ["List anything requested but not present"],
"additional_observations": "Anything relevant not explicitly requested"
}
Multimodal Combinations
The real power of multimodal AI emerges when you combine different types of input. These combinations enable analysis that would be impossible with any single modality.
Image + Text Verification
Check if images and their descriptions match—essential for e-commerce, content moderation, and quality assurance.
Verify that images accurately represent their text descriptions and vice versa.
Analyze this image and its accompanying text for alignment:
**Image**: ${imageDescription}
**Text Description**: "${textDescription}"
Evaluate:
**1. Accuracy Match**
- Does the image show what the text describes?
- Score: [1-10] with explanation
**2. Text Claims vs. Visual Reality**
| Claim in Text | Visible in Image? | Notes |
|---------------|-------------------|-------|
| ... | Yes/No/Partial | ... |
**3. Visual Elements Not Mentioned**
What's visible in the image but not described in the text?
**4. Text Claims Not Visible**
What's described in text but can't be verified from the image?
**5. Recommendations**
- For the text: [improvements to match image]
- For the image: [improvements to match text]
**6. Overall Assessment**
Is this image-text pair trustworthy for ${purpose}?
Screenshot + Code Debugging
One of the most powerful combinations for developers: seeing the visual bug alongside the code.
Debug UI issues by analyzing both the visual output and the source code together.
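Assembling this combined prompt is plain string composition; the only subtlety is fencing the code with its language tag so the model reads it as code. A sketch (all field values are illustrative):

```python
def build_debug_prompt(screenshot_desc: str, bug: str, expected: str,
                       language: str, code: str) -> str:
    """Compose a UI-debugging prompt pairing the visual symptom with the code."""
    return (
        "I have a UI bug. Here's what I see and my code:\n"
        f"**Screenshot Description**: {screenshot_desc}\n"
        f"**What's Wrong**: {bug}\n"
        f"**Expected Behavior**: {expected}\n"
        "**Relevant Code**:\n"
        f"```{language}\n{code}\n```"
    )

prompt = build_debug_prompt(
    "The submit button overlaps the footer on mobile widths",
    "Button renders on top of the footer",
    "Button should sit above the footer with 16px spacing",
    "css",
    ".submit { position: absolute; bottom: 0; }",
)
```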
I have a UI bug. Here's what I see and my code:
**Screenshot Description**: ${screenshotDescription}
**What's Wrong**: ${bugDescription}
**Expected Behavior**: ${expectedBehavior}
**Relevant Code**:
```${language}
${code}
```
Please help me:
**1. Root Cause Analysis**
- What in the code is causing this visual issue?
- Which specific line(s) are responsible?
**2. Explanation**
- Why does this code produce this visual result?
- What's the underlying mechanism?
**3. The Fix**
```${language}
// Corrected code here
```
**4. Prevention**
- How to avoid this type of bug in the future
- Any related issues to check for
Multi-Image Decision Making
When choosing between options, structured comparison helps make better decisions.
Compare multiple images systematically against your criteria to make informed decisions.
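The weighted-score step in a comparison like this is simple arithmetic you can reproduce or verify yourself. A sketch, with illustrative criteria and weights (high = 3, medium = 2, low = 1):

```python
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-criterion scores (each 1-10)."""
    total_weight = sum(weights.values())
    return sum(scores[c] * weights[c] for c in scores) / total_weight

weights = {"readability": 3.0, "contrast": 2.0, "brand_fit": 1.0}
option_a = {"readability": 8, "contrast": 6, "brand_fit": 9}
score_a = weighted_score(option_a, weights)  # (8*3 + 6*2 + 9*1) / 6 = 7.5
```

Recomputing the model's weighted totals this way is a cheap guard against arithmetic slips in the AI's comparison matrix.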
I'm choosing between these options for ${purpose}:
**Option A**: ${optionA}
**Option B**: ${optionB}
**Option C**: ${optionC}
**My Criteria** (in order of importance):
1. ${criterion1} (weight: high)
2. ${criterion2} (weight: medium)
3. ${criterion3} (weight: low)
Provide:
**Comparison Matrix**
| Criterion | Option A | Option B | Option C |
|-----------|----------|----------|----------|
| ${criterion1} | Score + notes | ... | ... |
| ${criterion2} | ... | ... | ... |
| ${criterion3} | ... | ... | ... |
**Weighted Scores**
- Option A: X/10
- Option B: X/10
- Option C: X/10
**Recommendation**
Based on your stated priorities, I recommend [Option] because...
**Caveats**
- If [condition], consider [alternative] instead
- Watch out for [potential issue]
Best Practices for Multimodal Prompts
Getting great results from multimodal AI requires understanding both its capabilities and limitations.
What Makes Multimodal Prompts Effective
Provide context: tell the model what the media is and why you're analyzing it.
Example: "This is a product photo for our e-commerce site..."
Be specific: ask about particular elements rather than general impressions.
Example: "Focus on the pricing table in the top-right corner"
Use spatial language: point to specific areas of the image.
Example: "In the bottom-left quadrant..."
State your goal: explain what you'll use the analysis for.
Example: "I need to decide if this image works for our mobile app"
Common Pitfalls to Avoid
Small details: models may miss fine details, especially in low-resolution images.
Example: don't ask about 8pt text in a compressed screenshot.
Text recognition: handwriting, unusual fonts, and complex layouts can cause errors.
Example: verify extracted text from receipts and forms.
Content restrictions: models have restrictions on certain types of content.
Example: they won't identify specific individuals or analyze inappropriate content.
Hallucination: always verify critical information extracted from media.
Example: double-check numbers, dates, and names from document extraction.
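Some of that verification can be automated. For document extraction, a simple arithmetic cross-check catches many misread digits (the field names follow the extraction template earlier in this chapter; the sample values are illustrative):

```python
def totals_consistent(extraction: dict, tolerance: float = 0.01) -> bool:
    """Sanity-check an extracted document: subtotal + tax should equal total."""
    t = extraction["totals"]
    expected = float(t["subtotal"]) + float(t["tax"])
    return abs(expected - float(t["total"])) <= tolerance

good = {"totals": {"subtotal": "100.00", "tax": "8.25", "total": "108.25"}}
bad = {"totals": {"subtotal": "100.00", "tax": "8.25", "total": "118.25"}}  # misread digit
```

An extraction that fails a check like this (or reports "low" confidence) is a good candidate for human review rather than automatic acceptance.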
Handling Limitations Gracefully
This prompt explicitly handles cases where the model can't see clearly or is uncertain.
Analyze this image: ${imageDescription}
**Instructions for Handling Uncertainty**:
IF YOU CAN'T SEE SOMETHING CLEARLY:
- Don't guess or make up details
- Say: "I can see [what's visible] but cannot clearly make out [unclear element]"
- Suggest what additional information would help
IF CONTENT SEEMS RESTRICTED:
- Explain what you can and cannot analyze
- Focus on permitted aspects of the analysis
IF ASKED ABOUT PEOPLE:
- Describe actions, positions, and general characteristics
- Do not attempt to identify specific individuals
- Focus on: number of people, activities, expressions, attire
**Your Analysis**:
[Proceed with analysis, applying these guidelines]