📚 Course
Intermediate
~2–3h

Multimodal AI

Vision, Voice & Video — AI That Understands the World Beyond Text

The first wave of AI was about text. The second wave is about everything else. GPT-4o can read your whiteboard photo, analyze a medical scan, and describe a screenshot. ElevenLabs can clone a voice from 30 seconds of audio. Runway generates 10-second videos from a text prompt. This course explains how these systems actually work — and how to use them effectively and responsibly.
9 Modules

TL;DR:

Multimodal AI processes multiple input types — image, audio, video, text — within a single model. Vision models like GPT-4o and Gemini convert images into tokens before reasoning about them. Voice AI tools like Whisper and ElevenLabs handle transcription and synthesis. Video generation (Runway, Kling) is powerful but expensive and imperfect — especially for hands, physics, and consistency. Each modality introduces specific risks around deepfakes, hallucination, and copyright that you need to understand before deploying.

What “multimodal” actually means

A multimodal AI model can process more than one type of input — or produce more than one type of output. This is a significant departure from early language models that only handled text.

Vision (Image → Text)

The model receives an image and produces a text response. Use cases: document analysis, screenshot description, chart interpretation, medical imaging assistance, product photo QA.

Audio (Speech ↔ Text)

Speech-to-text (transcription with Whisper) or text-to-speech (voice synthesis with ElevenLabs, Azure TTS). Some models like GPT-4o can handle audio natively end-to-end.

Video (Text/Image → Video)

Generate short video clips from text prompts or reference images. Runway Gen-4, Kling 3.0, and Google Veo are the current leaders. Typical output: 5–10 seconds at 720p–1080p.

How vision models process images

Vision models don't “see” in the way humans do. They convert images into numerical tokens, then reason about those tokens alongside your text. Understanding this pipeline helps you write better prompts and interpret model failures.

1. Image Encoding

A vision encoder (typically a Vision Transformer, or ViT) slices the image into a grid of patches — usually 14×14 pixels each. Each patch is mapped to an embedding vector. GPT-4o's low-resolution mode generates 65 image tokens (a 64-tile grid + 1 global token). In high-resolution mode, an image can generate up to 6,240 tokens, which is why detailed images are more expensive and slower to process.
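The patch arithmetic above can be sketched directly. This is an illustrative calculation only: the 14×14 patch size comes from the text, but real encoders also resize images and may pool or add special tokens, so actual token counts differ by model.

```python
import math

def patch_token_count(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens for an image, counting partial patches as full."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows

# A 224x224 image with 14-pixel patches yields a 16x16 grid of patches:
print(patch_token_count(224, 224))  # 256 patch tokens before any pooling
```

This is why a high-resolution photo costs far more than a thumbnail: token count grows with the number of patch tiles, not with file size.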

2. Cross-Modal Attention

Image tokens are concatenated with text tokens and fed into the transformer's attention layers. The model learns to attend to relevant image regions when answering text questions — e.g., when asked “what color is the car?”, attention focuses on the car patches. This is why context matters: your text prompt shapes which image regions get emphasized.
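A toy sketch of that mechanism: one text-token query scores a concatenated sequence of image and text tokens via scaled dot-product attention. The 3-dimensional vectors and their values here are made up for illustration; real models use learned projections over thousands of dimensions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product scores of one query against every key in the sequence."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Two "image" tokens followed by one "text" token in a single sequence.
image_tokens = [[1.0, 0.2, 0.0], [0.1, 0.9, 0.3]]
text_tokens = [[0.9, 0.1, 0.1]]
sequence = image_tokens + text_tokens

weights = attention_weights(query=[1.0, 0.0, 0.0], keys=sequence)
# Weights sum to 1; this particular query attends most to the first image token.
```

The key point: image and text tokens sit in one sequence, so the same attention machinery decides which image regions matter for a given text query.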

3. Text Generation

The decoder generates tokens autoregressively, drawing on both image and text context. The model has no separate “image understanding” module — it's the same language model, now operating over a richer token sequence. This means text-based prompting techniques (role assignment, step-by-step reasoning, specificity) all apply equally to vision tasks.

| Detail Level | Tokens Used (GPT-4o) | Best For | Cost Signal |
|---|---|---|---|
| Low resolution | 65 tokens | General description, simple content detection | Low |
| High resolution (auto) | 129–6,240 tokens | Reading text in images, fine detail analysis, OCR | High |

Source: OpenAI Vision documentation (2025). Token counts apply to GPT-4o; other models use different encoding schemes.
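The cost gap is easy to quantify. The token counts below come from the table above; the per-token price is a placeholder assumption, so check your provider's current pricing before relying on this.

```python
ASSUMED_PRICE_PER_MTOK = 2.50  # hypothetical $ per 1M input tokens -- verify current pricing

def image_cost(tokens: int, price_per_mtok: float = ASSUMED_PRICE_PER_MTOK) -> float:
    """Dollar cost of one image's worth of tokens at a given input price."""
    return tokens / 1_000_000 * price_per_mtok

low = image_cost(65)      # low-resolution mode
high = image_cost(6_240)  # worst-case high-resolution mode
# At any price, the worst-case high-res image costs 96x the low-res one (6,240 / 65).
```

For batch classification jobs over thousands of images, that 96× ratio is the difference between a trivial bill and a significant one.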

Major vision AI platforms compared

All frontier models now support vision input. The differences lie in image token limits, document understanding quality, and context window size.

| Model | Max Images | Context Window | Strengths | Limitations |
|---|---|---|---|---|
| GPT-4o | ≤20 per call | 128K tokens | OCR, instruction-following, broad task coverage | Can miss fine spatial detail; image token cost adds up |
| Gemini 1.5 Pro | Up to 3,000 frames (video) | 1M tokens | Long video & multi-image reasoning, native audio | Higher latency on large payloads; complex pricing tiers |
| Claude 3.7 Sonnet | ≤20 per call | 200K tokens | Document layout, table extraction, careful reasoning | No image generation; no video input |
| Llama 3.2 Vision (local) | 1 per call | 128K tokens | Runs locally, full privacy, no API cost | Single image only; lower accuracy vs frontier models |

Vision prompting strategies

Because image tokens are processed alongside text tokens, your text prompt directly shapes what the model pays attention to. These strategies consistently improve vision output quality.

Be specific about the task

Weak

"What is this?"

Stronger

"Extract all text from this receipt and return it as a JSON object with fields: store_name, date, total_amount, line_items."

Vague questions trigger generic descriptions. Specific task framing triggers structured reasoning.

Name the region of interest

Weak

"Describe the image."

Stronger

"Focus on the bottom-right quadrant. What numbers appear in the table in that area?"

Models attend more strongly to regions you explicitly reference in your prompt.

Use chain-of-thought for visual analysis

Weak

"Is this graph showing growth?"

Stronger

"Look at the graph. First identify the axes and units. Then describe the trend. Then give a yes/no answer with one supporting data point."

Step-by-step reasoning reduces hallucination on charts, diagrams, and ambiguous images.

Provide context the image lacks

Weak

(Upload a photo of a document in German without context)

Stronger

"This is a German tax document from 2024. Extract the Steueridentifikationsnummer and the Bemessungsgrundlage."

Models perform better when they know the domain, language, and what matters to you.

Ready-to-use vision prompts

You are a precise document extraction assistant.

Image: [attach image]

Task: Extract all text from this document.
Format the output as structured JSON with these fields:
- document_type (invoice / contract / form / other)
- date (ISO 8601 if present, else null)
- key_fields (object with any named fields you find)
- full_text (all text, preserving line breaks)

If any field is unclear or absent, use null. Do not infer or hallucinate values.

Analyze the chart or graph in the image.

Step 1 — Identify: What type of chart is this? What are the axes, units, and time range?
Step 2 — Describe: What is the main trend? Are there notable peaks, dips, or inflection points?
Step 3 — Quantify: Quote specific values from the chart where readable.
Step 4 — Conclude: In one sentence, what is the key takeaway?

Flag any values that are unclear or estimated.
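Prompts like these can also be sent programmatically. A minimal sketch using the OpenAI Python SDK's chat completions vision format: the `invoice.png` filename is a hypothetical placeholder, and you should verify the message shape and model name against the current API documentation.

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Inline an image as a base64 data URL, the form the vision API accepts."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode()

EXTRACTION_PROMPT = (
    "You are a precise document extraction assistant. Extract all text from "
    "this document as structured JSON with fields: document_type, date, "
    "key_fields, full_text. Use null for unclear or absent fields."
)

if __name__ == "__main__":
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    with open("invoice.png", "rb") as f:  # hypothetical input file
        url = to_data_url(f.read())
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": EXTRACTION_PROMPT},
            {"type": "image_url", "image_url": {"url": url, "detail": "high"}},
        ]}],
    )
    print(resp.choices[0].message.content)
```

Note the `detail` parameter: it selects between the low- and high-resolution token modes discussed earlier.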

Voice AI: transcription & synthesis

Voice AI splits into two distinct tasks: speech-to-text (transcription) and text-to-speech (synthesis). Both have reached production quality — with significant implications for accessibility, content creation, and fraud.

Speech-to-Text (Transcription)

Whisper (OpenAI, open-source) is the current benchmark for offline transcription — supports 99 languages, handles accents well, runs locally for free. API access via OpenAI is $0.006/minute.

Assembly AI, Deepgram, and Rev.ai offer real-time streaming transcription with speaker diarization (who said what). These are the standard choice for meeting transcription and podcast workflows.
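A minimal transcription sketch via OpenAI's hosted Whisper endpoint. The `meeting.mp3` filename is a hypothetical placeholder; the cost helper uses the $0.006/minute rate quoted above, which may change.

```python
def transcription_cost(minutes: float, rate_per_min: float = 0.006) -> float:
    """API cost at the $0.006/minute rate quoted above."""
    return round(minutes * rate_per_min, 4)

if __name__ == "__main__":
    # Requires `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    with open("meeting.mp3", "rb") as audio:  # hypothetical recording
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=audio,
        )
    print(transcript.text)
    print(f"Estimated cost for a 60-minute meeting: ${transcription_cost(60)}")
```

For fully local (and free) transcription, the open-source Whisper weights or whisper.cpp replace the API call entirely.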

Accuracy benchmark (English, clean audio)

Whisper Large v3: 2.7% WER
Google Speech-to-Text v2: 3.1% WER
Azure Speech: 3.4% WER

WER = Word Error Rate. Lower is better. Source: Hugging Face Open ASR Leaderboard 2025.
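WER itself is just word-level edit distance divided by the reference length. A self-contained sketch, useful for spot-checking a transcript against what you actually said:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word in a four-word reference -> 25% WER.
print(word_error_rate("the cat sat down", "the cat sits down"))  # 0.25
```

A 2.7% WER means roughly one wrong word per 37 words of clean English speech; expect worse on noisy audio, heavy accents, or jargon.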

Text-to-Speech (Synthesis)

ElevenLabs is the industry standard for natural-sounding synthesis and voice cloning. A 30-second voice sample is enough to generate unlimited speech in that voice. Cost: under $1 per 30 seconds of output — compared to $50–$200/hour for a professional voice actor.
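A synthesis-request sketch using only the standard library. The endpoint path, `xi-api-key` header, and `model_id` field reflect ElevenLabs' public REST API as commonly documented, but treat all of them as assumptions and verify against the current API reference; the voice ID and key are placeholders.

```python
import json
import urllib.request

def build_tts_request(text: str, voice_id: str, api_key: str) -> urllib.request.Request:
    """Assemble the HTTP request for a text-to-speech call (fields assumed,
    per ElevenLabs' REST API -- verify against current docs)."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": "eleven_multilingual_v2"})
    return urllib.request.Request(
        url,
        data=body.encode(),
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_tts_request("Hello, world.", voice_id="YOUR_VOICE_ID",
                            api_key="YOUR_API_KEY")
    with urllib.request.urlopen(req) as resp:
        with open("output.mp3", "wb") as out:
            out.write(resp.read())  # response body is the audio bytes
```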

A 2025 study by Queen Mary University of London found AI-synthesized voices were indistinguishable from human voices 60% of the time in blind listening tests — even for trained listeners.

Practical voice AI workflows

  1. Meeting transcription: Record → Whisper or Otter.ai → GPT-4o summary with action items. Free with local Whisper; ~$10/month with cloud tools.
  2. Podcast production: ElevenLabs Dubbing for translation into 29 languages, preserving original voice characteristics. Used by major podcast networks.
  3. Accessibility: Real-time captions for video calls using Azure Cognitive Services or Google Live Caption API.
  4. Content voiceover: ElevenLabs or OpenAI TTS for consistent narration. Always disclose AI voice usage to your audience.

Video generation: the state of the art

AI video generation advanced more in 2025 than in the previous five years combined. The market is consolidating around a few clear leaders, each with distinct strengths.

| Tool | Max Length | Resolution | Input Modes | Cost (approx.) | Best For |
|---|---|---|---|---|---|
| Runway Gen-4 | 10s | 1080p | Text, image, video | $0.05/second | Cinematic shots, consistent characters |
| Kling 3.0 | 10s | 1080p | Text, image | $0.04/second | Realistic motion, native multimodal training |
| Google Veo 2 | 8s | 1080p | Text, image | Vertex AI pricing | Physics simulation, natural scenes |
| Pika Labs 2.0 | 5s | 720p | Text, image | $15/mo (150 videos) | Quick iterations, beginner-friendly |

What current video AI still gets wrong

Hands & fingers

Finger count, grip, and hand anatomy remain unreliable. A common tell in AI video.

Text in video

Rendered text (signs, labels, credits) is frequently garbled or changes between frames.

Physics & continuity

Objects pass through each other, gravity behaves incorrectly, and props appear/disappear between cuts.

Long-form coherence

Consistency across multiple clips (same character, same lighting, same environment) requires careful prompt engineering and tools like Runway's Act One.

Real-world multimodal workflows

The highest-value applications combine multiple modalities in sequence. Here are four production-tested workflows used by teams today.

Invoice & receipt processing pipeline

  1. Upload invoice image to GPT-4o with extraction prompt
  2. Receive structured JSON (vendor, date, amount, line items)
  3. Validate JSON against business rules (e.g., currency, VAT format)
  4. Push to accounting system via API

Processing time: ~3 seconds vs 4–10 minutes manual. Accuracy: 96–98% on clean scans. Needs human review for poor-quality images.
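Step 3 of the pipeline, validating extracted JSON before it touches the accounting system, can be sketched as below. The field names and rules here are illustrative assumptions; adapt them to your own schema.

```python
import re

ALLOWED_CURRENCIES = {"EUR", "USD", "GBP"}
VAT_ID_PATTERN = re.compile(r"^[A-Z]{2}[A-Z0-9]{8,12}$")  # rough EU-style format

def validate_invoice(data: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if data.get("currency") not in ALLOWED_CURRENCIES:
        problems.append(f"unexpected currency: {data.get('currency')!r}")
    if not VAT_ID_PATTERN.match(data.get("vat_id") or ""):
        problems.append("VAT ID missing or malformed")
    total = data.get("total_amount")
    if not isinstance(total, (int, float)) or total <= 0:
        problems.append("total_amount must be a positive number")
    return problems

record = {"currency": "EUR", "vat_id": "DE123456789", "total_amount": 119.0}
print(validate_invoice(record))  # [] -- clean record passes
```

Rule-based validation is what lets you trust a 96–98% accurate extractor: anything that fails goes to the human review queue rather than straight into the books.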

Meeting intelligence pipeline

  1. Record Zoom/Teams call via Grain, Fireflies, or similar
  2. Whisper v3 transcribes with speaker diarization
  3. GPT-4o extracts decisions, action items, owners, deadlines
  4. Summary pushed to Notion, Slack, or email automatically

Used by teams ranging from 5 to 5,000 people. Key benefit: searchable meeting archive. Key risk: participants must be informed they are being recorded and processed.

Social media video production

  1. Write a short script (GPT-4o or Claude)
  2. Generate voiceover with ElevenLabs (consistent AI voice)
  3. Generate B-roll clips with Runway Gen-4 or Kling
  4. Assemble in CapCut, Premiere, or Descript
  5. Add captions via WhisperKit or Descript auto-captions

Production time: 2–4 hours vs 2–3 days traditional. Cost: ~$20–50/video. Disclosure required by EU AI Act and FTC guidelines for sponsored content.

Product image QA at scale

  1. Batch-upload product photos to vision API
  2. Prompt: "Check for defects, correct labeling, and brand compliance. Return pass/fail with reason."
  3. Flag failures for human review queue
  4. Log results to QA dashboard

Reduces QA bottleneck in e-commerce and manufacturing. Not a replacement for regulated inspection (medical devices, food safety) — use as a triage filter only.
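Step 3 of that workflow, routing model verdicts into a review queue, can be sketched as below. The "pass/fail with reason" response shape comes from the prompt in step 2; the exact parsing convention is an assumption.

```python
def parse_verdict(response: str) -> tuple[bool, str]:
    """Split a 'Pass: ...' / 'Fail: ...' model response into (passed, reason)."""
    verdict, _, reason = response.partition(":")
    return verdict.strip().lower() == "pass", reason.strip()

def triage(results: dict[str, str]) -> list[str]:
    """Return the image IDs whose verdicts need human review."""
    return [img for img, resp in results.items()
            if not parse_verdict(resp)[0]]

batch = {
    "sku-001.jpg": "Pass: label matches catalog entry",
    "sku-002.jpg": "Fail: barcode partially obscured",
}
print(triage(batch))  # ['sku-002.jpg']
```

Because model output can drift from the requested format, anything that fails to parse should also be routed to review rather than silently passed.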

Risks & Responsible Use

Know these before you go further.

Deepfakes & non-consensual synthetic media

Voice cloning from 30 seconds of audio and face-swap video generation are now accessible to anyone with a credit card. These technologies are already used for fraud (CEO voice impersonation), political disinformation, and non-consensual intimate imagery.

What this means for you

Only clone voices or likenesses with explicit written consent. Apply watermarking or metadata standards (C2PA) to AI-generated media you publish. Check local laws — Germany (§ 201a StGB), UK, and several US states have enacted synthetic media legislation.

Vision hallucination & over-trust

Vision models confidently describe things that aren't there — especially in low-resolution images, partially obscured content, and out-of-distribution scenarios (unusual chart types, non-Latin scripts, medical imagery outside training distribution).

What this means for you

Always validate critical vision outputs against ground truth. Use low-resolution mode for classification tasks; high-resolution for reading text. Never use vision AI as sole input for safety-critical or legal decisions.

Copyright in AI-generated visual content

AI-generated images and video lack copyright protection in most jurisdictions (Thaler v. Perlmutter, 2025). However, the training data used to create these models — and the outputs they produce — may infringe on existing copyrights in ways that remain legally untested.

What this means for you

Review your tool's terms: Runway grants you commercial rights to outputs. Getty Images and Shutterstock offer AI generators trained on licensed data for higher commercial safety. Avoid prompts that explicitly reference specific copyrighted characters or styles.

Disclosure & transparency obligations

EU AI Act Article 50 requires disclosure when AI interacts with humans in voice or video (chatbots, AI avatars). US FTC guidelines require disclosure for AI-generated sponsored content. Failure to disclose synthetic voice or video in advertising is an emerging regulatory risk.

What this means for you

Add visible "AI-generated" labels to synthetic media. For voice AI in customer-facing products: announce "This is an AI assistant" at the start of interactions. Keep records of which content was AI-generated for at least 3 years.


Your 30-minute multimodal starter challenge

  1. Vision: Upload a screenshot, photo of a handwritten note, or product image to ChatGPT or Claude. Use the document extraction template above. Verify the output against the original.
  2. Transcription: Record a 2-minute voice memo. Run it through Whisper (free locally with whisper.cpp, or via the OpenAI API). Compare the transcript to what you actually said.
  3. Video: Sign up for the Runway or Kling free tier. Generate a 5-second clip from a text prompt. Note where it fails: hands? physics? text?
Reflect: For each modality — what would you trust this for in a professional context? What would you never use it for without human review?

Key Insights: What You've Learned

1. Multimodal AI processes images, audio, and video natively within a single transformer — vision models convert images into tokens (65 in low-res, up to 6,240 in high-res for GPT-4o) and your text prompt shapes which image regions the model attends to. Whisper achieves ~2.7% word error rate on English; ElevenLabs voice synthesis is indistinguishable from human voices 60% of the time (QMUL, 2025).

2. AI video generation (Runway Gen-4, Kling 3.0, Google Veo 2) is production-ready for short cinematic clips but still fails on hands, text in video, and long-form consistency. OpenAI Sora was discontinued in early 2026. Runway reported $300M ARR and a $5.3B valuation as of February 2026.

3. The EU AI Act (Article 50) requires disclosure whenever AI interacts with humans via voice or text in ways that could be mistaken for human interaction. Voice cloning without consent is illegal in multiple jurisdictions. Always label AI-generated media and keep records for at least 3 years.