Multimodal AI
Vision, Voice & Video — AI That Understands the World Beyond Text
TL;DR:
Multimodal AI processes multiple input types — image, audio, video, text — within a single model. Vision models like GPT-4o and Gemini convert images into tokens before reasoning about them. Voice AI tools like Whisper and ElevenLabs handle transcription and synthesis. Video generation (Runway, Kling) is powerful but expensive and imperfect — especially for hands, physics, and consistency. Each modality introduces specific risks around deepfakes, hallucination, and copyright that you need to understand before deploying.
What “multimodal” actually means
A multimodal AI model can process more than one type of input — or produce more than one type of output. This is a significant departure from early language models that only handled text.
Vision (Image → Text)
The model receives an image and produces a text response. Use cases: document analysis, screenshot description, chart interpretation, medical imaging assistance, product photo QA.
Audio (Speech ↔ Text)
Speech-to-text (transcription with Whisper) or text-to-speech (voice synthesis with ElevenLabs, Azure TTS). Some models like GPT-4o can handle audio natively end-to-end.
Video (Text/Image → Video)
Generate short video clips from text prompts or reference images. Runway Gen-4, Kling 3.0, and Google Veo are the current leaders. Typical output: 5–10 seconds at 720p–1080p.
How vision models process images
Vision models don't “see” in the way humans do. They convert images into numerical tokens, then reason about those tokens alongside your text. Understanding this pipeline helps you write better prompts and interpret model failures.
Image Encoding
A vision encoder (typically a Vision Transformer, or ViT) slices the image into a grid of patches — usually 14×14 pixels each. Each patch is mapped to an embedding vector. GPT-4o's low-resolution mode generates 65 image tokens (a 64-tile grid + 1 global token). In high-resolution mode, an image can generate up to 6,240 tokens, which is why detailed images are more expensive and slower to process.
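The patch grid is what determines how many vision tokens an image produces. A minimal sketch of the generic ViT arithmetic (illustrative only — it does not reproduce GPT-4o's exact tiling scheme):

```python
def patch_token_count(width: int, height: int, patch: int = 14) -> int:
    """Number of ViT patch tokens for an image: one token per patch,
    with partial patches at the edges rounded up."""
    cols = -(-width // patch)   # ceil(width / patch)
    rows = -(-height // patch)  # ceil(height / patch)
    return cols * rows

# A 224x224 image with 14-px patches yields a 16x16 grid = 256 patch tokens.
print(patch_token_count(224, 224))  # 256
```

This is why resolution drives cost: doubling both dimensions roughly quadruples the patch count, and every patch becomes a token the model must attend over.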
Cross-Modal Attention
Image tokens are concatenated with text tokens and fed into the transformer's attention layers. The model learns to attend to relevant image regions when answering text questions — e.g., when asked “what color is the car?”, attention focuses on the car patches. This is why context matters: your text prompt shapes which image regions get emphasized.
Text Generation
The decoder generates tokens autoregressively, drawing on both image and text context. The model has no separate “image understanding” module — it's the same language model, now operating over a richer token sequence. This means text-based prompting techniques (role assignment, step-by-step reasoning, specificity) all apply equally to vision tasks.
| Detail Level | Tokens Used (GPT-4o) | Best For | Cost Signal |
|---|---|---|---|
| Low resolution | 65 tokens | General description, simple content detection | Low |
| High resolution (auto) | 129–6,240 tokens | Reading text in images, fine detail analysis, OCR | High |
Source: OpenAI Vision documentation (2025). Token counts apply to GPT-4o; other models use different encoding schemes.
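To see what the token counts above mean in dollars, a back-of-the-envelope calculator (the $2.50 per 1M input tokens price is an assumption for illustration — check your provider's current pricing):

```python
def image_cost_usd(tokens: int, price_per_million: float = 2.50) -> float:
    """Cost of one image's tokens at an assumed input price (USD per 1M tokens)."""
    return tokens * price_per_million / 1_000_000

# Compare a low-detail image (65 tokens) with a worst-case
# high-detail image (6,240 tokens) at the assumed price:
print(image_cost_usd(65), image_cost_usd(6240))
```

A single image is cheap either way, but at batch scale (thousands of high-detail images per day) the roughly 100x gap between detail levels dominates the bill.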
Major vision AI platforms compared
All frontier models now support vision input. The differences lie in image token limits, document understanding quality, and context window size.
| Model | Max Images | Context Window | Strengths | Limitations |
|---|---|---|---|---|
| GPT-4o | ≤20 per call | 128K tokens | OCR, instruction-following, broad task coverage | Can miss fine spatial detail; image token cost adds up |
| Gemini 1.5 Pro | Up to 3,000 frames (video) | 1M tokens | Long video & multi-image reasoning, native audio | Higher latency on large payloads; pricing tiers complex |
| Claude 3.7 Sonnet | ≤20 per call | 200K tokens | Document layout, table extraction, careful reasoning | No image generation; no video input |
| Llama 3.2 Vision (local) | 1 per call | 128K tokens | Runs locally, full privacy, no API cost | Single image only; lower accuracy vs frontier models |
Vision prompting strategies
Because image tokens are processed alongside text tokens, your text prompt directly shapes what the model pays attention to. These strategies consistently improve vision output quality.
Be specific about the task
Weak: "What is this?"
Better: "Extract all text from this receipt and return it as a JSON object with fields: store_name, date, total_amount, line_items."
Vague questions trigger generic descriptions. Specific task framing triggers structured reasoning.
Name the region of interest
Weak: "Describe the image."
Better: "Focus on the bottom-right quadrant. What numbers appear in the table in that area?"
Models attend more strongly to regions you explicitly reference in your prompt.
Use chain-of-thought for visual analysis
Weak: "Is this graph showing growth?"
Better: "Look at the graph. First identify the axes and units. Then describe the trend. Then give a yes/no answer with one supporting data point."
Step-by-step reasoning reduces hallucination on charts, diagrams, and ambiguous images.
Provide context the image lacks
Weak: (Uploading a photo of a German document with no accompanying text)
Better: "This is a German tax document from 2024. Extract the Steueridentifikationsnummer and the Bemessungsgrundlage."
Models perform better when they know the domain, language, and what matters to you.
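These strategies translate directly into API calls. A minimal sketch using the OpenAI Python SDK's image-input message format (the model name and `detail` flag follow OpenAI's documented chat-completions API; the file path and prompt are illustrative):

```python
import base64

def build_vision_message(prompt: str, image_path: str, detail: str = "high") -> list:
    """Pair a text prompt with a base64-encoded image in one chat message."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                           "detail": detail}},
        ],
    }]

# Usage (requires the openai package and an API key):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_vision_message(
#         "Focus on the bottom-right quadrant. What numbers appear there?",
#         "screenshot.png",
#     ),
# )
# print(resp.choices[0].message.content)
```

Note that the text part of the message is where the region-naming and task-framing strategies above live — the image bytes are the same either way.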
Ready-to-use vision prompts
You are a precise document extraction assistant.
Image: [attach image]
Task: Extract all text from this document.
Format the output as structured JSON with these fields:
- document_type (invoice / contract / form / other)
- date (ISO 8601 if present, else null)
- key_fields (object with any named fields you find)
- full_text (all text, preserving line breaks)
If any field is unclear or absent, use null. Do not infer or hallucinate values.

Analyze the chart or graph in the image.
Step 1 — Identify: What type of chart is this? What are the axes, units, and time range?
Step 2 — Describe: What is the main trend? Are there notable peaks, dips, or inflection points?
Step 3 — Quantify: Quote specific values from the chart where readable.
Step 4 — Conclude: In one sentence, what is the key takeaway?
Flag any values that are unclear or estimated.

Voice AI: transcription & synthesis
Voice AI splits into two distinct tasks: speech-to-text (transcription) and text-to-speech (synthesis). Both have reached production quality — with significant implications for accessibility, content creation, and fraud.
Speech-to-Text (Transcription)
Whisper (OpenAI, open-source) is the current benchmark for offline transcription: it supports 99 languages, handles accents well, and runs locally for free. API access via OpenAI costs $0.006/minute.
Assembly AI, Deepgram, and Rev.ai offer real-time streaming transcription with speaker diarization (who said what). These are the standard choice for meeting transcription and podcast workflows.
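Local transcription with the open-source whisper package follows the pattern below. Whisper's `transcribe` returns a dict with `text` and timestamped `segments`; the `format_segments` helper is our own illustration for turning those segments into readable minutes:

```python
def format_segments(segments: list) -> str:
    """Render Whisper-style segments as '[MM:SS] text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines)

# Usage with the open-source whisper package (downloads a model on first run):
# import whisper
# model = whisper.load_model("base")
# result = model.transcribe("meeting.mp3")
# print(format_segments(result["segments"]))
```

For speaker diarization (who said what), you would pair this with a diarization tool or use a hosted service such as AssemblyAI or Deepgram, since base Whisper only provides timestamps.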
Accuracy benchmark (English, clean audio)
Measured as WER (Word Error Rate; lower is better), Whisper reaches roughly 2.7% on clean English audio. Source: Hugging Face Open ASR Leaderboard 2025.
Text-to-Speech (Synthesis)
ElevenLabs is the industry standard for natural-sounding synthesis and voice cloning. A 30-second voice sample is enough to generate unlimited speech in that voice. Cost: under $1 per 30 seconds of output — compared to $50–$200/hour for a professional voice actor.
A 2025 study by Queen Mary University of London found AI-synthesized voices were indistinguishable from human voices 60% of the time in blind listening tests — even for trained listeners.
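A synthesis call is a single HTTP POST. A hedged sketch against the ElevenLabs REST endpoint (the URL path and `xi-api-key` header follow ElevenLabs' public API; the `model_id` default and voice ID are illustrative placeholders):

```python
import json
import urllib.request

def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Build the HTTP request for an ElevenLabs text-to-speech call."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = json.dumps({"text": text, "model_id": model_id}).encode()
    return urllib.request.Request(
        url, data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
    )

def synthesize(text: str, voice_id: str, api_key: str) -> bytes:
    """Send the request and return the raw audio bytes."""
    with urllib.request.urlopen(build_tts_request(text, voice_id, api_key)) as resp:
        return resp.read()

# with open("narration.mp3", "wb") as f:
#     f.write(synthesize("Hello, world.", "YOUR_VOICE_ID", "YOUR_API_KEY"))
```

The ease of this call is exactly why the consent and disclosure rules discussed later matter: a working clone of any voice you have a sample of is one POST request away.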
Practical voice AI workflows
1. Meeting transcription: Record → Whisper or Otter.ai → GPT-4o summary with action items. Free with local Whisper; ~$10/month with cloud tools.
2. Podcast production: ElevenLabs Dubbing for translation into 29 languages, preserving original voice characteristics. Used by major podcast networks.
3. Accessibility: Real-time captions for video calls using Azure Cognitive Services or Google Live Caption API.
4. Content voiceover: ElevenLabs or OpenAI TTS for consistent narration. Always disclose AI voice usage to your audience.
Video generation: the state of the art
AI video generation advanced more in 2025 than in the previous five years combined. The market is consolidating around a few clear leaders, each with distinct strengths.
| Tool | Max Length | Resolution | Input Modes | Cost (approx.) | Best For |
|---|---|---|---|---|---|
| Runway Gen-4 | 10s | 1080p | Text, image, video | $0.05/second | Cinematic shots, consistent characters |
| Kling 3.0 | 10s | 1080p | Text, image | $0.04/second | Realistic motion, native multimodal training |
| Google Veo 2 | 8s | 1080p | Text, image | Vertex AI pricing | Physics simulation, natural scenes |
| Pika Labs 2.0 | 5s | 720p | Text, image | $15/mo (150 videos) | Quick iterations, beginner-friendly |
What current video AI still gets wrong
Hands & fingers
Finger count, grip, and hand anatomy remain unreliable. A common tell in AI video.
Text in video
Rendered text (signs, labels, credits) is frequently garbled or changes between frames.
Physics & continuity
Objects pass through each other, gravity behaves incorrectly, and props appear/disappear between cuts.
Long-form coherence
Consistency across multiple clips (same character, same lighting, same environment) requires careful prompt engineering and tools like Runway's Act One.
Real-world multimodal workflows
The highest-value applications combine multiple modalities in sequence. Here are four production-tested workflows used by teams today.
Invoice & receipt processing pipeline
- Upload invoice image to GPT-4o with extraction prompt
- Receive structured JSON (vendor, date, amount, line items)
- Validate JSON against business rules (e.g., currency, VAT format)
- Push to accounting system via API
Processing time: ~3 seconds vs 4–10 minutes manual. Accuracy: 96–98% on clean scans. Needs human review for poor-quality images.
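Step 3 of the pipeline — validating the extracted JSON against business rules — can be a plain rule function. A sketch where the field names mirror the extraction prompt and the rules themselves are illustrative:

```python
REQUIRED = {"vendor", "date", "total_amount", "line_items"}

def validate_invoice(data: dict) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = [f"missing field: {f}" for f in REQUIRED - data.keys()]
    if "total_amount" in data:
        try:
            if float(data["total_amount"]) <= 0:
                errors.append("total_amount must be positive")
        except (TypeError, ValueError):
            errors.append("total_amount is not numeric")
    if "line_items" in data:
        if not isinstance(data["line_items"], list) or not data["line_items"]:
            errors.append("line_items must be a non-empty list")
    return errors

print(validate_invoice({"vendor": "Acme", "date": "2025-04-01",
                        "total_amount": "119.00",
                        "line_items": [{"desc": "Widget"}]}))  # -> []
```

Records with a non-empty error list go to the human review queue rather than straight into the accounting system — this is the cheap guard against vision hallucination on poor scans.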
Meeting intelligence pipeline
- Record Zoom/Teams call via Grain, Fireflies, or similar
- Whisper v3 transcribes with speaker diarization
- GPT-4o extracts decisions, action items, owners, deadlines
- Summary pushed to Notion, Slack, or email automatically
Used by teams ranging from 5 to 5,000 people. Key benefit: searchable meeting archive. Key risk: participants must be informed they are being recorded and processed.
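Step 3 of this pipeline hinges on a tightly scoped extraction prompt. One way to build it (the JSON schema here is illustrative, not a standard):

```python
def action_item_prompt(transcript: str) -> str:
    """Wrap a diarized transcript in a strict extraction prompt for a chat model."""
    return (
        "From the meeting transcript below, extract every decision and action item.\n"
        'Return JSON: {"decisions": [...], "action_items": '
        '[{"task": ..., "owner": ..., "deadline": ...}]}.\n'
        "Use null for unknown owners or deadlines. Do not invent items.\n\n"
        f"Transcript:\n{transcript}"
    )

# The resulting string is sent as the user message to GPT-4o (or any chat model);
# the explicit schema and the "do not invent" instruction keep the output parseable.
```

The same text-prompting principles from the vision section apply here: a named schema and an explicit refusal instruction beat "summarize this meeting" for downstream automation.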
Social media video production
- Write a short script (GPT-4o or Claude)
- Generate voiceover with ElevenLabs (consistent AI voice)
- Generate B-roll clips with Runway Gen-4 or Kling
- Assemble in CapCut, Premiere, or Descript
- Add captions via WhisperKit or Descript auto-captions
Production time: 2–4 hours vs 2–3 days traditional. Cost: ~$20–50/video. Disclosure required by EU AI Act and FTC guidelines for sponsored content.
Product image QA at scale
- Batch-upload product photos to vision API
- Prompt: "Check for defects, correct labeling, and brand compliance. Return pass/fail with reason."
- Flag failures for human review queue
- Log results to QA dashboard
Reduces QA bottleneck in e-commerce and manufacturing. Not a replacement for regulated inspection (medical devices, food safety) — use as a triage filter only.
Risks & Responsible Use
Know these before you go further.
Deepfakes & non-consensual synthetic media
Voice cloning from 30 seconds of audio and face-swap video generation are now accessible to anyone with a credit card. These technologies are already used for fraud (CEO voice impersonation), political disinformation, and non-consensual intimate imagery.
What this means for you
Only clone voices or likenesses with explicit written consent. Apply watermarking or metadata standards (C2PA) to AI-generated media you publish. Check local laws — Germany (§ 201a StGB), UK, and several US states have enacted synthetic media legislation.
Vision hallucination & over-trust
Vision models confidently describe things that aren't there — especially in low-resolution images, partially obscured content, and out-of-distribution scenarios (unusual chart types, non-Latin scripts, medical imagery outside training distribution).
What this means for you
Always validate critical vision outputs against ground truth. Use low-resolution mode for classification tasks; high-resolution for reading text. Never use vision AI as sole input for safety-critical or legal decisions.
Copyright in AI-generated visual content
AI-generated images and video lack copyright protection in most jurisdictions (Thaler v. Perlmutter, 2025). However, the training data used to create these models — and the outputs they produce — may infringe on existing copyrights in ways that remain legally untested.
What this means for you
Review your tool's terms: Runway grants you commercial rights to outputs. Getty Images and Shutterstock offer AI generators trained on licensed data for higher commercial safety. Avoid prompts that explicitly reference specific copyrighted characters or styles.
Disclosure & transparency obligations
EU AI Act Article 50 requires disclosure when AI interacts with humans in voice or video (chatbots, AI avatars). US FTC guidelines require disclosure for AI-generated sponsored content. Failure to disclose synthetic voice or video in advertising is an emerging regulatory risk.
What this means for you
Add visible "AI-generated" labels to synthetic media. For voice AI in customer-facing products: announce "This is an AI assistant" at the start of interactions. Keep records of which content was AI-generated for at least 3 years.
Your 30-minute multimodal starter challenge
1. Vision: Upload a screenshot, photo of a handwritten note, or product image to ChatGPT or Claude. Use the document extraction template above. Verify the output against the original.
2. Transcription: Record a 2-minute voice memo. Run it through Whisper (free via whisper.ai or locally with whisper.cpp). Compare accuracy to what you said.
3. Video: Sign up for the Runway or Kling free tier. Generate a 5-second clip from a text prompt, and note where it fails: hands? Physics? Text?
Multimodal AI tools to explore
Apply what you've learned with these vision, voice, and video AI tools
ChatGPT
GPT-4o with vision — analyze images, documents, and screenshots.
Gemini
Google's multimodal model — strong on long video and multi-image reasoning.
Claude
Excellent for document layout, table extraction, and careful visual analysis.
ElevenLabs
Industry-leading voice synthesis and voice cloning.
Runway
Text-to-video and image-to-video generation up to 1080p.
Key Insights: What You've Learned
Multimodal AI processes images, audio, and video natively within a single transformer — vision models convert images into tokens (65 in low-res, up to 6,240 in high-res for GPT-4o) and your text prompt shapes which image regions the model attends to. Whisper achieves ~2.7% word error rate on English; ElevenLabs voice synthesis is indistinguishable from human voices 60% of the time (QMU 2025).
AI video generation (Runway Gen-4, Kling 3.0, Google Veo 2) is production-ready for short cinematic clips but still fails on hands, text in video, and long-form consistency. OpenAI Sora was discontinued in early 2026. Runway reported $300M ARR and a $5.3B valuation as of February 2026.
The EU AI Act (Article 50) requires disclosure whenever AI interacts with humans via voice or text in ways that could be mistaken for human interaction. Voice cloning without consent is illegal in multiple jurisdictions. Always label AI-generated media and keep records for at least 3 years.