// guide
Accessible Media: Captions, Transcripts, and Audio Description
Video and audio lock out anyone who cannot hear or cannot see them unless you provide alternatives. This guide covers who needs captions, transcripts, and audio description, what each WCAG media criterion requires, and how to wire captions and descriptions into an HTML5 video with the track element and WebVTT.
// 01 · who needs media alternatives
Who Needs Media Alternatives
A video or audio clip carries information through channels that not everyone can access. Three groups need alternatives, and each needs a different one:
- Deaf and hard-of-hearing users cannot hear the audio, so they need the sound as text — captions synchronized with the video, and a transcript to read or search.
- Blind and low-vision users cannot see the picture, so they need important visual information spoken — audio description narrated in the pauses, or a text alternative that includes it.
- Deaf-blind users need a transcript that combines dialogue, sound, and described visuals into text they can read with a braille display.
These alternatives help far more people than the groups they are designed for: captions let anyone watch in a noisy café or a silent library, transcripts are searchable and skimmable, and both are indexed by search engines. Media accessibility is one of the clearest cases where the accessible version is simply the better product.
// 02 · what wcag requires
What WCAG Requires
The WCAG 1.2 "Time-based Media" criteria spell out what each media type needs. The table lists the Level A and AA criteria, which are the ones a typical AA conformance target must meet.
| Criterion | Level | Requires |
|---|---|---|
| 1.2.1 Audio-only & Video-only | A | A transcript for audio-only; an audio track or text for video-only |
| 1.2.2 Captions (Prerecorded) | A | Synchronized captions for all prerecorded video with audio |
| 1.2.3 Audio Description or Media Alternative | A | Audio description or a full text alternative for prerecorded video |
| 1.2.4 Captions (Live) | AA | Real-time captions for live video with audio |
| 1.2.5 Audio Description (Prerecorded) | AA | Audio description for all prerecorded video (text alternative no longer sufficient) |
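The table can also be read as a small decision procedure. The sketch below encodes only the five rows above; the type names, field names, and the `requiredAlternatives` function are hypothetical and not defined by WCAG.

```typescript
// Illustrative names and shapes; none of this is defined by WCAG itself.
type Level = "A" | "AA";

interface MediaItem {
  hasAudio: boolean;
  hasVideo: boolean;
  live: boolean;
}

interface Requirement {
  criterion: string;
  level: Level;
  requires: string;
}

// Returns the rows of the table that apply to one media item at a target level.
function requiredAlternatives(media: MediaItem, target: Level): Requirement[] {
  const rows: Requirement[] = [];
  const synchronized = media.hasAudio && media.hasVideo;
  const singleChannel = media.hasAudio !== media.hasVideo;

  if (!media.live && singleChannel) {
    rows.push({
      criterion: "1.2.1",
      level: "A",
      requires: media.hasAudio ? "transcript" : "audio track or text alternative",
    });
  }
  if (!media.live && synchronized) {
    rows.push({ criterion: "1.2.2", level: "A", requires: "synchronized captions" });
    rows.push({ criterion: "1.2.3", level: "A", requires: "audio description or full text alternative" });
    rows.push({ criterion: "1.2.5", level: "AA", requires: "audio description" });
  }
  if (media.live && synchronized) {
    rows.push({ criterion: "1.2.4", level: "AA", requires: "real-time captions" });
  }
  // A Level AA target includes every Level A row as well.
  return rows.filter((row) => target === "AA" || row.level === "A");
}
```

For a prerecorded video with sound, a Level A target yields 1.2.2 and 1.2.3, and a Level AA target adds 1.2.5. Media types the table does not cover return an empty list, which means "not covered here", not "nothing required".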
// 03 · captions vs. subtitles
Captions vs. Subtitles
The distinction trips people up constantly, and it matters for conformance. Captions are for viewers who cannot hear: they include dialogue plus speaker labels and meaningful non-speech sound. Subtitles assume you can hear and only translate the dialogue into another language. Subtitles alone do not satisfy WCAG, because they drop the sound information a deaf viewer depends on.
| | Captions | Subtitles |
|---|---|---|
| Audience | Viewers who cannot hear the audio | Viewers who cannot understand the language |
| Includes dialogue | Yes | Yes |
| Includes sound & speaker labels | Yes ([door slams], Ana:) | No |
| Satisfies WCAG | Yes | No, on its own |
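The difference in the table can be turned into a rough review aid. The sketch below flags a track whose text contains neither bracketed sound cues nor `Name:` speaker labels, which suggests it may be dialogue-only. The function name and both patterns are assumptions based on the conventions used in this guide; a flag is a prompt for human review, not a conformance verdict.

```typescript
interface TrackCheck {
  hasSoundCues: boolean;
  hasSpeakerLabels: boolean;
  likelyDialogueOnly: boolean;
}

// Heuristic only: a single-speaker video with no meaningful sound can
// legitimately have neither marker, and "Note: ..." would look like a label.
function checkCaptionText(cueTexts: string[]): TrackCheck {
  // Non-speech sound written in square brackets, e.g. [door slams].
  const soundCue = /\[[^\]]+\]/;
  // A capitalized name followed by a colon at the start of a line, e.g. "Ana: ".
  const speakerLabel = /^[A-Z][A-Za-z .'-]{0,30}:\s/m;

  const hasSoundCues = cueTexts.some((text) => soundCue.test(text));
  const hasSpeakerLabels = cueTexts.some((text) => speakerLabel.test(text));

  return {
    hasSoundCues,
    hasSpeakerLabels,
    likelyDialogueOnly: !hasSoundCues && !hasSpeakerLabels,
  };
}
```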
"Closed" captions can be toggled on and off (the norm on the web); "open" captions are burned into the video and always visible. Prefer closed captions so users control them, and make sure they are accurate — correct words, punctuation, speaker changes, and sounds. Which brings us to the biggest real-world failure.
// 04 · transcripts and audio description
Transcripts and Audio Description
Captions cover the audio for people who cannot hear. Two more alternatives cover the rest.
A transcript is the complete text of the media — every spoken word, plus speaker labels and descriptions of important sound and, ideally, the described visuals. It is the one alternative that serves deaf-blind users (via a braille display), and it doubles as skimmable, searchable, indexable text. For audio-only content like a podcast, a transcript is all WCAG 1.2.1 requires. Place it near the media, or link to it clearly.
Audio description addresses the opposite gap: information carried by the picture that a blind viewer misses. A narrator describes key visuals — on-screen text, actions, scene changes — during natural pauses in the dialogue. If your video shows something meaningful that the soundtrack never says out loud ("she frowns and shakes her head", "the chart drops to zero"), it needs description to meet 1.2.5. The best defense is to script and narrate videos so they explain their own visuals, reducing how much separate description is needed.
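If captions and descriptions already exist as timed cues, a combined transcript can be assembled by interleaving them in time order. The sketch below is one possible approach; the `Cue` shape, the `buildTranscript` name, and the `[Description: ...]` wrapper are illustrative choices, not a standard transcript format.

```typescript
interface Cue {
  start: number; // seconds from the start of the media
  text: string;
}

// Merges caption and description cues into one plain-text transcript,
// ordered by start time, with described visuals marked in brackets.
function buildTranscript(captions: Cue[], descriptions: Cue[]): string {
  const captionLines = captions.map((cue) => ({ start: cue.start, line: cue.text }));
  const descriptionLines = descriptions.map((cue) => ({
    start: cue.start,
    line: "[Description: " + cue.text + "]",
  }));

  const merged = captionLines.concat(descriptionLines);
  merged.sort((a, b) => a.start - b.start);
  return merged.map((entry) => entry.line).join("\n");
}
```

The result is a starting point. A published transcript usually still needs an editing pass for paragraphing and readability.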
// 05 · implementing captions in html
Implementing Captions in HTML
For self-hosted HTML5 video, captions and descriptions are added with <track> elements pointing at WebVTT files. Use kind="captions" (not subtitles), set srclang, and add default to enable a track by default.
    <video controls width="640" preload="metadata">
      <source src="/talk.mp4" type="video/mp4">

      <!-- Captions: dialogue + sound, for deaf/HoH viewers -->
      <track kind="captions" src="/talk.en.vtt"
             srclang="en" label="English" default>

      <!-- Descriptions: visual info for blind viewers -->
      <track kind="descriptions" src="/talk.desc.en.vtt"
             srclang="en" label="English descriptions">
    </video>
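Tracks declared this way are also reachable from script, where each track has a `kind`, a `language`, and a `mode` of `"disabled"`, `"hidden"`, or `"showing"`. The sketch below shows how a custom captions button might switch tracks. `TrackLike` is a hypothetical structural subset of the DOM `TextTrack` interface so the function can be tried outside a browser, and `showCaptions` is an illustrative name.

```typescript
// Structural subset of the DOM TextTrack interface (an assumption made so
// the function can run without a browser).
interface TrackLike {
  kind: string;
  language: string;
  mode: "disabled" | "hidden" | "showing";
}

// Shows the captions track for `language` and disables other caption and
// subtitle tracks. Leaves everything untouched if no matching track exists.
function showCaptions(tracks: ArrayLike<TrackLike>, language: string): boolean {
  let target = -1;
  for (let i = 0; i < tracks.length; i++) {
    if (tracks[i].kind === "captions" && tracks[i].language === language) {
      target = i;
      break;
    }
  }
  if (target === -1) return false;

  for (let i = 0; i < tracks.length; i++) {
    const track = tracks[i];
    // Description, chapter, and metadata tracks are not touched.
    if (track.kind !== "captions" && track.kind !== "subtitles") continue;
    track.mode = i === target ? "showing" : "disabled";
  }
  return true;
}
```

In a browser, passing `video.textTracks` as the first argument should work, since that list is indexable and has a `length`. The native controls already provide a captions toggle, so this only matters for custom players.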
A WebVTT file is plain text: a WEBVTT header line, then cues separated by blank lines, each with a start and end timestamp followed by its text. Include speaker labels and non-speech sound in the caption text.
    WEBVTT

    00:00:00.500 --> 00:00:03.000
    [upbeat music]

    00:00:03.500 --> 00:00:07.200
    Ana: Welcome to the accessibility talk.

    00:00:07.500 --> 00:00:10.000
    Today we're covering captions and transcripts.
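When caption data lives in a spreadsheet or an editing tool, the file format above is simple enough to generate. The sketch below is a minimal serializer, assuming cues arrive as objects with times in seconds. `CaptionCue`, `formatTimestamp`, and `buildWebVtt` are hypothetical names, and the `Speaker: ` prefix follows this guide's labeling convention; it is not a WebVTT requirement.

```typescript
interface CaptionCue {
  start: number; // seconds
  end: number; // seconds
  text: string;
  speaker?: string;
}

function pad(value: number, width: number): string {
  let out = String(value);
  while (out.length < width) out = "0" + out;
  return out;
}

// Formats seconds as hh:mm:ss.ttt, the timestamp form used in the sample above.
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const hours = Math.floor(totalMs / 3600000);
  const minutes = Math.floor((totalMs % 3600000) / 60000);
  const secs = Math.floor((totalMs % 60000) / 1000);
  const ms = totalMs % 1000;
  return pad(hours, 2) + ":" + pad(minutes, 2) + ":" + pad(secs, 2) + "." + pad(ms, 3);
}

// Serializes cues: header line, blank line, then cues separated by blank lines.
function buildWebVtt(cues: CaptionCue[]): string {
  const blocks = cues.map((cue) => {
    if (cue.end <= cue.start) {
      throw new Error("Cue must end after it starts: " + cue.text);
    }
    const text = cue.speaker ? cue.speaker + ": " + cue.text : cue.text;
    return formatTimestamp(cue.start) + " --> " + formatTimestamp(cue.end) + "\n" + text;
  });
  return "WEBVTT\n\n" + blocks.join("\n\n") + "\n";
}
```

Calling `buildWebVtt` with the three cues from the sample reproduces that file, including the blank lines that separate cues.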
However the captions reach the player, the accuracy of the caption text is on you.
// 06 · common mistakes
Common Mistakes
- Shipping auto-captions unedited. Inaccurate machine captions fail 1.2.2. Treat them as a first draft and correct them.
- Using subtitles instead of captions. Dialogue-only tracks omit sound and speaker labels — they do not meet the standard.
- No transcript for a podcast. Audio-only content needs a transcript (1.2.1); a player alone is not enough.
- Ignoring visual-only information. On-screen text or actions the soundtrack never mentions lock out blind viewers — add description or self-describe (1.2.3 / 1.2.5).
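- Autoplaying media with sound. Startling and disorienting, and it can fail WCAG 1.4.2 Audio Control, which requires a way to pause, stop, or lower audio that plays automatically for more than 3 seconds. Let users start playback.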
- Captions that cover content or vanish too fast. Keep them readable — reasonable timing, good contrast, out of the way of on-screen text.
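The timing problem in the last item is one of the few on this list that can be checked mechanically. The sketch below flags cues that are on screen too briefly or that pack too much text into their duration. The function name and both thresholds (20 characters per second, 1 second minimum) are placeholder assumptions, not values taken from WCAG; tune them to whatever captioning style guide you follow.

```typescript
interface TimedCue {
  start: number; // seconds
  end: number; // seconds
  text: string;
}

interface TimingProblem {
  index: number;
  reason: "too short" | "too fast";
}

// Flags cues that vanish too quickly or demand too high a reading rate.
// The default thresholds are placeholders, not standardized limits.
function findTimingProblems(
  cues: TimedCue[],
  maxCharsPerSecond: number = 20,
  minDurationSeconds: number = 1
): TimingProblem[] {
  const problems: TimingProblem[] = [];
  cues.forEach((cue, index) => {
    const duration = cue.end - cue.start;
    if (duration < minDurationSeconds) {
      problems.push({ index, reason: "too short" });
    } else if (cue.text.length / duration > maxCharsPerSecond) {
      problems.push({ index, reason: "too fast" });
    }
  });
  return problems;
}
```

Contrast and placement still need a visual check; this only covers timing.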
Frequently Asked Questions
What is the difference between captions and subtitles?
Captions are written for people who cannot hear the audio, so they include not just dialogue but also speaker identification and important non-speech sound — [phone ringing], [ominous music], [crowd cheering]. Subtitles assume you can hear and just need the dialogue translated into another language, so they carry speech only. For accessibility you need captions; subtitles alone do not satisfy WCAG because they omit the sound information a deaf viewer relies on.
Do I need both captions and a transcript?
For video, captions are required and a transcript is strongly recommended alongside them, because they serve different needs. Captions are synchronized with the video so deaf and hard-of-hearing viewers can follow in real time. A transcript is the full text (including audio-described visual information) that deaf-blind users can read with a braille display, that anyone can skim or search, and that search engines can index. For audio-only content like a podcast, a transcript alone satisfies WCAG 1.2.1.
What is audio description?
Audio description is a narration track that describes important visual information a blind or low-vision viewer would otherwise miss — on-screen text, actions, scene changes, facial expressions — spoken during natural pauses in the dialogue. WCAG 1.2.3 (Level A) lets you meet the requirement with either audio description or a full text alternative for prerecorded video; 1.2.5 (Level AA) specifically requires audio description. If your video conveys meaning visually that the soundtrack does not, it needs description.
How do I add captions to an HTML5 video?
Add a <track> element inside your <video> pointing at a WebVTT (.vtt) file: <track kind="captions" src="captions.en.vtt" srclang="en" label="English" default>. Use kind="captions" (not subtitles) to declare that the track carries non-speech sound as well as dialogue; the attribute only labels the track, so the file itself must contain those cues. Set srclang to the language, and add default to turn the track on by default if appropriate. The browser's native player then exposes a captions toggle.
Are auto-generated captions good enough for accessibility?
As a starting point only. Auto-captions from YouTube or similar tools routinely miss punctuation, mangle names and technical terms, drop speaker changes, and ignore non-speech sound — the "craptions" problem. WCAG requires captions to be accurate and complete, so treat machine output as a first draft and edit it: fix errors, add punctuation and speaker labels, and include meaningful sounds. Auto-captions left uncorrected do not meet the standard.