// guide

Accessible Media: Captions, Transcripts, and Audio Description

Video and audio lock out anyone who cannot hear or cannot see them unless you provide alternatives. This guide covers who needs captions, transcripts, and audio description, what each WCAG media criterion requires, and how to wire captions and descriptions into an HTML5 video with the track element and WebVTT.

beginner content

// 01 · who needs media alternatives

Who Needs Media Alternatives

A video or audio clip carries information through channels that not everyone can access. Three groups need alternatives, and each needs a different one:

Deaf and hard-of-hearing users cannot hear the audio, so they need the sound as text — captions synchronized with the video, and a transcript to read or search.
Blind and low-vision users cannot see the picture, so they need important visual information spoken — audio description narrated in the pauses, or a text alternative that includes it.
Deaf-blind users need a transcript that combines dialogue, sound, and described visuals into text they can read with a braille display.

These alternatives help far more people than the groups they are designed for: captions let anyone watch in a noisy café or a silent library, transcripts are searchable and skimmable, and both are indexed by search engines. Media accessibility is one of the clearest cases where the accessible version is simply the better product.

// 02 · what wcag requires

What WCAG Requires

The WCAG 1.2 "Time-based Media" criteria spell out what each media type needs. The Level A and AA rows below are the ones a standard conformance target must meet.

Criterion	Level	Requires
1.2.1 Audio-only & Video-only (Prerecorded)	A	A transcript for prerecorded audio-only; an audio track or text alternative for prerecorded video-only
1.2.2 Captions (Prerecorded)	A	Synchronized captions for all prerecorded video with audio
1.2.3 Audio Description or Media Alternative	A	Audio description or a full text alternative for prerecorded video
1.2.4 Captions (Live)	AA	Real-time captions for live video with audio
1.2.5 Audio Description (Prerecorded)	AA	Audio description for all prerecorded video (text alternative no longer sufficient)

The short version for AA: prerecorded video with sound needs captions (1.2.2) and audio description (1.2.5). Audio-only content needs a transcript (1.2.1). Live video needs live captions (1.2.4). Everything beyond that, such as sign language and extended description, is AAA.
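The summary above can be written down as a small lookup. This is only a sketch: the criterion numbers and levels come from the table, while the `Media` class, the `required_alternatives` function, and the wording of the returned strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Media:
    """One media item. Field names are illustrative, not WCAG terms."""
    has_audio: bool
    has_video: bool
    live: bool = False


def required_alternatives(media: Media, level: str = "AA") -> list[str]:
    """List the WCAG 1.2 alternatives one item needs at Level A or AA."""
    if level not in ("A", "AA"):
        raise ValueError("this sketch only covers Level A and AA")
    needs: list[str] = []
    if media.live:
        # Live video with audio only picks up a requirement at AA (1.2.4).
        if media.has_audio and media.has_video and level == "AA":
            needs.append("1.2.4 live captions")
        return needs
    if media.has_audio and media.has_video:
        needs.append("1.2.2 captions")
        if level == "AA":
            # Audio description (1.2.5) also satisfies 1.2.3.
            needs.append("1.2.5 audio description")
        else:
            needs.append("1.2.3 audio description or full text alternative")
    elif media.has_audio:
        needs.append("1.2.1 transcript")
    elif media.has_video:
        needs.append("1.2.1 audio track or text alternative")
    return needs


print(required_alternatives(Media(has_audio=True, has_video=True)))
print(required_alternatives(Media(has_audio=True, has_video=False)))
```

A lookup like this is handy in a content checklist or publishing pipeline, but it only names what is owed; it says nothing about whether the captions or descriptions you ship are any good.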

// 03 · captions vs. subtitles

Captions vs. Subtitles

The distinction trips people up constantly, and it matters for conformance. Captions are for viewers who cannot hear: they include dialogue plus speaker labels and meaningful non-speech sound. Subtitles assume you can hear and only translate the dialogue into another language. Subtitles alone do not satisfy WCAG, because they drop the sound information a deaf viewer depends on.

	Captions	Subtitles
Audience	Viewers who cannot hear the audio	Viewers who cannot understand the language
Includes dialogue	Yes	Yes
Includes sound & speaker labels	Yes — `[door slams]`, `Ana:`	No
Satisfies WCAG	Yes	No, on its own

"Closed" captions can be toggled on and off (the norm on the web); "open" captions are burned into the video and always visible. Prefer closed captions so users control them, and make sure they are accurate — correct words, punctuation, speaker changes, and sounds. Which brings us to the biggest real-world failure.

Auto-captions are a draft, not a deliverable. Machine-generated captions miss punctuation, mishear names and jargon, and skip non-speech sound, which is why they are nicknamed "craptions." WCAG requires captions to be accurate and complete, so always edit auto-generated output before publishing.
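A quick automated pass can catch caption files that were never edited. The sketch below is a heuristic, not a conformance test: it assumes sounds are written in square brackets and speakers as a `Name:` prefix (the conventions used in this guide), and the function name `caption_review_flags` is made up for illustration. A clean result does not prove the captions are accurate; a flag just tells a human where to look.

```python
import re

# Conventions assumed from this guide: [door slams] for sound, "Ana: " for speakers.
SOUND_CUE = re.compile(r"\[[^\]]+\]")
SPEAKER_LABEL = re.compile(r"^[A-Z][\w .'-]{0,30}:\s")
SENTENCE_PUNCTUATION = re.compile(r"[.?!]")


def caption_review_flags(cue_texts: list[str]) -> list[str]:
    """Return signs that a caption track may be unedited machine output."""
    flags: list[str] = []
    if not any(SOUND_CUE.search(text) for text in cue_texts):
        flags.append("no bracketed sound cues")
    if not any(SPEAKER_LABEL.match(text) for text in cue_texts):
        flags.append("no speaker labels")
    if not any(SENTENCE_PUNCTUATION.search(text) for text in cue_texts):
        flags.append("no sentence punctuation")
    return flags


edited = [
    "[upbeat music]",
    "Ana: Welcome to the accessibility talk.",
    "Today we're covering captions and transcripts.",
]
raw = [
    "welcome to the accessibility talk",
    "today we're covering captions and transcripts",
]
print(caption_review_flags(edited))
print(caption_review_flags(raw))
```

A video with a single speaker and no meaningful non-speech sound can legitimately trip the first two flags, so treat the output as a prompt for review rather than a verdict.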

// 04 · transcripts and audio description

Transcripts and Audio Description

Captions cover the audio for people who cannot hear. Two more alternatives cover the rest.

A transcript is the complete text of the media — every spoken word, plus speaker labels and descriptions of important sound and, ideally, the described visuals. It is the one alternative that serves deaf-blind users (via a braille display), and it doubles as skimmable, searchable, indexable text. For audio-only content like a podcast, a transcript is all WCAG 1.2.1 requires. Place it near the media, or link to it clearly.
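If you already have edited captions, they make a reasonable first draft of a transcript. The following sketch, with a hypothetical function name `vtt_to_transcript`, strips the timing from a WebVTT file and keeps the cue text. It assumes a simple file with blank lines between cues, and it does not add the described visuals, which still have to be written by a person.

```python
def vtt_to_transcript(vtt: str) -> str:
    """Turn WebVTT caption text into a plain-text transcript draft."""
    blocks = vtt.replace("\r\n", "\n").strip().split("\n\n")
    lines: list[str] = []
    for block in blocks:
        rows = block.split("\n")
        for index, row in enumerate(rows):
            # The timing row marks a cue; everything after it is the cue text.
            if "-->" in row:
                text = " ".join(r.strip() for r in rows[index + 1:] if r.strip())
                if text:
                    lines.append(text)
                break
        # Blocks without a timing row (the header, NOTE comments) are skipped.
    return "\n".join(lines)


sample = """WEBVTT

00:00:00.500 --> 00:00:03.000
[upbeat music]

00:00:03.500 --> 00:00:07.200
Ana: Welcome to the accessibility talk.
"""
print(vtt_to_transcript(sample))
```

The result keeps speaker labels and sound cues because they are already in the caption text, which is another reason to write real captions rather than dialogue-only subtitles.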

Audio description addresses the opposite gap: information carried by the picture that a blind viewer misses. A narrator describes key visuals — on-screen text, actions, scene changes — during natural pauses in the dialogue. If your video shows something meaningful that the soundtrack never says out loud ("she frowns and shakes her head", "the chart drops to zero"), it needs description to meet 1.2.5. The best defense is to script and narrate videos so they explain their own visuals, reducing how much separate description is needed.

Design for self-description. When you control the script, have speakers voice what they show: "as you can see in this red error banner at the top" instead of "as you can see here." Self-describing narration serves blind viewers without a separate description track and makes the content clearer for everyone.

// 05 · implementing captions in html

Implementing Captions in HTML

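For self-hosted HTML5 video, captions and descriptions are added with <track> elements pointing at WebVTT files. Use kind="captions" (not subtitles), set srclang, and add default to enable a track by default. Be aware that browsers generally do not read kind="descriptions" cues aloud on their own, so a description track typically relies on a player or script that voices it; the alternative is to publish a version of the video with the description mixed into the audio.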

<video controls width="640" preload="metadata">
  <source src="/talk.mp4" type="video/mp4">

  <!-- Captions: dialogue + sound, for deaf/HoH viewers -->
  <track kind="captions" src="/talk.en.vtt"
         srclang="en" label="English" default>

  <!-- Descriptions: visual info for blind viewers -->
  <track kind="descriptions" src="/talk.desc.en.vtt"
         srclang="en" label="English descriptions">
</video>

A WebVTT file is plain text: the header, then cues with start and end timestamps. Include speaker labels and non-speech sound in the caption text.

WEBVTT

00:00:00.500 --> 00:00:03.000
[upbeat music]

00:00:03.500 --> 00:00:07.200
Ana: Welcome to the accessibility talk.

00:00:07.500 --> 00:00:10.000
Today we're covering captions and transcripts.
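Hand-edited caption files break in predictable ways: a missing header, SRT-style commas in the timestamps, or a cue that ends before it starts. The sketch below checks for those three things plus cue ordering. It is a minimal lint under stated assumptions, not a full WebVTT validator, and `check_vtt` and its messages are hypothetical names.

```python
import re

# A timestamp is mm:ss.ttt or hh:mm:ss.ttt (hours are optional).
TIMESTAMP = r"(?:(\d{2,}):)?(\d{2}):(\d{2})\.(\d{3})"
TIMING_LINE = re.compile(rf"^{TIMESTAMP}\s+-->\s+{TIMESTAMP}")


def _seconds(hours, minutes, seconds, millis) -> float:
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def check_vtt(vtt: str) -> list[str]:
    """Return a list of problems found in WebVTT text (empty if none)."""
    lines = vtt.replace("\r\n", "\n").split("\n")
    problems: list[str] = []
    if not lines[0].lstrip("\ufeff").startswith("WEBVTT"):
        problems.append("line 1: missing WEBVTT header")
    latest_start = 0.0
    for number, line in enumerate(lines, start=1):
        if "-->" not in line:
            continue
        match = TIMING_LINE.match(line.strip())
        if not match:
            problems.append(f"line {number}: malformed cue timing")
            continue
        groups = match.groups()
        start, end = _seconds(*groups[:4]), _seconds(*groups[4:])
        if end <= start:
            problems.append(f"line {number}: cue ends at or before it starts")
        if start < latest_start:
            problems.append(f"line {number}: cue starts before an earlier cue")
        latest_start = max(latest_start, start)
    return problems


good = "WEBVTT\n\n00:00:00.500 --> 00:00:03.000\n[upbeat music]\n"
bad = "00:00:05.000 --> 00:00:04.000\nHi\n\n00:00:01,000 --> 00:00:02,000\nSRT style\n"
print(check_vtt(good))
print(check_vtt(bad))
```

Running a check like this before upload catches the common case of an SRT file renamed to .vtt, where the comma decimal separator makes cues fail to parse.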

Hosted platforms handle the plumbing. If you publish through YouTube, Vimeo, or another video platform, you upload a caption file (or edit the auto-captions) and the player exposes the toggle for you: the same WebVTT/SRT content, without hand-writing the <track> markup. Either way, the accuracy of the caption text is on you.

// 06 · common mistakes

Common Mistakes

Shipping auto-captions unedited. Inaccurate machine captions fail 1.2.2. Treat them as a first draft and correct them.
Using subtitles instead of captions. Dialogue-only tracks omit sound and speaker labels — they do not meet the standard.
No transcript for a podcast. Audio-only content needs a transcript (1.2.1); a player alone is not enough.
Ignoring visual-only information. On-screen text or actions the soundtrack never mentions lock out blind viewers — add description or self-describe (1.2.3 / 1.2.5).
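Autoplaying media with sound. Startling and disorienting, and it can drown out a screen reader. Audio that plays automatically for more than 3 seconds needs a way to pause, stop, or lower it (1.4.2 Audio Control), and autoplaying video can also fall under 2.2.2 Pause, Stop, Hide. Let users start playback.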
Captions that cover content or vanish too fast. Keep them readable — reasonable timing, good contrast, out of the way of on-screen text.
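The "vanish too fast" mistake is the easiest one in this list to check mechanically. The sketch below flags cues whose text is too long for the time they stay on screen. The thresholds (20 characters per second, 1 second minimum) are illustrative assumptions, not values from WCAG, and `cues_too_fast` is a hypothetical name; tune the limits to your own style guide.

```python
def cues_too_fast(cues, max_chars_per_second=20.0, min_seconds=1.0):
    """Return (start, chars_per_second) for cues that are hard to read in time.

    `cues` is an iterable of (start_seconds, end_seconds, text) tuples.
    Both limits are assumed defaults for illustration, not WCAG requirements.
    """
    flagged = []
    for start, end, text in cues:
        duration = end - start
        # A zero or negative duration is treated as infinitely fast.
        rate = len(text) / duration if duration > 0 else float("inf")
        if duration < min_seconds or rate > max_chars_per_second:
            flagged.append((start, round(rate, 1)))
    return flagged


cues = [
    (0.5, 3.0, "[upbeat music]"),
    (3.5, 7.2, "Ana: Welcome to the accessibility talk."),
    (7.5, 8.0, "Today we're covering captions and transcripts."),
]
print(cues_too_fast(cues))
```

Here only the last cue is flagged: 46 characters shown for half a second. Contrast and placement still need a visual check in the actual player.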

Where this fits. Media alternatives are content work as much as code, so run them through the content checklist and the developer checklist. For the text-alternative mindset applied to images, see the images and alt text guide; for the full standard, the WCAG 2.2 overview.

Frequently asked questions

What is the difference between captions and subtitles?

Captions are written for people who cannot hear the audio, so they include not just dialogue but also speaker identification and important non-speech sound — [phone ringing], [ominous music], [crowd cheering]. Subtitles assume you can hear and just need the dialogue translated into another language, so they carry speech only. For accessibility you need captions; subtitles alone do not satisfy WCAG because they omit the sound information a deaf viewer relies on.

Do I need both captions and a transcript?

For video, captions are the requirement and a transcript is strongly recommended, because they serve different needs. Captions are synchronized with the video so deaf and hard-of-hearing viewers can follow in real time. A transcript is the full text (including audio-described visual information) that deaf-blind users can read with a braille display, that anyone can skim or search, and that search engines can index. For audio-only content like a podcast, a transcript alone satisfies WCAG 1.2.1.

What is audio description?

Audio description is a narration track that describes important visual information a blind or low-vision viewer would otherwise miss — on-screen text, actions, scene changes, facial expressions — spoken during natural pauses in the dialogue. WCAG 1.2.3 (Level A) lets you meet the requirement with either audio description or a full text alternative for prerecorded video; 1.2.5 (Level AA) specifically requires audio description. If your video conveys meaning visually that the soundtrack does not, it needs description.

How do I add captions to an HTML5 video?

Add a <track> element inside your <video> pointing at a WebVTT (.vtt) file: <track kind="captions" src="captions.en.vtt" srclang="en" label="English" default>. Use kind="captions" (not subtitles) to mark the track as one that includes non-speech sound; the attribute only labels the track, so the sounds and speaker labels must actually be written into the cue text. Set srclang to the language, and add default to turn captions on by default if appropriate. With the controls attribute on the <video>, the browser's native player then exposes a captions toggle.

Are auto-generated captions good enough for accessibility?

As a starting point only. Auto-captions from YouTube or similar tools routinely miss punctuation, mangle names and technical terms, drop speaker changes, and ignore non-speech sound — the "craptions" problem. WCAG requires captions to be accurate and complete, so treat machine output as a first draft and edit it: fix errors, add punctuation and speaker labels, and include meaningful sounds. Auto-captions left uncorrected do not meet the standard.