Best Subtitles & Speech-to-Text APIs in 2026

Esa Landicho

Getting text from audio is a solved problem. Every major cloud provider has a transcription API, pricing is low, and integration takes an afternoon. The problem is that your users don't want text: they want a video with subtitles on it.

Between getting a transcript and getting styled, branded, burned-in subtitles on a video file, there are five more engineering problems: timestamp alignment, line break formatting, visual styling, rendering, and output. Most teams underestimate this until they're in production.

This comparison covers the full market: pure speech-to-text APIs (Deepgram, AssemblyAI, Whisper, Speechmatics) and styled subtitle APIs that output finished video (ZapCap, Submagic, Captions/Mirage, Shotstack, Creatomate, and VEED's Subtitle Styling API). Both categories, real pricing, honest tradeoffs.

Key takeaways:

There are two distinct categories: pure speech-to-text APIs that output text, and subtitle APIs that output styled video.
Best pure STT options: Deepgram (speed), AssemblyAI (accuracy and audio intelligence), Whisper (lowest cost), Speechmatics (multilingual).
Best styled subtitle APIs: ZapCap, Submagic, and Captions/Mirage all deliver animated captions with burned-in output.
Pure STT APIs stop at the transcript. Getting to a styled video requires five more steps: timing, formatting, styling, rendering, and output.
VEED's Subtitle Styling API handles the full chain in one call, using VEED's own preset library.

Speech-to-text API vs. subtitle API: what's the difference?

Step	Pure STT API	Subtitle / video API
1 Transcription	Yes	Yes
2 Timestamp alignment	Partial (word timestamps only)	Yes
3 Subtitle formatting (line breaks, char limits)	You build this	Yes
4 Visual styling (font, colour, animation preset)	You build this	Yes
5 Burn-in and render (embed into video)	You build this	Yes
6 Output video file	You manage this	Yes

If you're building a voice agent or live captioning system, a pure STT API is the right fit. If you're automating subtitles on video content, a subtitle API handles the parts that matter to your users and saves you weeks of rendering pipeline work.
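To make steps 2 and 3 of the table concrete, here is a minimal sketch of what "you build this" means once a pure STT API has returned word-level timestamps. It groups words into subtitle cues and serializes them as SRT. The `Word` structure, the 32-character limit, and the 0.8-second pause threshold are illustrative assumptions, not values taken from any provider.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_words(words: List[Word], max_chars: int = 32, max_gap: float = 0.8) -> List[List[Word]]:
    """Start a new cue when the line would get too long or the speaker pauses."""
    cues: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        if current:
            candidate = " ".join(w.text for w in current) + " " + word.text
            gap = word.start - current[-1].end
            if len(candidate) > max_chars or gap > max_gap:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def to_srt(words: List[Word], max_chars: int = 32, max_gap: float = 0.8) -> str:
    """Serialize grouped words as numbered SRT blocks."""
    blocks = []
    for index, cue in enumerate(group_words(words, max_chars, max_gap), start=1):
        start = format_timestamp(cue[0].start)
        end = format_timestamp(cue[-1].end)
        text = " ".join(w.text for w in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


words = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 0.9), Word("again", 2.5, 3.0)]
print(to_srt(words))
```

Even this simple version leaves out two-line layouts, punctuation-aware breaks, and minimum display durations, which is part of why the formatting step tends to be underestimated.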

Best speech-to-text APIs (transcription only)

API	Best for	Latency	Accuracy	Price/min	Languages	Real-time?
Deepgram Nova-3	Voice agents, speed	⚡ <300ms	~5% WER	~$0.022	30+	✓ Yes
AssemblyAI Universal-2	Accuracy + intelligence	Competitive	~8.4% WER	~$0.006	100+	✓ Yes (streaming)
OpenAI Whisper	Cost, open source	Batch only	Best benchmarks	$0.006 (API); free self-hosted	99+	✕ No
Speechmatics	Multilingual, on-premise	Sub-second	Strong	Contact sales; 480 free min/mo	55+	✓ Yes
Google Cloud	GCP users, 100+ languages	Competitive	Good	Tiered; 60 free min/mo	100+	✓ Yes
Azure Speech	Enterprise compliance	Competitive	Good	Tiered; 5 free hrs/mo	100+	✓ Yes
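The WER figures in the table depend heavily on the test audio, so it can be worth measuring on your own content. Word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of words in the reference transcript. The sketch below is a plain implementation of that definition. The lowercase-and-split normalization is a simplifying assumption, and real evaluations usually also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Running the same reference clips through each candidate API and comparing the scores gives a fairer picture than headline benchmark numbers.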

Deepgram: fastest, best for voice agents

‼️ Verdict: Sub-300ms latency with Nova-3. Best for real-time voice applications. Bundled Voice Agent API at $4.50/hr removes cost surprises.

Deepgram Nova-3 delivers real-time streaming transcription under 300ms, which is essential for voice agents where delays above 500ms feel unnatural. The Voice Agent API bundles STT, LLM orchestration, and TTS at a flat $4.50/hour, making cost predictable at scale. For batch video transcription, the async API handles large files efficiently. Text out only: if you need subtitles on a video, Deepgram is step one of six.

Best for: Real-time voice agents, live captioning, conversational AI.

Pricing: Batch ~$0.022/min; Voice Agent API $4.50/hr bundled. Verify at deepgram.com/pricing.

AssemblyAI: best accuracy and audio intelligence

‼️ Verdict: Universal-2 leads on accuracy with ~8.4% WER and 30% fewer hallucinations than Whisper Large-v3. Best for accuracy-critical workflows and audio intelligence features.

AssemblyAI Universal-2 delivers best-in-class accuracy on real-world audio, not just clean benchmarks. Beyond transcription, it adds sentiment analysis, PII redaction, topic detection, and speaker diarization in the same API call, making it the richest platform for applications where text quality matters downstream. AssemblyAI is also the transcription provider powering VEED's own online video editor, a practical signal of production-grade reliability on real video content.

Best for: Accuracy-critical content, audio intelligence, multi-speaker video.

Pricing: Batch ~$0.006/min ($0.37/hr); streaming $0.45/hr. Verify at assemblyai.com/pricing.

OpenAI Whisper: cheapest and open source

‼️ Verdict: Top accuracy in clean-audio benchmarks. $0.006/min via API, free if self-hosted. No real-time streaming.

Whisper holds top position on clean-audio accuracy benchmarks and supports 99+ languages without fine-tuning. At $0.006/min it matches AssemblyAI's batch rate as the cheapest managed transcription API in this comparison, and it is the only one that can also be self-hosted for free. The limitations are well-established: no native real-time streaming (batching workarounds add latency), and self-hosting the largest model requires GPU infrastructure. For cost-optimized batch transcription of video content where speed isn't critical, Whisper is the default.

Best for: Cost-optimized batch transcription, multilingual, open source.

Pricing: $0.006/min via OpenAI API; free self-hosted (GPU cost). Verify at openai.com/api/pricing.

Speechmatics: best for multilingual and on-premise

‼️ Verdict: 55+ languages with strong accent handling. Best deployment flexibility: cloud, on-premise, edge. 480 free minutes per month.

Speechmatics differentiates on language breadth and deployment flexibility. With 55+ languages and strong performance on diverse accents (particularly British English and regional dialects), it's the go-to for content teams targeting non-English markets. On-premise and edge deployment options make it the natural choice for regulated industries with data residency requirements. The 480 free minutes per month is the most generous trial tier in the STT category.

Best for: Multilingual content, diverse accents, on-premise or regulated deployment.

Pricing: 480 free min/mo; enterprise pricing via contact. Verify at speechmatics.com/pricing.

Best subtitle APIs: styled video output

These APIs accept a video file and return a video with subtitles burned in. They handle transcription, formatting, styling, and rendering: the full chain that pure STT APIs leave to you. If your end goal is a post-ready video, start here.
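For a sense of what the do-it-yourself alternative looks like, here is a minimal sketch of the burn-in step using ffmpeg's `subtitles` filter. ffmpeg is not named anywhere in this comparison; it is simply one common way to render an SRT file into a video, and the file names and font settings below are placeholders. The filter needs an ffmpeg build with libass, and paths containing characters such as colons or quotes need extra escaping that this sketch does not handle.

```python
import shlex
from typing import List


def build_burn_in_command(
    video_path: str,
    srt_path: str,
    output_path: str,
    font_name: str = "Arial",
    font_size: int = 24,
) -> List[str]:
    """Build an ffmpeg command that renders an SRT file into the video frames."""
    # force_style overrides the default subtitle styling applied by libass.
    style = f"FontName={font_name},FontSize={font_size}"
    video_filter = f"subtitles={srt_path}:force_style='{style}'"
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", video_filter,  # video is re-encoded with the subtitles drawn in
        "-c:a", "copy",       # audio is passed through untouched
        output_path,
    ]


command = build_burn_in_command("input.mp4", "captions.srt", "output.mp4")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it. This covers static text only: animated word highlights, emoji overlays, and branded presets are where the custom rendering work grows quickly.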

ZapCap: animated templates, strong styling

Verdict: Strong competitive option. Animated templates, word-highlight, emojis, custom fonts. PAYG at $0.10/min (split: $0.03 transcription + $0.07 rendering).

ZapCap is one of the most popular subtitle-focused APIs and a direct competitor to VEED's Subtitle Styling API. It supports animated caption templates, word-highlight (karaoke-style), emoji overlays, and custom fonts: the style features that drive engagement on social video. Pricing is transparent: $0.03/min for transcription, $0.07/min for rendering, totaling $0.10/min PAYG with 2x for 4K output.

Best for: Developers who need animated, social-ready subtitles with strong style options.

Pricing: $0.10/min PAYG ($0.03 transcript + $0.07 render); 2x rate for 4K. Verify at ZapCap's pricing page.

Submagic: animated captions with B-roll and silence removal

Verdict: Richest feature set in the subtitle API category: B-roll generation, silence removal, keyword highlights, emojis. Higher price on subscription; competitive PAYG add-on.

Submagic goes beyond subtitles into broader content automation: animated captions, keyword highlights, emoji overlays, B-roll integration, and silence removal in the same pipeline. The API is available as a PAYG credit add-on to Business+ plans ($41/month base), with per-minute pricing from $0.10 to $0.15. The subscription requirement adds friction for developers looking for pure PAYG access, but the feature set is the broadest in this category.

Best for: Social media content automation requiring captions plus B-roll and audio cleanup.

Pricing: Business+ plan required ($41/mo); PAYG API credits $0.10 to $0.15/min. Verify at Submagic's pricing page.

Captions / Mirage: best-looking animated caption templates

Verdict: Known for the highest-quality animated caption templates. $0.15/min PAYG. Strong styling quality.

Captions (also known as Mirage for the API product) has built a strong reputation for caption template quality. The animations, transitions, and visual polish are consistently cited as best-in-category. At $0.15/min PAYG it's priced at a slight premium to ZapCap, justified by template quality. The API applies styled captions to video files and returns a rendered output.

Best for: Premium animated subtitles where visual quality is the priority.

Pricing: $0.15/min PAYG. Verify at Captions' pricing page.

Shotstack and Creatomate: general video APIs with subtitle support

Verdict: Not subtitle-specialists. Subtitles are a feature within a broader video editing API. More flexible but less opinionated on caption styling.

Shotstack and Creatomate are general-purpose video editing APIs where subtitles are one capability among many. Both support SRT/VTT burn-in, custom fonts, colors, and position via JSON configuration. Creatomate supports word-by-word animated captions and custom templates via its template editor. Neither has the animated social-native presets of ZapCap or Submagic, but their broader scope makes them better fits for teams who need subtitle generation as part of a wider video production pipeline.

Best for: Teams needing subtitle generation as part of a full video editing workflow.

Pricing: Shotstack: PAYG $0.07 to $0.40/min; subscription $0.04 to $0.20/min. Creatomate: subscription only, $0.33 to $0.54/min. Verify on their pricing pages.

VEED's Subtitle Styling API: the full pipeline option

VEED's Subtitle Styling API is now available. Details reflect the confirmed product spec.

Every API above, whether STT or subtitle-focused, represents part of the workflow. VEED's Subtitle Styling API, launching Q2 2026, is built to handle the complete chain in a single call.

Versus pure STT APIs (Deepgram, AssemblyAI, Whisper): VEED doesn't stop at the transcript. It applies visual styling, renders the subtitles into the video file, and returns a finished, post-ready output. No rendering pipeline to build, no style system to maintain.

Versus subtitle-focused APIs (ZapCap, Submagic, Captions/Mirage): VEED's style presets are the same ones used in VEED's own editor, refined across millions of videos by real creators. The quality bar is set by VEED's own product, not a generic template library. And because VEED handles the full video pipeline (lip sync, background removal, AI generation, and now subtitles), teams can automate a complete localization and production workflow through one API.

How it works

1. Send a video URL and select a style preset

2. VEED transcribes the audio using enterprise-grade speech recognition

3. The Subtitle Styling API applies the preset: timing, formatting, styling, animations

4. VEED's render engine burns the subtitles into the video

5. Poll for the result and receive a finished, styled video

Teams with their own SRT file can skip transcription entirely: useful when auto-transcription output needs correction, or when translation has already been handled separately.
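The submit-then-poll flow above can be sketched as follows. This is not VEED's documented API: the payload fields (`video_url`, `preset`, `srt_url`), the status values (`processing`, `done`, `failed`), and the `output_url` field are hypothetical names chosen for illustration, and the HTTP call is abstracted behind a `get_status` callable so the polling logic can be shown without inventing endpoints.

```python
import time
from typing import Callable, Optional


def build_job_request(video_url: str, preset: str, srt_url: Optional[str] = None) -> dict:
    """Assemble a job payload. All field names here are hypothetical."""
    payload = {"video_url": video_url, "preset": preset}
    if srt_url is not None:
        # Supplying your own SRT corresponds to skipping the transcription step.
        payload["srt_url"] = srt_url
    return payload


def wait_for_render(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job finishes, fails, or the timeout is reached."""
    waited = 0.0
    while True:
        job = get_status(job_id)
        if job["status"] == "done":
            return job["output_url"]
        if job["status"] == "failed":
            raise RuntimeError(f"render job {job_id} failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"render job {job_id} not finished after {timeout_seconds}s")
        sleep(interval_seconds)
        waited += interval_seconds


# Demo with canned responses standing in for real HTTP calls.
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "done", "output_url": "https://example.com/output.mp4"},
])
result = wait_for_render(lambda job_id: next(_responses), "job-123", sleep=lambda s: None)
print(result)
```

In a real integration, `get_status` would wrap an authenticated request to whatever status endpoint the provider documents. Polling matters here because the limitations listed later note that webhooks are not part of the MVP.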

VEED's video API suite (available now)

VEED API overview: full documentation and API access
Lip Sync API: sync translated audio to video in 35+ languages
Background Remover API: remove backgrounds at scale
Fabric 1.0 API: generate AI video from a text prompt
Subtitle Styling API: automatic subtitle generation and burned-in styling

Best for: Automated video subtitle generation: styled, burned-in, post-ready, without managing a rendering pipeline.

Current limitations: Preset styles only at launch (no custom fonts or colors). No translation, SRT/VTT export, or webhooks in MVP.

Full pricing comparison: STT and subtitle APIs

API	Free tier	Batch/PAYG	Subscription	Self-hosted?	Notes
Deepgram	$200 credit	~$0.022/min	—	Enterprise	Fastest; voice agent bundle $4.50/hr
AssemblyAI	~$50 credit	~$0.006/min	~$0.37/hr	No	Best audio intelligence
OpenAI Whisper	—	$0.006/min	—	Yes (free)	Cheapest managed; no streaming
Speechmatics	480 min/mo	Contact sales	Contact sales	Enterprise	Best multilingual
Google Cloud	60 min/mo	Tiered	—	No	100+ languages; complex setup
Azure Speech	5 hrs/mo	Tiered	—	Limited	Enterprise compliance
ZapCap (subtitle)	—	$0.10/min	—	No	Animated templates; $0.03 transcript + $0.07 render
Submagic (subtitle)	—	$0.10–$0.15/min	$41/mo base	No	Business+ tier req'd for API
Captions / Mirage (subtitle)	—	$0.15/min	—	No	Animated captions; burn-in only
Shotstack (video API)	—	$0.07–$0.40/min	$0.04–$0.20/min	No	Full video API; subtitles as a feature
Creatomate (video API)	—	—	$0.33–$0.54/min	No	Subscription only; animated captions
VEED Subtitle API	TBC	Duration-based	—	No	Full pipeline incl. styling + render

Source: VEED internal pricing benchmark + official provider pricing pages, April 2026. Verify all pricing with each provider before committing. Subtitle API prices include transcription + rendering where applicable.

How to choose: decision framework

If you need…	Choose	Why
Real-time voice agent or conversational AI	Deepgram	Sub-300ms latency; bundled Voice Agent API
Best raw accuracy + audio intelligence	AssemblyAI	~8.4% WER; richest features; VEED's chosen provider
Lowest cost transcription (text out)	OpenAI Whisper	$0.006/min API or free self-hosted
Multilingual, diverse accents, on-premise	Speechmatics	55+ languages; flexible deployment; 480 free min/mo
Animated, social-native styled subtitles	ZapCap or Submagic	Best styling; $0.10–$0.15/min PAYG
Highest-quality animated caption templates	Captions / Mirage	Premium template quality; $0.15/min
Subtitles as part of a wider video workflow	Shotstack or Creatomate	General video API; flexible pipeline
Styled subtitles + full video pipeline in one API	VEED Subtitle Styling API (Q2 2026)	Only option covering transcription → style → render + VEED's full suite

Recap and final thoughts

Here's what to remember:

Two different categories: pure STT APIs output text; styled subtitle APIs output video. Know which one your use case actually needs before evaluating pricing.
For voice agents and live captioning: Deepgram leads on speed; AssemblyAI leads on accuracy and features; Whisper leads on cost.
For styled, burned-in subtitles: ZapCap ($0.10/min), Submagic ($0.10–$0.15/min), and Captions/Mirage ($0.15/min) are the current market. All animate well for social video.
VEED adds the layer they all miss: VEED's Subtitle Styling API handles the full pipeline — transcription, styling with VEED's own preset library, and render — in one call, and integrates with lip sync, background removal, and AI video generation in the same platform.
Pricing ranges from free to $0.54/min: calculate on your actual volume, not the headline rate. Add-ons, resolutions, and subscription requirements change the real number significantly.
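As a rough illustration of calculating on actual volume rather than the headline rate, the sketch below compares monthly cost at a given number of minutes using the per-minute rates quoted in this article, plus Submagic's $41/month Business+ base. It ignores free credits, 4K surcharges, and volume discounts, and uses the low end of Submagic's range. Those are simplifying assumptions, and the rates themselves should be re-checked against provider pricing pages.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Plan:
    name: str
    per_minute: float          # USD per minute of media
    monthly_fee: float = 0.0   # fixed subscription cost, if any

    def monthly_cost(self, minutes: float) -> float:
        return self.monthly_fee + self.per_minute * minutes


# Rates as quoted in this article; treat them as a snapshot, not current pricing.
PLANS = [
    Plan("Deepgram (batch)", 0.022),
    Plan("AssemblyAI (batch)", 0.006),
    Plan("OpenAI Whisper API", 0.006),
    Plan("ZapCap", 0.10),
    Plan("Submagic (low end)", 0.10, monthly_fee=41.0),
    Plan("Captions / Mirage", 0.15),
]


def compare(minutes: float) -> List[Tuple[str, float]]:
    """Return (plan name, monthly cost) pairs, cheapest first."""
    rows = [(plan.name, round(plan.monthly_cost(minutes), 2)) for plan in PLANS]
    return sorted(rows, key=lambda row: row[1])


for name, cost in compare(1000):
    print(f"{name:22s} ${cost:8.2f}")
```

At 1,000 minutes a month, Submagic's low-end rate works out to $141 once the subscription is included, an effective $0.141/min against a headline $0.10/min.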

FAQ

What is the best speech-to-text API in 2026?

It depends on your use case. Deepgram Nova-3 leads for real-time voice agents (sub-300ms latency). AssemblyAI Universal-2 leads for accuracy and audio intelligence. OpenAI Whisper leads on cost at $0.006/min. If your goal is styled subtitles burned into video rather than raw transcription, ZapCap, Submagic, and Captions/Mirage are the dedicated subtitle API category, with VEED's Subtitle Styling API covering the full pipeline including VEED's own preset library.

What is the cheapest transcription API?

OpenAI Whisper is the cheapest at $0.006/min via the API, or free if self-hosted (GPU compute cost only). Speechmatics includes 480 free minutes per month, the most generous free tier. AssemblyAI is also $0.006/min for batch processing. Styled subtitle APIs are more expensive ($0.10 to $0.15/min) because they include rendering in the price.

What is the difference between a transcription API and a subtitle API?

A transcription API converts speech to text and stops there. A subtitle API handles the full pipeline: transcription, timestamp alignment, subtitle line formatting, visual styling, and burning the subtitles into the video file. The five-step gap between 'getting text' and 'getting a styled video' is where subtitle-focused APIs like ZapCap, Submagic, and Captions/Mirage operate, and where VEED's Subtitle Styling API, now available, enters the market.

Which subtitle API has the best-looking animations?

Captions/Mirage is generally cited for the highest-quality animated caption templates. ZapCap and Submagic are close behind with strong word-highlight and animated presets well-suited to social video. Shotstack and Creatomate support basic subtitle styling via JSON, which is flexible but less visually opinionated. VEED's Subtitle Styling API applies VEED's own preset library, the same styles used in VEED's online video editor, refined across millions of creator videos.

Is there a free speech-to-text or subtitle API?

For transcription: Speechmatics includes 480 free minutes per month. Google Cloud gives 60 free minutes. Azure Speech gives 5 free hours. OpenAI Whisper is free to self-host. For styled subtitle APIs, most require PAYG or a subscription with no meaningful free tier. ZapCap, Submagic, and Captions/Mirage are all paid, though most offer a limited free trial for evaluation.

How does VEED's Subtitle Styling API work?

VEED's Subtitle Styling API accepts a video URL and a style preset, then handles transcription, subtitle formatting, visual styling, and rendering: returning a video file with VEED's styled subtitles burned in. You can also supply your own SRT file to skip transcription. Billing is duration-based per minute of input video. If you already have the other VEED APIs connected (Lip Sync, Background Remover, Fabric 1.0), the subtitle step integrates into the same pipeline. Documentation at veed.io/api.
