Best Subtitles & Speech-to-Text APIs in 2026

Getting text from audio is a solved problem. Every major cloud provider has a transcription API, pricing is low, and integration takes an afternoon. The problem is that your users don't want text: they want a video with subtitles on it.

Between getting a transcript and getting styled, branded, burned-in subtitles on a video file, there are five more engineering problems: timestamp alignment, line break formatting, visual styling, rendering, and output. Most teams underestimate this until they're in production.

This comparison covers the full market: pure speech-to-text APIs (Deepgram, AssemblyAI, Whisper, Speechmatics) and styled subtitle APIs that output finished video (ZapCap, Submagic, Captions/Mirage, Shotstack, Creatomate, and VEED's upcoming Subtitle Styling API). Both categories, real pricing, honest tradeoffs.

Key takeaways:

  • There are two distinct categories: pure speech-to-text APIs that output text, and subtitle APIs that output styled video.
  • Best pure STT options: Deepgram (speed), AssemblyAI (accuracy and audio intelligence), Whisper (lowest cost), Speechmatics (multilingual).
  • Best styled subtitle APIs: ZapCap, Submagic, and Captions/Mirage all deliver animated captions with burned-in output.
  • Pure STT APIs stop at the transcript. Getting to a styled video requires five more steps: timing, formatting, styling, burn-in, and rendering.
  • VEED's Subtitle Styling API (Q2 2026) handles the full chain in one call, using VEED's own preset library.

Speech-to-text API vs. subtitle API: what's the difference?

| Step | Pure STT API | Subtitle / video API |
|---|---|---|
| 1. Transcription | Yes | Yes |
| 2. Timestamp alignment | Partial (word timestamps only) | Yes |
| 3. Subtitle formatting (line breaks, char limits) | You build this | Yes |
| 4. Visual styling (font, colour, animation preset) | You build this | Yes |
| 5. Burn-in and render (embed into video) | You build this | Yes |
| 6. Output video file | You manage this | Yes |

If you're building a voice agent or live captioning system, a pure STT API is the right fit. If you're automating subtitles on video content, a subtitle API handles the parts that matter to your users and saves you weeks of rendering pipeline work.
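To make steps 2 and 3 concrete, here is a minimal sketch of what "you build this" can look like: packing word-level timestamps into SRT cues. The `Word` type, the 32-character limit, and the 0.8-second pause threshold are illustrative assumptions, not values taken from any provider listed here.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def group_words(words, max_chars=32, max_gap=0.8):
    """Greedily pack words into cues; break on line length or a long pause."""
    cues, current = [], []
    for w in words:
        if current:
            line = " ".join(x.text for x in current)
            too_long = len(line) + 1 + len(w.text) > max_chars
            long_pause = w.start - current[-1].end > max_gap
            if too_long or long_pause:
                cues.append(current)
                current = []
        current.append(w)
    if current:
        cues.append(current)
    return cues


def to_srt(words, max_chars=32, max_gap=0.8) -> str:
    """Render grouped cues as SRT text (index, time range, cue text)."""
    blocks = []
    for i, cue in enumerate(group_words(words, max_chars, max_gap), start=1):
        text = " ".join(w.text for w in cue)
        blocks.append(
            f"{i}\n{srt_time(cue[0].start)} --> {srt_time(cue[-1].end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

A production formatter would also need to handle punctuation-aware breaks, two-line cues, and minimum display times, which is part of why this step tends to be underestimated.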

Best speech-to-text APIs (transcription only)

| API | Best for | Latency | Accuracy | Price/min | Languages | Real-time? |
|---|---|---|---|---|---|---|
| Deepgram Nova-3 | Voice agents, speed | ⚡ <300ms | ~5% WER | ~$0.022 | 30+ | ✓ Yes |
| AssemblyAI Universal-2 | Accuracy + intelligence | Competitive | ~8.4% WER | ~$0.006 | 100+ | ✓ Yes (streaming) |
| OpenAI Whisper | Cost, open source | Batch only | Best benchmarks | $0.006 API; free self-hosted | 99+ | ✕ No |
| Speechmatics | Multilingual, on-premise | Sub-second | Strong | 480 free min/mo | 55+ | ✓ Yes |
| Google Cloud | GCP users, 100+ languages | Competitive | Good | Tiered; 60 free min/mo | 100+ | ✓ Yes |
| Azure Speech | Enterprise compliance | Competitive | Good | Tiered; 5 free hrs/mo | 100+ | ✓ Yes |
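The WER (word error rate) figures in the table are the share of reference words that a transcript gets wrong: substitutions, deletions, and insertions divided by the number of reference words. A minimal sketch of that calculation follows, assuming simple lowercase whitespace tokenization. Published benchmarks usually apply heavier text normalization and use different test sets, so WER figures from different vendors are not directly comparable.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Running it on your own audio with hand-corrected reference transcripts is a more reliable guide than headline numbers.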

Deepgram: fastest, best for voice agents

‼️ Verdict: Sub-300ms latency with Nova-3. Best for real-time voice applications. Bundled Voice Agent API at $4.50/hr removes cost surprises.

Deepgram Nova-3 delivers real-time streaming transcription under 300ms, which is essential for voice agents where delays above 500ms feel unnatural. The Voice Agent API bundles STT, LLM orchestration, and TTS at a flat $4.50/hour, making cost predictable at scale. For batch video transcription, the async API handles large files efficiently. Text out only: if you need subtitles on a video, Deepgram is step one of six.

Best for: Real-time voice agents, live captioning, conversational AI.

Pricing: Batch ~$0.022/min; Voice Agent API $4.50/hr bundled. Verify at deepgram.com/pricing.

AssemblyAI: best accuracy and audio intelligence

‼️ Verdict: Universal-2 leads on accuracy with ~8.4% WER and 30% fewer hallucinations than Whisper Large-v3. Best for accuracy-critical workflows and audio intelligence features.

AssemblyAI Universal-2 delivers best-in-class accuracy on real-world audio, not just clean benchmarks. Beyond transcription, it adds sentiment analysis, PII redaction, topic detection, and speaker diarization in the same API call, making it the richest platform for applications where text quality matters downstream. AssemblyAI is also the transcription provider powering VEED's own online video editor, a practical signal of production-grade reliability on real video content.

Best for: Accuracy-critical content, audio intelligence, multi-speaker video.

Pricing: Batch ~$0.006/min ($0.37/hr); streaming $0.45/hr. Verify at assemblyai.com/pricing.

OpenAI Whisper: cheapest and open source

‼️ Verdict: Top accuracy in clean-audio benchmarks. $0.006/min via API, free if self-hosted. No real-time streaming.

Whisper holds top position on clean-audio accuracy benchmarks and supports 99+ languages without fine-tuning. At $0.006/min via the OpenAI API, it matches AssemblyAI's batch rate as the cheapest managed transcription option in this comparison. The limitations are well-established: no native real-time streaming (batching workarounds add latency), and self-hosting the largest model requires GPU infrastructure. For cost-optimized batch transcription of video content where speed isn't critical, Whisper is the default.

Best for: Cost-optimized batch transcription, multilingual, open source.

Pricing: $0.006/min via OpenAI API; free self-hosted (GPU cost). Verify at openai.com/api/pricing.
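For the self-hosted route, a minimal sketch using the open-source `openai-whisper` Python package is shown below. The helper names `transcribe_words` and `flatten_words` are illustrative, the `"base"` model size is an arbitrary choice, and running it requires the package, ffmpeg, and ideally a GPU.

```python
def flatten_words(result: dict):
    """Collect (word, start, end) tuples from a Whisper-style result dict."""
    words = []
    for segment in result.get("segments", []):
        for w in segment.get("words", []):
            words.append((w["word"].strip(), float(w["start"]), float(w["end"])))
    return words


def transcribe_words(audio_path: str, model_name: str = "base"):
    """Run Whisper locally and return flat word-level timings."""
    # Imported lazily because the package is heavy and optional:
    #   pip install openai-whisper
    import whisper

    model = whisper.load_model(model_name)
    # word_timestamps=True adds a "words" list to each segment.
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)
```

The output is still text plus timings: formatting, styling, and rendering remain separate work.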

Speechmatics: best for multilingual and on-premise

‼️ Verdict: 55+ languages with strong accent handling. Best deployment flexibility: cloud, on-premise, edge. 480 free minutes per month.

Speechmatics differentiates on language breadth and deployment flexibility. With 55+ languages and strong performance on diverse accents (particularly British English and regional dialects), it's the go-to for content teams targeting non-English markets. On-premise and edge deployment options make it the natural choice for regulated industries with data residency requirements. The 480 free minutes per month is the most generous trial tier in the STT category.

Best for: Multilingual content, diverse accents, on-premise or regulated deployment.

Pricing: 480 free min/mo; enterprise pricing via contact. Verify at speechmatics.com/pricing.

Best subtitle APIs: styled video output

These APIs accept a video file and return a video with subtitles burned in. They handle transcription, formatting, styling, and rendering: the full chain that pure STT APIs leave to you. If your end goal is a post-ready video, start here.
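For comparison, the most basic do-it-yourself version of the burn-in step can be sketched with ffmpeg's `subtitles` filter. This assumes an ffmpeg build with libass and simple file names (paths containing colons, quotes, or backslashes need extra escaping). The function name and default font are illustrative, and the result is static text, not the animated presets these APIs provide.

```python
import subprocess


def burn_in_command(video: str, srt: str, output: str,
                    font: str = "Arial", size: int = 24) -> list:
    """Build an ffmpeg command that renders an SRT file onto the video."""
    style = f"FontName={font},FontSize={size}"
    # Single quotes protect the commas in force_style from ffmpeg's
    # filtergraph parser; no shell is involved, so they reach ffmpeg as-is.
    video_filter = f"subtitles={srt}:force_style='{style}'"
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-vf", video_filter,
        "-c:a", "copy",  # re-encode video only; keep the audio stream as-is
        output,
    ]


def burn_in(video: str, srt: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(burn_in_command(video, srt, output), check=True)
```

Even this simple version brings queueing, scaling, and failure handling with it once it runs on user uploads, which is the pipeline work a subtitle API takes off your plate.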

ZapCap: animated templates, strong styling

Verdict: Strong competitive option. Animated templates, word-highlight, emojis, custom fonts. PAYG at $0.10/min (split: $0.03 transcription + $0.07 rendering).

ZapCap is one of the most popular subtitle-focused APIs and a direct competitor to VEED's upcoming Subtitle Styling API. It supports animated caption templates, word-highlight (karaoke-style), emoji overlays, and custom fonts: the style features that drive engagement on social video. Pricing is transparent: $0.03/min for transcription, $0.07/min for rendering, totaling $0.10/min PAYG with 2x for 4K output.

Best for: Developers who need animated, social-ready subtitles with strong style options.

Pricing: $0.10/min PAYG ($0.03 transcript + $0.07 render); 2x rate for 4K. Verify at ZapCap's pricing page.

Submagic: animated captions with B-roll and silence removal

Verdict: Richest feature set in the subtitle API category: B-roll generation, silence removal, keyword highlights, emojis. Higher price on subscription; competitive PAYG add-on.

Submagic goes beyond subtitles into broader content automation: animated captions, keyword highlights, emoji overlays, B-roll integration, and silence removal in the same pipeline. The API is available as a PAYG credit add-on to Business+ plans ($41/month base), with per-minute pricing from $0.10 to $0.15. The subscription requirement adds friction for developers looking for pure PAYG access, but the feature set is the broadest in this category.

Best for: Social media content automation requiring captions plus B-roll and audio cleanup.

Pricing: Business+ plan required ($41/mo); PAYG API credits $0.10 to $0.15/min. Verify at Submagic's pricing page.

Captions / Mirage: best-looking animated caption templates

Verdict: Known for the highest-quality animated caption templates. $0.15/min PAYG. Strong styling quality.

Captions (also known as Mirage for the API product) has built a strong reputation for caption template quality. The animations, transitions, and visual polish are consistently cited as best-in-category. At $0.15/min PAYG it's priced at a slight premium to ZapCap, justified by template quality. The API applies styled captions to video files and returns a rendered output.

Best for: Premium animated subtitles where visual quality is the priority.

Pricing: $0.15/min PAYG. Verify at Captions' pricing page.

Shotstack and Creatomate: general video APIs with subtitle support

Verdict: Not subtitle-specialists. Subtitles are a feature within a broader video editing API. More flexible but less opinionated on caption styling.

Shotstack and Creatomate are general-purpose video editing APIs where subtitles are one capability among many. Both support SRT/VTT burn-in, custom fonts, colors, and position via JSON configuration. Creatomate supports word-by-word animated captions and custom templates via its template editor. Neither has the animated social-native presets of ZapCap or Submagic, but their broader scope makes them better fits for teams who need subtitle generation as part of a wider video production pipeline.

Best for: Teams needing subtitle generation as part of a full video editing workflow.

Pricing: Shotstack: PAYG $0.07 to $0.40/min; subscription $0.04 to $0.20/min. Creatomate: subscription only, $0.33 to $0.54/min. Verify on their pricing pages.

VEED's Subtitle Styling API: the full pipeline option

Coming Q2 2026: VEED's Subtitle Styling API is in development. Details reflect the confirmed product spec. The API is not yet publicly available.

Every API above, whether STT or subtitle-focused, represents part of the workflow. VEED's Subtitle Styling API, launching Q2 2026, is built to handle the complete chain in a single call.

Versus pure STT APIs (Deepgram, AssemblyAI, Whisper): VEED doesn't stop at the transcript. It applies visual styling, renders the subtitles into the video file, and returns a finished, post-ready output. No rendering pipeline to build, no style system to maintain.

Versus subtitle-focused APIs (ZapCap, Submagic, Captions/Mirage): VEED's style presets are the same ones used in VEED's own editor, refined across millions of videos by real creators. The quality bar is set by VEED's own product, not a generic template library. And because VEED handles the full video pipeline (lip sync, background removal, AI generation, and now subtitles), teams can automate a complete localization and production workflow through one API.

How it works

1. Send a video URL and select a style preset

2. VEED transcribes the audio using enterprise-grade speech recognition

3. The Subtitle Styling API applies the preset: timing, formatting, styling, animations

4. VEED's render engine burns the subtitles into the video

5. Poll for the result and receive a finished, styled video

Teams with their own SRT file can skip transcription entirely: useful when auto-transcription output needs correction, or when translation has already been handled separately.
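The submit-then-poll flow described above could be wired up roughly as follows. Because the API is not yet public, everything here is hypothetical: the payload fields (`video_url`, `preset`, `srt_url`), the status values, and the `output_url` field are placeholders for illustration, not VEED's actual schema. The HTTP calls are passed in as functions so the loop can be tested without a network.

```python
import time
from typing import Callable, Optional


def caption_video(
    submit: Callable[[dict], str],        # hypothetical: POSTs a job, returns a job id
    get_status: Callable[[str], dict],    # hypothetical: GETs the job state by id
    video_url: str,
    preset: str,
    srt_url: Optional[str] = None,
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Submit a styling job and poll until a finished video URL is available."""
    payload = {"video_url": video_url, "preset": preset}
    if srt_url is not None:
        # Supplying your own SRT would skip the transcription step.
        payload["srt_url"] = srt_url

    job_id = submit(payload)
    waited = 0.0
    while waited <= timeout_seconds:
        job = get_status(job_id)
        if job["status"] == "completed":
            return job["output_url"]
        if job["status"] == "failed":
            raise RuntimeError(f"render job {job_id} failed")
        sleep(poll_seconds)
        waited += poll_seconds
    raise TimeoutError(f"render job {job_id} did not finish in {timeout_seconds}s")
```

With no webhooks in the MVP, a bounded polling loop like this is the pattern to plan for. Check the published documentation for the real field names once the API launches.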

VEED's video API suite (available now)

Best for: Automated video subtitle generation: styled, burned-in, post-ready, without managing a rendering pipeline.

Current limitations: Preset styles only at launch (no custom fonts or colors). No translation, SRT/VTT export, or webhooks in MVP.

Full pricing comparison: STT and subtitle APIs

| API | Free tier | Batch / PAYG | Subscription | Self-hosted? | Notes |
|---|---|---|---|---|---|
| Deepgram | $200 credit | ~$0.022/min | - | Enterprise | Fastest; voice agent bundle $4.50/hr |
| AssemblyAI | ~$50 credit | ~$0.006/min (~$0.37/hr) | - | No | Best audio intelligence |
| OpenAI Whisper | - | $0.006/min | - | Yes (free) | Cheapest managed; no streaming |
| Speechmatics | 480 min/mo | Contact sales | Contact sales | Enterprise | Best multilingual |
| Google Cloud | 60 min/mo | Tiered | - | No | 100+ languages; complex setup |
| Azure Speech | 5 hrs/mo | Tiered | - | Limited | Enterprise compliance |
| ZapCap (subtitle) | - | $0.10/min ($0.03 transcript + $0.07 render) | - | No | Animated templates |
| Submagic (subtitle) | - | $0.10–$0.15/min | $0.41/min base | No | Business+ tier req'd for API |
| Captions / Mirage (subtitle) | - | $0.15/min | - | No | Animated captions; burn-in only |
| Shotstack (video API) | - | $0.07–$0.40/min | $0.04–$0.20/min | No | Full video API; subtitles as a feature |
| Creatomate (video API) | - | - | $0.33–$0.54/min | No | Subscription only; animated captions |
| VEED Subtitle Styling API (Q2 2026) | TBC | Duration-based | - | No | Full pipeline incl. styling + render |

Source: VEED internal pricing benchmark + official provider pricing pages, April 2026. Prices change often, so verify each rate on the provider's own pricing page. Subtitle API prices include transcription + rendering where applicable.
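To compare these rates at your own volume, a small calculator helps. The sketch below uses the per-minute figures quoted in this article and models only a per-minute rate, an optional monthly base fee, and free minutes. One-off credits, tiered discounts, and resolution multipliers are left out, and the `Plan` structure itself is an illustrative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    per_minute: float          # USD per processed minute
    monthly_base: float = 0.0  # flat subscription fee, if any
    free_minutes: float = 0.0  # recurring free allowance per month


def monthly_cost(plan: Plan, minutes: float) -> float:
    """Estimated monthly bill in USD for a given processing volume."""
    billable = max(0.0, minutes - plan.free_minutes)
    return round(plan.monthly_base + billable * plan.per_minute, 2)


# Rates as quoted in this article; verify before relying on them.
PLANS = [
    Plan("OpenAI Whisper API", 0.006),
    Plan("Deepgram batch", 0.022),
    Plan("ZapCap", 0.10),
    Plan("Submagic (low end)", 0.10, monthly_base=41.0),
    Plan("Captions / Mirage", 0.15),
]

if __name__ == "__main__":
    for plan in PLANS:
        print(f"{plan.name:22s} ${monthly_cost(plan, 1000):8.2f} / 1,000 min")
```

At 1,000 minutes a month, the subscription base fee is a large share of the Submagic bill; at 50,000 minutes it barely registers, which is why the comparison needs your real volume.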

How to choose: decision framework

| If you need… | Choose | Why |
|---|---|---|
| Real-time voice agent or conversational AI | Deepgram | Sub-300ms latency; bundled Voice Agent API |
| Best raw accuracy + audio intelligence | AssemblyAI | ~8.4% WER; richest features; VEED's chosen provider |
| Lowest cost transcription (text out) | OpenAI Whisper | $0.006/min API or free self-hosted |
| Multilingual, diverse accents, on-premise | Speechmatics | 55+ languages; flexible deployment; 480 free min/mo |
| Animated, social-native styled subtitles | ZapCap or Submagic | Best styling; $0.10–$0.15/min PAYG |
| Highest-quality animated caption templates | Captions / Mirage | Premium template quality; $0.15/min |
| Subtitles as part of a wider video workflow | Shotstack or Creatomate | General video API; flexible pipeline |
| Styled subtitles + full video pipeline in one API | VEED Subtitle Styling API (Q2 2026) | Only option covering transcription → style → render + VEED's full suite |

Recap and final thoughts

Here's what to remember:

  • Two different categories: pure STT APIs output text; styled subtitle APIs output video. Know which one your use case actually needs before evaluating pricing.
  • For pure transcription: Deepgram leads on speed for voice agents and live captioning; AssemblyAI leads on accuracy and features; Whisper leads on batch cost but has no real-time streaming.
  • For styled, burned-in subtitles: ZapCap ($0.10/min), Submagic ($0.10–$0.15/min), and Captions/Mirage ($0.15/min) are the current market. All animate well for social video.
  • VEED adds the layer they all miss: VEED's Subtitle Styling API (Q2 2026) handles the full pipeline — transcription, styling with VEED's own preset library, and render — in one call, and integrates with lip sync, background removal, and AI video generation in the same platform.
  • Pricing ranges from free to $0.54/min: calculate on your actual volume, not the headline rate. Add-ons, resolutions, and subscription requirements change the real number significantly.

🔧 Next step: Explore VEED's video API suite now and watch for the Subtitle Styling API launch in Q2 2026 — veed.io/api.

FAQ

What is the best speech-to-text API in 2026?

It depends on your use case. Deepgram Nova-3 leads for real-time voice agents (sub-300ms latency). AssemblyAI Universal-2 leads for accuracy and audio intelligence. OpenAI Whisper leads on cost at $0.006/min. If your goal is styled subtitles burned into video rather than raw transcription, ZapCap, Submagic, and Captions/Mirage are the dedicated subtitle API category, with VEED's Subtitle Styling API (Q2 2026) covering the full pipeline including VEED's own preset library.

What is the cheapest transcription API?

OpenAI Whisper is the cheapest at $0.006/min via the API, or free if self-hosted (GPU compute cost only). Speechmatics includes 480 free minutes per month, the most generous free tier. AssemblyAI is also $0.006/min for batch processing. Styled subtitle APIs are more expensive ($0.10 to $0.15/min) because they include rendering in the price.

What is the difference between a transcription API and a subtitle API?

A transcription API converts speech to text and stops there. A subtitle API handles the full pipeline: transcription, timestamp alignment, subtitle line formatting, visual styling, and burning the subtitles into the video file. The five-step gap between 'getting text' and 'getting a styled video' is where subtitle-focused APIs like ZapCap, Submagic, and Captions/Mirage operate, and where VEED's Subtitle Styling API is entering the market in Q2 2026.

Which subtitle API has the best-looking animations?

Captions/Mirage is generally cited for the highest-quality animated caption templates. ZapCap and Submagic are close behind with strong word-highlight and animated presets well-suited to social video. Shotstack and Creatomate support basic subtitle styling via JSON, flexible but less visually opinionated. VEED's Subtitle Styling API (Q2 2026) will apply VEED's own preset library, the same styles used in VEED's online video editor, refined across millions of creator videos.

Is there a free speech-to-text or subtitle API?

For transcription: Speechmatics includes 480 free minutes per month. Google Cloud gives 60 free minutes. Azure Speech gives 5 free hours. OpenAI Whisper is free to self-host. For styled subtitle APIs, most require PAYG or a subscription with no meaningful free tier. ZapCap, Submagic, and Captions/Mirage are all paid, though most offer a limited free trial for evaluation.

How does VEED's Subtitle Styling API work?

VEED's Subtitle Styling API (launching Q2 2026) accepts a video URL and a style preset, then handles transcription, subtitle formatting, visual styling, and rendering, returning a video file with VEED's styled subtitles burned in. You can also supply your own SRT file to skip transcription. Billing is duration-based per minute of input video. If you already have the other VEED APIs connected (Lip Sync, Background Remover, Fabric 1.0), the subtitle step integrates into the same pipeline. Documentation at veed.io/api.
