Getting text from audio is a solved problem. Every major cloud provider has a transcription API, pricing is low, and integration takes an afternoon. The problem is that your users don't want text: they want a video with subtitles on it.
Between getting a transcript and getting styled, branded, burned-in subtitles on a video file, there are five more engineering problems: timestamp alignment, line break formatting, visual styling, rendering, and output. Most teams underestimate this until they're in production.
This comparison covers the full market: pure speech-to-text APIs (Deepgram, AssemblyAI, Whisper, Speechmatics) and styled subtitle APIs that output finished video (ZapCap, Submagic, Captions/Mirage, Shotstack, Creatomate, and VEED's upcoming Subtitle Styling API). Both categories, real pricing, honest tradeoffs.
Key takeaways:
- There are two distinct categories: pure speech-to-text APIs that output text, and subtitle APIs that output styled video.
- Best pure STT options: Deepgram (speed), AssemblyAI (accuracy and audio intelligence), Whisper (lowest cost), Speechmatics (multilingual).
- Best styled subtitle APIs: ZapCap, Submagic, and Captions/Mirage all deliver animated captions with burned-in output.
- Pure STT APIs stop at the transcript. Getting to a styled video requires five more steps: timing, formatting, styling, burn-in, and rendering.
- VEED's Subtitle Styling API (planned for Q2 2026) is designed to handle the full chain in one call, using VEED's own preset library.
Speech-to-text API vs. subtitle API: what's the difference?
If you're building a voice agent or live captioning system, a pure STT API is the right fit. If you're automating subtitles on video content, a subtitle API handles the parts that matter to your users and saves you weeks of rendering pipeline work.
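To make the gap concrete, here is a minimal sketch of just two of the steps a pure STT API leaves to you: timestamp alignment and line-break formatting. It assumes your transcription provider returns word-level timestamps; the `Word` shape, the 32-character line limit, and the 0.6-second pause threshold are illustrative assumptions, not any provider's schema or a subtitle standard.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def group_into_cues(words, max_chars=32, max_gap=0.6):
    """Start a new cue when the line would get too long or the speaker pauses."""
    cues, current = [], []
    for word in words:
        if current:
            line_len = len(" ".join(w.text for w in current)) + 1 + len(word.text)
            paused = word.start - current[-1].end > max_gap
            if line_len > max_chars or paused:
                cues.append(current)
                current = []
        current.append(word)
    if current:
        cues.append(current)
    return cues


def to_srt(words, max_chars=32, max_gap=0.6) -> str:
    """Render word timestamps as numbered SRT cues."""
    blocks = []
    for index, cue in enumerate(group_into_cues(words, max_chars, max_gap), start=1):
        text = " ".join(w.text for w in cue)
        timing = f"{srt_timestamp(cue[0].start)} --> {srt_timestamp(cue[-1].end)}"
        blocks.append(f"{index}\n{timing}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

This only gets you to a plain SRT file. Styling, animation, and rendering onto the video are still ahead of you, which is the part a subtitle API takes over.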
Best speech-to-text APIs (transcription only)
Deepgram: fastest, best for voice agents
‼️ Verdict: Sub-300ms latency with Nova-3. Best for real-time voice applications. Bundled Voice Agent API at $4.50/hr removes cost surprises.
Deepgram Nova-3 delivers real-time streaming transcription under 300ms, which is essential for voice agents where delays above 500ms feel unnatural. The Voice Agent API bundles STT, LLM orchestration, and TTS at a flat $4.50/hour, making cost predictable at scale. For batch video transcription, the async API handles large files efficiently. Text out only: if you need subtitles on a video, Deepgram is step one of six.
Best for: Real-time voice agents, live captioning, conversational AI.
Pricing: Batch ~$0.022/min; Voice Agent API $4.50/hr bundled. Verify at deepgram.com/pricing.
AssemblyAI: best accuracy and audio intelligence
‼️ Verdict: Universal-2 leads on accuracy with a word error rate (WER) of ~8.4% and 30% fewer hallucinations than Whisper Large-v3. Best for accuracy-critical workflows and audio intelligence features.
AssemblyAI Universal-2 delivers best-in-class accuracy on real-world audio, not just clean benchmarks. Beyond transcription, it adds sentiment analysis, PII redaction, topic detection, and speaker diarization in the same API call, making it the richest platform for applications where text quality matters downstream. AssemblyAI is also the transcription provider powering VEED's own online video editor, a practical signal of production-grade reliability on real video content.
Best for: Accuracy-critical content, audio intelligence, multi-speaker video.
Pricing: Batch ~$0.006/min ($0.37/hr); streaming $0.45/hr. Verify at assemblyai.com/pricing.
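If you want to sanity-check a WER figure on your own audio, the metric is simple to compute: word-level edit distance divided by the number of words in a human reference transcript. The sketch below is a generic implementation, not AssemblyAI's evaluation method. Published benchmarks usually also normalize punctuation and numbers before scoring, which this version does not do beyond lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the table at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

A WER of 8.4% means roughly 8 word errors per 100 reference words. Run the same function over a sample of your own content with each provider before trusting any headline number.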
OpenAI Whisper: cheapest and open source
‼️ Verdict: Top accuracy in clean-audio benchmarks. $0.006/min via API, free if self-hosted. No real-time streaming.
Whisper holds top position on clean-audio accuracy benchmarks and supports 99+ languages without fine-tuning. At $0.006/min it's among the cheapest managed transcription APIs in this comparison, and the only one you can run for free on your own hardware. The limitations are well-established: no native real-time streaming (batching workarounds add latency), and self-hosting the largest model requires GPU infrastructure. For cost-optimized batch transcription of video content where speed isn't critical, Whisper is the default.
Best for: Cost-optimized batch transcription, multilingual, open source.
Pricing: $0.006/min via OpenAI API; free self-hosted (GPU cost). Verify at openai.com/api/pricing.
Speechmatics: best for multilingual and on-premise
‼️ Verdict: 55+ languages with strong accent handling. Best deployment flexibility: cloud, on-premise, edge. 480 free minutes per month.
Speechmatics differentiates on language breadth and deployment flexibility. With 55+ languages and strong performance on diverse accents (particularly British English and regional dialects), it's the go-to for content teams targeting non-English markets. On-premise and edge deployment options make it the natural choice for regulated industries with data residency requirements. The 480 free minutes per month is the most generous trial tier in the STT category.
Best for: Multilingual content, diverse accents, on-premise or regulated deployment.
Pricing: 480 free min/mo; enterprise pricing via contact. Verify at speechmatics.com/pricing.
Best subtitle APIs: styled video output
These APIs accept a video file and return a video with subtitles burned in. They handle transcription, formatting, styling, and rendering: the full chain that pure STT APIs leave to you. If your end goal is a post-ready video, start here.
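For comparison, this is roughly what the final burn-in step looks like if you build it yourself. The sketch below builds an ffmpeg command using its `subtitles` filter, which needs an ffmpeg build with libass. The file names, font, and size are placeholder assumptions, and paths containing characters such as colons, commas, or quotes would need extra escaping that is not handled here.

```python
def build_burn_in_command(video_in, srt_path, video_out, font="Arial", font_size=24):
    """Build an ffmpeg command that hard-codes an SRT file into the video frames."""
    # force_style overrides libass style fields; the values here are placeholders.
    style = f"FontName={font},FontSize={font_size}"
    video_filter = f"subtitles={srt_path}:force_style='{style}'"
    return [
        "ffmpeg", "-y",
        "-i", video_in,
        "-vf", video_filter,
        "-c:a", "copy",  # audio is copied as-is; the video stream is re-encoded
        video_out,
    ]
```

Pass the returned list to `subprocess.run(cmd, check=True)` to execute it. This gives you static, single-style captions. Word-level highlights, animations, and emoji overlays are where a do-it-yourself pipeline gets expensive to build and maintain.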
ZapCap: animated templates, strong styling
Verdict: Strong competitive option. Animated templates, word-highlight, emojis, custom fonts. Pay-as-you-go (PAYG) at $0.10/min (split: $0.03 transcription + $0.07 rendering).
ZapCap is one of the most popular subtitle-focused APIs and a direct competitor to VEED's upcoming Subtitle Styling API. It supports animated caption templates, word-highlight (karaoke-style), emoji overlays, and custom fonts: the style features that drive engagement on social video. Pricing is transparent: $0.03/min for transcription plus $0.07/min for rendering, totaling $0.10/min PAYG, with a 2x rate for 4K output.
Best for: Developers who need animated, social-ready subtitles with strong style options.
Pricing: $0.10/min PAYG ($0.03 transcript + $0.07 render); 2x rate for 4K. Verify at ZapCap's pricing page.
Submagic: animated captions with B-roll and silence removal
Verdict: Richest feature set in the subtitle API category: B-roll generation, silence removal, keyword highlights, emojis. Higher price on subscription; competitive PAYG add-on.
Submagic goes beyond subtitles into broader content automation: animated captions, keyword highlights, emoji overlays, B-roll integration, and silence removal in the same pipeline. The API is available as a PAYG credit add-on to Business+ plans ($41/month base), with per-minute pricing from $0.10 to $0.15. The subscription requirement adds friction for developers looking for pure PAYG access, but the feature set is the broadest in this category.
Best for: Social media content automation requiring captions plus B-roll and audio cleanup.
Pricing: Business+ plan required ($41/mo); PAYG API credits $0.10 to $0.15/min. Verify at Submagic's pricing page.
Captions / Mirage: best-looking animated caption templates
Verdict: Known for the highest-quality animated caption templates. $0.15/min PAYG. Strong styling quality.
Captions (also known as Mirage for the API product) has built a strong reputation for caption template quality. The animations, transitions, and visual polish are consistently cited as best-in-category. At $0.15/min PAYG it's priced at a slight premium to ZapCap, justified by template quality. The API applies styled captions to video files and returns a rendered output.
Best for: Premium animated subtitles where visual quality is the priority.
Pricing: $0.15/min PAYG. Verify at Captions' pricing page.
Shotstack and Creatomate: general video APIs with subtitle support
Verdict: Not subtitle-specialists. Subtitles are a feature within a broader video editing API. More flexible but less opinionated on caption styling.
Shotstack and Creatomate are general-purpose video editing APIs where subtitles are one capability among many. Both support SRT/VTT burn-in, custom fonts, colors, and position via JSON configuration. Creatomate supports word-by-word animated captions and custom templates via its template editor. Neither has the animated social-native presets of ZapCap or Submagic, but their broader scope makes them better fits for teams who need subtitle generation as part of a wider video production pipeline.
Best for: Teams needing subtitle generation as part of a full video editing workflow.
Pricing: Shotstack: PAYG $0.07 to $0.40/min; subscription $0.04 to $0.20/min. Creatomate: subscription only, $0.33 to $0.54/min. Verify on their pricing pages.
VEED's Subtitle Styling API: the full pipeline option
Coming Q2 2026: VEED's Subtitle Styling API is in development. Details reflect the confirmed product spec. The API is not yet publicly available.
Every API above, whether STT or subtitle-focused, represents part of the workflow. VEED's Subtitle Styling API, launching Q2 2026, is built to handle the complete chain in a single call.
Versus pure STT APIs (Deepgram, AssemblyAI, Whisper): VEED doesn't stop at the transcript. It applies visual styling, renders the subtitles into the video file, and returns a finished, post-ready output. No rendering pipeline to build, no style system to maintain.
Versus subtitle-focused APIs (ZapCap, Submagic, Captions/Mirage): VEED's style presets are the same ones used in VEED's own editor, refined across millions of videos by real creators. The quality bar is set by VEED's own product, not a generic template library. And because VEED handles the full video pipeline (lip sync, background removal, AI generation, and, from Q2 2026, subtitles), teams can automate a complete localization and production workflow through one API.
How it works
1. Send a video URL and select a style preset
2. VEED transcribes the audio using enterprise-grade speech recognition
3. The Subtitle Styling API applies the preset: timing, formatting, styling, animations
4. VEED's render engine burns the subtitles into the video
5. Poll for the result and receive a finished, styled video
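Because the API is not yet public, there is no documented endpoint or response schema to show. The polling in step 5 follows a familiar pattern, though, and a generic sketch looks like the one below. Everything specific here is an assumption: `fetch_status` stands in for whatever call returns the job state, and the `"completed"` and `"failed"` status values are hypothetical.

```python
import time
from typing import Callable


def wait_for_render(
    fetch_status: Callable[[], dict],
    timeout_s: float = 600.0,
    initial_delay_s: float = 2.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports a terminal state, backing off between checks."""
    delay = initial_delay_s
    waited = 0.0
    while True:
        job = fetch_status()
        # "completed" and "failed" are hypothetical state names, not a documented schema.
        if job.get("status") == "completed":
            return job
        if job.get("status") == "failed":
            raise RuntimeError(f"render failed: {job.get('error', 'unknown error')}")
        if waited + delay > timeout_s:
            raise TimeoutError(f"render not finished after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a cap
```

Doubling the delay up to a cap keeps short videos responsive without hammering the service on long renders. Since webhooks are not part of the MVP, some loop of this shape will be needed on the client side.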
Teams with their own SRT file can skip transcription entirely: useful when auto-transcription output needs correction, or when translation has already been handled separately.
VEED's video API suite (available now)
- VEED API overview: full documentation and API access
- Lip Sync API: sync translated audio to video in 35+ languages
- Background Remover API: remove backgrounds at scale
- Fabric 1.0 API: generate AI video from a text prompt
- Subtitle Styling API: launching Q2 2026.
Best for: Automated video subtitle generation: styled, burned-in, post-ready, without managing a rendering pipeline.
Current limitations: Preset styles only at launch (no custom fonts or colors). No translation, SRT/VTT export, or webhooks in the MVP.
Full pricing comparison: STT and subtitle APIs
Source: VEED internal pricing benchmark + official provider pricing pages, April 2026. Subtitle API prices include transcription + rendering where applicable.
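Headline per-minute rates hide subscription minimums and resolution multipliers, so it helps to model your own monthly volume. The sketch below uses the rates quoted in this article, taking the low end of any range. The `Plan` structure is illustrative, and the default 4K multiplier of 1.0 for providers other than ZapCap is an assumption, since their 4K pricing is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    per_minute: float           # USD per video minute
    monthly_base: float = 0.0   # USD subscription required before any usage
    multiplier_4k: float = 1.0  # rate multiplier for 4K output (1.0 = assumed none)


# Rates as quoted in this article; verify against each provider's pricing page.
PLANS = [
    Plan("Whisper API (transcript only)", 0.006),
    Plan("ZapCap", 0.10, multiplier_4k=2.0),
    Plan("Submagic", 0.10, monthly_base=41.0),
    Plan("Captions/Mirage", 0.15),
]


def monthly_cost(plan: Plan, minutes: float, share_4k: float = 0.0) -> float:
    """Estimate one month's bill for `minutes` of video, `share_4k` of it in 4K."""
    standard = minutes * (1 - share_4k) * plan.per_minute
    ultra_hd = minutes * share_4k * plan.per_minute * plan.multiplier_4k
    return round(plan.monthly_base + standard + ultra_hd, 2)


if __name__ == "__main__":
    for plan in PLANS:
        print(f"{plan.name}: ${monthly_cost(plan, minutes=200):.2f} for 200 min")
```

At 200 minutes a month, Submagic's $0.10/min rate plus the $41 base comes to $61, about double the $30 for Captions/Mirage at $0.15/min. The ranking changes as volume grows, which is why the calculation matters more than the headline rate.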
How to choose: decision framework
Recap and final thoughts
Here's what to remember:
- Two different categories: pure STT APIs output text; styled subtitle APIs output video. Know which one your use case actually needs before evaluating pricing.
- For pure transcription: Deepgram leads on speed and is the pick for voice agents and live captioning; AssemblyAI leads on accuracy and audio intelligence; Whisper leads on cost for batch work.
- For styled, burned-in subtitles: ZapCap ($0.10/min), Submagic ($0.10 to $0.15/min), and Captions/Mirage ($0.15/min) are the main subtitle-focused options. All animate well for social video.
- VEED covers the whole chain on one platform: VEED's Subtitle Styling API (Q2 2026) is designed to handle transcription, styling with VEED's own preset library, and rendering in one call, and it sits alongside lip sync, background removal, and AI video generation.
- Pricing ranges from free to $0.54/min: calculate on your actual volume, not the headline rate. Add-ons, resolutions, and subscription requirements change the real number significantly.
🔧 Next step: Explore VEED's video API suite now at veed.io/api and watch for the Subtitle Styling API launch in Q2 2026.



