Today, we're excited to release the first step in our mission of building real-time multimodal intelligence: Sonic, a blazing-fast (🚀 135ms model latency), lifelike generative voice model and API.
At Cartesia, our mission is to build real-time multimodal intelligence for every device 💻. We believe this future requires fundamentally new, efficient architectures for intelligence.
Sonic is built on a new state space model architecture we’ve developed for efficiently modeling high-resolution signals like audio and video. On speech, a parameter-matched and optimized Cartesia Sonic model trained on the same data as a widely used Transformer architecture improves audio quality (20% lower perplexity, 2x lower word error rate, 1 point higher on the NISQA evaluation). At inference, it achieves lower latency (1.5x lower time-to-first-audio), faster inference speed (2x lower real-time factor), and 4x higher throughput.
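The inference metrics above use standard speech-synthesis definitions: real-time factor (RTF) is the wall-clock time spent synthesizing divided by the duration of audio produced, so a 2x lower RTF means the same audio is generated in half the time. A minimal sketch of the computation, with illustrative numbers rather than measured Sonic benchmarks:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return the RTF for a synthesis run.

    RTF < 1 means the model generates audio faster than real time;
    lower is better.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds

# Illustrative example: 10 s of audio generated in 2 s of wall-clock time.
baseline_rtf = real_time_factor(2.0, 10.0)  # 0.2

# A model with "2x lower RTF" produces the same audio in half the time.
faster_rtf = real_time_factor(1.0, 10.0)    # 0.1
```

Time-to-first-audio is measured separately: it is the delay from request to the first audio chunk, which matters most for streaming, conversational applications.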
The playground features a diverse voice library for applications across customer support, entertainment, and content creation, with support for instant voice cloning and voice design controls (speed, emotion), all of which can also be used through the API.
Read more in our release blog: https://lnkd.in/gcGxQd-B
Check us out on Product Hunt: https://lnkd.in/gA9h64yR
Try Sonic: https://play.cartesia.ai
We're hiring across all roles! If this excites you, drop us a note at [email protected]