Browse and compare AI models across providers, modalities, and use cases.
Generate natural-sounding multi-speaker dialogues and audio. Perfect for expressive outputs, storytelling, games, animations, and interactive media.
CassetteAI’s model generates a 30-second sample in under 2 seconds and a full 3-minute track in under 10 seconds. Output is 44.1 kHz stereo audio with professional consistency: no breaks, no squeaks, and no random interruptions in your creations.
Stable, production-ready model, recommended for most users. Offers reliable performance with well-tested features.
Experimental model with highly conversational output, natural pacing, better filler words, and instant voice cloning. Higher latency than Aurora.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ (residual vector quantization) audio codes from text and audio inputs.
DiffRhythm is a fast model for transforming lyrics into full songs. It can generate a complete song in less than 30 seconds.
Isolate audio tracks using ElevenLabs' advanced audio isolation technology.
Generate sound effects using ElevenLabs' advanced sound effects model.
Generate multilingual text-to-speech audio using ElevenLabs TTS Multilingual v2.
Generate high-speed text-to-speech audio using ElevenLabs TTS Turbo v2.5.
Kokoro is a lightweight text-to-speech model that delivers quality comparable to larger models while being significantly faster and more cost-efficient.
A natural and expressive Brazilian Portuguese text-to-speech model optimized for clarity and fluency.
A high-quality British English text-to-speech model offering natural and expressive voice synthesis.
An expressive and natural French text-to-speech model for both European and Canadian French.
A fast and expressive Hindi text-to-speech model with clear pronunciation and accurate intonation.
A high-quality Italian text-to-speech model delivering smooth and expressive speech synthesis.
A fast and natural-sounding Japanese text-to-speech model optimized for smooth pronunciation.
A highly efficient Mandarin Chinese text-to-speech model that captures natural tones and prosody.
A natural-sounding Spanish text-to-speech model optimized for Latin American and European Spanish.