Browse and compare AI models across providers, modalities, and use cases.
Generate natural-sounding multi-speaker dialogues and audio. Well suited to expressive outputs, storytelling, games, animations, and interactive media.
CassetteAI’s model generates a 30-second sample in under 2 seconds and a full 3-minute track in under 10 seconds. At 44.1 kHz stereo audio, expect a level of professional consistency with no breaks, no squeaks, and no random interruptions in your creations.
To specify which model you want to use, set the model parameter in your API requests.
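As a minimal sketch of what setting the model parameter could look like, the Python snippet below builds a JSON request body with a `model` field. The endpoint URL, the model identifier, the `input` field name, and the bearer-token header are all placeholder assumptions for illustration; this page does not specify them, so check the provider's API reference for the real values. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical values: the real endpoint and model identifiers are not given on this page.
API_URL = "https://api.example.com/v1/audio/generate"
MODEL_ID = "example-provider/example-tts-model"


def build_request(model: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that selects a model via the `model` field."""
    # The `input` field name is an assumption; only `model` is named in the text above.
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request(MODEL_ID, "Hello from the catalog.", api_key="YOUR_API_KEY")
payload = json.loads(request.data)
print(payload["model"])
```

Switching models would then only require passing a different identifier as the `model` argument, leaving the rest of the request unchanged.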
Highly conversational output with natural pacing and intonation; better handling of filler words and casual speech; instant voice cloning that better preserves accents and speaker styles.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ (residual vector quantization) audio codes from text and audio inputs.
DiffRhythm is a fast model for transforming lyrics into full songs. It can generate a full song in less than 30 seconds.
Isolate audio tracks using ElevenLabs' advanced audio isolation technology.
Generate sound effects using ElevenLabs' advanced sound effects model.
Generate multilingual text-to-speech audio using ElevenLabs TTS Multilingual v2.
Generate high-speed text-to-speech audio using ElevenLabs TTS Turbo v2.5.
Kokoro is a lightweight text-to-speech model that delivers quality comparable to larger models while being significantly faster and more cost-efficient.
A natural and expressive Brazilian Portuguese text-to-speech model optimized for clarity and fluency.
A high-quality British English text-to-speech model offering natural and expressive voice synthesis.
An expressive and natural French text-to-speech model for both European and Canadian French.
A fast and expressive Hindi text-to-speech model with clear pronunciation and accurate intonation.
A high-quality Italian text-to-speech model delivering smooth and expressive speech synthesis.
A fast and natural-sounding Japanese text-to-speech model optimized for smooth pronunciation.
A highly efficient Mandarin Chinese text-to-speech model that captures natural tones and prosody.
A natural-sounding Spanish text-to-speech model optimized for Latin American and European Spanish.