Overview
xAI provides two text-to-speech services:
- XAIHttpTTSService: Batch synthesis via HTTP API. Sends complete text and receives the full audio response.
- XAITTSService: Streaming synthesis via WebSocket. Streams text incrementally and receives audio chunks as they’re synthesized, reducing latency. Supports word-level timestamps for accurate timing of synthesized speech.
- xAI TTS API Reference: Complete API reference for all parameters and methods
- WebSocket Example: Streaming WebSocket example with interruption handling
- HTTP Example: Batch HTTP example
- xAI Documentation: Official xAI voice API documentation
Installation
Prerequisites
- xAI Account: Sign up at xAI
- API Key: Generate an API key from your account dashboard (also works with Grok API keys)
Configuration
XAIHttpTTSService
- xAI API key for authentication.
- xAI TTS endpoint URL. Override for custom or proxied deployments.
- Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
- Output audio encoding format. Supported formats: "pcm", "mp3", "wav", "mulaw", "alaw".
- Optional shared aiohttp session for HTTP requests. If None, the service creates and manages its own session.
Settings
Runtime-configurable settings passed via the settings constructor argument using XAIHttpTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | None | Model identifier. (Inherited from base settings.) |
| voice | str | "eve" | Voice identifier. (Inherited from base settings.) |
| language | Language \| str | Language.EN | Language code. (Inherited from base settings.) |
| speed | float | None | Speech speed multiplier from 0.7 to 1.5 (1.0 is normal). |
| optimize_streaming_latency | int | None | Latency optimization level (0, 1, or 2). |
| text_normalization | bool | None | Whether to normalize text before synthesis. |
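The ranges in the table above can be checked before settings are handed to the service. The following is a minimal sketch, not the Pipecat class: SketchTTSSettings and its problems() method are hypothetical names introduced for illustration, and the source does not say whether the real service rejects out-of-range values on its own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchTTSSettings:
    """Hypothetical stand-in mirroring the settings table; not the Pipecat class."""

    voice: str = "eve"
    speed: Optional[float] = None
    optimize_streaming_latency: Optional[int] = None
    text_normalization: Optional[bool] = None

    def problems(self) -> List[str]:
        """Return a human-readable message for each value outside its documented range."""
        issues: List[str] = []
        # speed is documented as a multiplier from 0.7 to 1.5 (1.0 is normal).
        if self.speed is not None and not (0.7 <= self.speed <= 1.5):
            issues.append(f"speed {self.speed} is outside 0.7 to 1.5")
        # optimize_streaming_latency is documented as one of 0, 1, or 2.
        level = self.optimize_streaming_latency
        if level is not None and level not in (0, 1, 2):
            issues.append(f"optimize_streaming_latency {level} is not 0, 1, or 2")
        return issues
```

A value of None is treated as "not set" and never reported, which matches the None defaults in the table.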
XAITTSService
- xAI API key for authentication.
- xAI TTS WebSocket endpoint URL. Override for custom or proxied deployments.
- Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
- Output audio codec. Supported codecs: "pcm", "wav", "mulaw", "alaw". Defaults to "pcm" so emitted TTSAudioRawFrame objects need no decoding downstream.
- Runtime-configurable settings. Includes all settings from XAIHttpTTSService plus with_timestamps for word-level timing. Changing voice, language, or tunable parameters at runtime reconnects the WebSocket with new query parameters.
WebSocket Settings
Runtime-configurable settings for XAITTSService using XAITTSService.Settings(...). Includes all HTTP service settings plus:
| Parameter | Type | Default | Description |
|---|---|---|---|
| with_timestamps | bool | True | Whether to request character timings. When enabled, the service converts them into per-word TTSTextFrame objects. |
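Because WebSocket settings travel as URL query parameters, a runtime change only matters to the connection if it alters the resulting URL. The sketch below illustrates that idea under stated assumptions: the query parameter names simply reuse the setting names, the endpoint wss://example.invalid/tts is a placeholder, and build_ws_url and needs_reconnect are hypothetical helpers, not part of Pipecat or the xAI API.

```python
from typing import Any, Dict
from urllib.parse import urlencode


def _encode(value: Any) -> str:
    """Render a setting as a query-string value (booleans as lowercase text)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_ws_url(base_url: str, settings: Dict[str, Any]) -> str:
    """Append every setting that is not None to base_url as a query parameter."""
    # Sorting keeps the URL stable regardless of dict insertion order.
    params = {k: _encode(v) for k, v in sorted(settings.items()) if v is not None}
    return f"{base_url}?{urlencode(params)}" if params else base_url


def needs_reconnect(base_url: str, old: Dict[str, Any], new: Dict[str, Any]) -> bool:
    """A reconnect is needed only when the settings change the connection URL."""
    return build_ws_url(base_url, old) != build_ws_url(base_url, new)
```

Comparing URLs rather than settings objects means that setting a value to the one already in effect does not force a reconnect.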
Supported Languages
xAI TTS supports 20 languages. Use the Language enum from pipecat.transcriptions.language:
- Arabic (Egyptian, Saudi, UAE): Language.AR, Language.AR_EG, Language.AR_SA, Language.AR_AE
- Bengali: Language.BN
- Chinese: Language.ZH
- English: Language.EN
- French: Language.FR
- German: Language.DE
- Hindi: Language.HI
- Indonesian: Language.ID
- Italian: Language.IT
- Japanese: Language.JA
- Korean: Language.KO
- Portuguese (Brazil, Portugal): Language.PT, Language.PT_BR, Language.PT_PT
- Russian: Language.RU
- Spanish (Spain, Mexico): Language.ES, Language.ES_ES, Language.ES_MX
- Turkish: Language.TR
- Vietnamese: Language.VI
Usage
WebSocket Streaming (XAITTSService)
Basic Setup
With Custom Language
With Custom Sample Rate and Codec
With Tunable Parameters
HTTP Batch (XAIHttpTTSService)
Basic Setup
With Custom Encoding
With Shared HTTP Session
Updating Settings at Runtime
Voice settings can be changed mid-conversation using TTSUpdateSettingsFrame. This works for both services.
For XAITTSService, changing voice or language settings reconnects the WebSocket with updated query parameters.
Notes
- Service choice:
  - Use XAITTSService (WebSocket) for lower-latency streaming synthesis where audio begins playing before the full utterance finishes.
  - Use XAIHttpTTSService (HTTP) for simpler batch synthesis or when WebSocket connections are not available.
- Default audio format: Both services default to raw PCM output, which matches Pipecat’s downstream expectations without extra decoding.
- Encoding/codec options: When using non-PCM formats (mp3, wav, mulaw, alaw), ensure your audio pipeline can handle the selected format.
- Session management:
  - XAIHttpTTSService: If you don’t provide an aiohttp_session, the service creates and manages its own session lifecycle automatically.
  - XAITTSService: WebSocket connection is managed automatically; settings changes that affect URL parameters (voice, language, or tunable settings) trigger a reconnection.
- Interruption handling: XAITTSService handles barge-in by sending a text.clear message over the existing WebSocket connection, avoiding the overhead of reconnecting on every interruption.
- Word timestamps: When with_timestamps is enabled (the default), xAI’s per-character timings are converted into per-word TTSTextFrame objects with accurate pts values. Note that xAI delivers timestamps in batches that are decoupled from the audio stream (a batch can cover several seconds of speech), so word frames are emitted in bursts. Consumers should schedule off pts rather than arrival time.
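The word-timestamp note can be illustrated with a small sketch of the two ideas involved: grouping per-character timings into words, and scheduling each word from its pts instead of from the moment it arrives. The input shape (a list of (character, start time in seconds) pairs) and the helper names chars_to_words and delay_until are assumptions made for illustration; the actual xAI message format and Pipecat's internal conversion are not shown in this document.

```python
from typing import List, Optional, Tuple


def chars_to_words(chars: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Group (character, start_seconds) pairs into (word, word_start_seconds) pairs.

    Whitespace ends the current word; a word starts at its first character's time.
    """
    words: List[Tuple[str, float]] = []
    current: List[str] = []
    start: Optional[float] = None
    for ch, t in chars:
        if ch.isspace():
            if current:
                words.append(("".join(current), start))
                current, start = [], None
        else:
            if not current:
                start = t  # first character of a new word sets its start time
            current.append(ch)
    if current:  # flush a trailing word that was not followed by whitespace
        words.append(("".join(current), start))
    return words


def delay_until(pts: float, playback_position: float) -> float:
    """Seconds to wait before showing a word, measured against audio playback.

    Words that arrive late in a burst (pts already passed) get a delay of zero.
    """
    return max(0.0, pts - playback_position)
```

This sketch treats each input list as complete. A real consumer would also need to carry a partial word over when a batch boundary falls in the middle of a word, which is not handled here.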