Overview

xAI provides two text-to-speech services:
  • XAIHttpTTSService: Batch synthesis via HTTP API. Sends complete text and receives the full audio response.
  • XAITTSService: Streaming synthesis via WebSocket. Streams text incrementally and receives audio chunks as they’re synthesized, reducing latency. Supports word-level timestamps for accurate timing of synthesized speech.
Both support multiple languages and audio encoding formats.

  • xAI TTS API Reference: Complete API reference for all parameters and methods
  • WebSocket Example: Streaming WebSocket example with interruption handling
  • HTTP Example: Batch HTTP example
  • xAI Documentation: Official xAI voice API documentation

Installation

uv add "pipecat-ai[xai]"

Prerequisites

  1. xAI Account: Sign up at xAI
  2. API Key: Generate an API key from your account dashboard (also works with Grok API keys)
Set the following environment variable:
export GROK_API_KEY=your_api_key
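
The usage examples in this page read the key with os.getenv, which returns None when the variable is unset, so a missing key only surfaces later as an authentication failure. A minimal sketch of a fail-fast check follows; the helper name require_api_key is hypothetical and not part of Pipecat or xAI.

```python
import os


def require_api_key(name: str = "GROK_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of Pipecat.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key
```

The returned value can then be passed as api_key when constructing either service.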

Configuration

XAIHttpTTSService

  • api_key (str, required): xAI API key for authentication.
  • base_url (str, default: "https://api.x.ai/v1/tts"): xAI TTS endpoint URL. Override for custom or proxied deployments.
  • sample_rate (int, default: None): Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
  • encoding (str, default: "pcm"): Output audio encoding format. Supported formats: "pcm", "mp3", "wav", "mulaw", "alaw".
  • aiohttp_session (aiohttp.ClientSession, default: None): Optional shared aiohttp session for HTTP requests. If None, the service creates and manages its own session.
  • settings (XAIHttpTTSService.Settings, default: None): Runtime-configurable settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using XAIHttpTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.
Parameter                   Type            Default      Description
model                       str             None         Model identifier. (Inherited from base settings.)
voice                       str             "eve"        Voice identifier. (Inherited from base settings.)
language                    Language | str  Language.EN  Language code. (Inherited from base settings.)
speed                       float           None         Speech speed multiplier from 0.7 to 1.5 (1.0 is normal).
optimize_streaming_latency  int             None         Latency optimization level (0, 1, or 2).
text_normalization          bool            None         Whether to normalize text before synthesis.
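
The ranges given for speed and optimize_streaming_latency can be checked before the values are handed to the service. The following is only a sketch: TunableSettings and validate are hypothetical stand-ins written for this example, not Pipecat classes, and whether the service itself rejects out-of-range values is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TunableSettings:
    """Hypothetical stand-in for the tunable fields described above."""

    speed: Optional[float] = None
    optimize_streaming_latency: Optional[int] = None
    text_normalization: Optional[bool] = None


def validate(settings: TunableSettings) -> list[str]:
    """Return a list of problems; an empty list means all values are in range."""
    problems = []
    # speed is documented as a multiplier from 0.7 to 1.5.
    if settings.speed is not None and not (0.7 <= settings.speed <= 1.5):
        problems.append(f"speed {settings.speed} is outside 0.7 to 1.5")
    # optimize_streaming_latency is documented as 0, 1, or 2.
    if settings.optimize_streaming_latency not in (None, 0, 1, 2):
        problems.append(
            f"optimize_streaming_latency {settings.optimize_streaming_latency} "
            "is not 0, 1, or 2"
        )
    return problems
```

Leaving a field as None skips its check, mirroring the None defaults of these settings.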

XAITTSService

  • api_key (str, required): xAI API key for authentication.
  • base_url (str, default: "wss://api.x.ai/v1/tts"): xAI TTS WebSocket endpoint URL. Override for custom or proxied deployments.
  • sample_rate (int, default: None): Output audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.
  • codec (str, default: "pcm"): Output audio codec. Supported codecs: "pcm", "wav", "mulaw", "alaw". Defaults to "pcm" so emitted TTSAudioRawFrame objects need no decoding downstream.
  • settings (XAITTSService.Settings, default: None): Runtime-configurable settings. Includes all settings from XAIHttpTTSService plus with_timestamps for word-level timing. Changing voice, language, or tunable parameters at runtime reconnects the WebSocket with new query parameters.
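
The reconnect on settings changes follows from the settings being carried as query parameters on the WebSocket URL: a different value means a different URL. A rough sketch of that idea is shown below. The function build_ws_url and the query parameter names are assumptions made for illustration; the names XAITTSService actually sends may differ.

```python
from urllib.parse import urlencode


def build_ws_url(base_url: str, **params) -> str:
    """Append non-None settings to the endpoint as query parameters.

    Illustrative only: the real parameter names used by the service
    are not documented here.
    """
    query = {key: value for key, value in params.items() if value is not None}
    return f"{base_url}?{urlencode(query)}" if query else base_url


old_url = build_ws_url("wss://api.x.ai/v1/tts", voice="eve", language="en")
new_url = build_ws_url("wss://api.x.ai/v1/tts", voice="eve", language="fr")

# A changed URL cannot be applied to an open socket, so a new connection is needed.
needs_reconnect = old_url != new_url
```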

WebSocket Settings

Runtime-configurable settings for XAITTSService using XAITTSService.Settings(...). Includes all HTTP service settings plus:
Parameter        Type  Default  Description
with_timestamps  bool  True     Whether to request character timings. When enabled, the service converts them into per-word TTSTextFrame objects.
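
To make the character-to-word conversion concrete, here is a minimal sketch of one way to group per-character timings into per-word timings. The input shape (a list of character and start-time pairs, in seconds) and the function chars_to_words are assumptions for this example; they do not reflect the actual xAI message format or the service's internal implementation.

```python
def chars_to_words(chars: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Group (character, start_seconds) pairs into (word, start_seconds) pairs.

    A word starts at the time of its first non-whitespace character.
    """
    words = []
    current = ""
    start = 0.0
    for ch, t in chars:
        if ch.isspace():
            # Whitespace closes the word in progress, if any.
            if current:
                words.append((current, start))
                current = ""
        else:
            if not current:
                start = t
            current += ch
    # Flush the final word when the input does not end in whitespace.
    if current:
        words.append((current, start))
    return words


timings = [("H", 0.00), ("i", 0.05), (" ", 0.10), ("y", 0.12), ("o", 0.18), ("u", 0.24)]
word_timings = chars_to_words(timings)
```

Each resulting pair corresponds to what a per-word frame would carry: the word text and the time at which it begins.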

Supported Languages

xAI TTS supports 20 languages, counting regional variants separately. Use the Language enum from pipecat.transcriptions.language:
  • Arabic (Egyptian, Saudi, UAE): Language.AR, Language.AR_EG, Language.AR_SA, Language.AR_AE
  • Bengali: Language.BN
  • Chinese: Language.ZH
  • English: Language.EN
  • French: Language.FR
  • German: Language.DE
  • Hindi: Language.HI
  • Indonesian: Language.ID
  • Italian: Language.IT
  • Japanese: Language.JA
  • Korean: Language.KO
  • Portuguese (Brazil, Portugal): Language.PT, Language.PT_BR, Language.PT_PT
  • Russian: Language.RU
  • Spanish (Spain, Mexico): Language.ES, Language.ES_ES, Language.ES_MX
  • Turkish: Language.TR
  • Vietnamese: Language.VI

Usage

WebSocket Streaming (XAITTSService)

Basic Setup

import os
from pipecat.services.xai.tts import XAITTSService

tts = XAITTSService(
    api_key=os.getenv("GROK_API_KEY"),
    settings=XAITTSService.Settings(
        voice="eve",
    ),
)

With Custom Language

from pipecat.transcriptions.language import Language

tts = XAITTSService(
    api_key=os.getenv("GROK_API_KEY"),
    settings=XAITTSService.Settings(
        voice="eve",
        language=Language.ES,
    ),
)

With Custom Sample Rate and Codec

tts = XAITTSService(
    api_key=os.getenv("GROK_API_KEY"),
    sample_rate=24000,
    codec="wav",
    settings=XAITTSService.Settings(
        voice="eve",
    ),
)

With Tunable Parameters

tts = XAITTSService(
    api_key=os.getenv("GROK_API_KEY"),
    settings=XAITTSService.Settings(
        voice="eve",
        speed=1.2,  # Faster speech
        optimize_streaming_latency=2,  # Maximum latency optimization
        text_normalization=True,  # Enable text normalization
        with_timestamps=True,  # Enable word timestamps (default)
    ),
)

HTTP Batch (XAIHttpTTSService)

Basic Setup

import os
from pipecat.services.xai.tts import XAIHttpTTSService

tts = XAIHttpTTSService(
    api_key=os.getenv("GROK_API_KEY"),
    settings=XAIHttpTTSService.Settings(
        voice="eve",
    ),
)

With Custom Encoding

tts = XAIHttpTTSService(
    api_key=os.getenv("GROK_API_KEY"),
    encoding="mp3",
    settings=XAIHttpTTSService.Settings(
        voice="eve",
    ),
)

With Shared HTTP Session

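import os

import aiohttp
from pipecat.services.xai.tts import XAIHttpTTSService


async def main():
    # The caller owns the shared session and closes it when the block exits.
    async with aiohttp.ClientSession() as session:
        tts = XAIHttpTTSService(
            api_key=os.getenv("GROK_API_KEY"),
            aiohttp_session=session,
            settings=XAIHttpTTSService.Settings(
                voice="eve",
            ),
        )
        # Build and run the pipeline here, while the session is still open.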

Updating Settings at Runtime

Voice settings can be changed mid-conversation using TTSUpdateSettingsFrame. This works for both services:
from pipecat.frames.frames import TTSUpdateSettingsFrame
from pipecat.services.xai.tts import XAITTSSettings
from pipecat.transcriptions.language import Language

await worker.queue_frame(
    TTSUpdateSettingsFrame(
        delta=XAITTSSettings(
            language=Language.FR,
        )
    )
)
Note: For XAITTSService, changing voice, language, or tunable settings reconnects the WebSocket with updated query parameters.

Notes

  • Service choice:
    • Use XAITTSService (WebSocket) for lower-latency streaming synthesis, where audio begins playing before the full utterance has been synthesized.
    • Use XAIHttpTTSService (HTTP) for simpler batch synthesis or when WebSocket connections are not available.
  • Default audio format: Both services default to raw PCM output, which matches Pipecat’s downstream expectations without extra decoding.
  • Encoding/codec options: When using non-PCM formats (wav, mulaw, alaw, or mp3, which is listed for XAIHttpTTSService only), ensure your audio pipeline can handle the selected format.
  • Session management:
    • XAIHttpTTSService: If you don’t provide an aiohttp_session, the service creates and manages its own session lifecycle automatically.
    • XAITTSService: WebSocket connection is managed automatically; settings changes that affect URL parameters (voice, language, or tunable settings) trigger a reconnection.
  • Interruption handling: XAITTSService handles barge-in by sending a text.clear message over the existing WebSocket connection, avoiding the overhead of reconnecting on every interruption.
  • Word timestamps: When with_timestamps is enabled (the default), xAI’s per-character timings are converted into per-word TTSTextFrame objects with accurate pts values. Note that xAI delivers timestamps in batches that are decoupled from the audio stream (a batch can cover several seconds of speech), so word frames are emitted in bursts. Consumers should schedule off pts rather than arrival time.
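
The advice to schedule off pts rather than arrival time can be sketched with a small buffer that holds word events and releases them as playback advances. This is only an illustration: WordEvent and PtsScheduler are hypothetical classes standing in for per-word TTSTextFrame handling, and the example assumes pts is an integer count of nanoseconds.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class WordEvent:
    """Hypothetical stand-in for a per-word frame: only pts and text."""

    pts: int  # assumed to be nanoseconds
    text: str = field(compare=False)


class PtsScheduler:
    """Buffer word events and release them in pts order as playback advances."""

    def __init__(self) -> None:
        self._heap: list[WordEvent] = []

    def push(self, event: WordEvent) -> None:
        heapq.heappush(self._heap, event)

    def due(self, playback_pts: int) -> list[str]:
        """Pop every word whose pts is at or before the playback position."""
        ready = []
        while self._heap and self._heap[0].pts <= playback_pts:
            ready.append(heapq.heappop(self._heap).text)
        return ready


scheduler = PtsScheduler()
# A whole burst arrives at once, even though it covers about a second of speech.
for pts, text in [(0, "Hello"), (400_000_000, "there"), (900_000_000, "world")]:
    scheduler.push(WordEvent(pts, text))

shown_at_half_second = scheduler.due(500_000_000)
```

Calling due with the current playback position on each audio tick displays words in step with the audio, regardless of how the timestamp batches arrived.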