NVIDIA Nemotron Speech

Overview

NVIDIA Nemotron Speech provides three TTS service implementations:

NvidiaTTSService — High-quality TTS using both locally deployed and cloud-based NVIDIA TTS models. Supports multilingual synthesis, configurable quality settings, per-sentence and stitched synthesis modes, and zero-shot voice cloning.
NvidiaSageMakerHTTPTTSService — Single HTTP invocation to an AWS SageMaker endpoint, streaming raw PCM audio back for each text segment.
NvidiaSageMakerTTSService — Persistent HTTP/2 bidi-stream to an AWS SageMaker endpoint with full interruption support via InterruptibleTTSService.

NVIDIA Nemotron Speech TTS API Reference: Pipecat’s API methods for NVIDIA Nemotron Speech TTS integration
Example Implementation: Complete example with Nemotron Speech NIM
NVIDIA TTS NIM Documentation: Official NVIDIA TTS NIM documentation
NVIDIA Developer Portal: Access API keys and Nemotron Speech services

Installation

To use NVIDIA Nemotron Speech services, install the required dependencies:

uv add "pipecat-ai[nvidia]"

Prerequisites

NVIDIA Nemotron Speech Setup

Before using Nemotron Speech TTS services, you need:

NVIDIA Developer Account: Sign up at NVIDIA Developer Portal
API Key: Generate an NVIDIA API key for Nemotron Speech services (required for cloud endpoint)
Nemotron Speech Access: Ensure access to NVIDIA Nemotron Speech TTS services

For local deployments, see the NVIDIA TTS NIM documentation.

Required Environment Variables

NVIDIA_API_KEY: Your NVIDIA API key for authentication (required for cloud endpoint, not needed for local deployments)

Configuration

NvidiaTTSService

api_key

str

default:"None"

NVIDIA API key for authentication. Required when using the cloud endpoint. Not needed for local deployments.

server

str

default:"grpc.nvcf.nvidia.com:443"

gRPC server endpoint. Defaults to NVIDIA’s cloud endpoint. For local deployments, pass the local address (e.g. localhost:50051).

voice_id

str

default:"Magpie-Multilingual.EN-US.Aria"

deprecated

Voice model identifier. Deprecated in v0.0.105. Use settings=NvidiaTTSService.Settings(...) instead.

sample_rate

int

default:"None"

Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.

model_function_map

dict

Dictionary containing function_id and model_name for the TTS model.

use_ssl

bool

default:"True"

Whether to use SSL for the gRPC connection. Defaults to True for the NVIDIA cloud endpoint. Set to False for local deployments.

custom_dictionary

dict

default:"None"

Custom pronunciation dictionary mapping words (graphemes) to IPA phonetic representations (phonemes), e.g. {"NVIDIA": "ɛn.vɪ.diː.ʌ"}. See NVIDIA TTS NIM phoneme support for the list of supported IPA phonemes.

zero_shot_audio_prompt_file

str | os.PathLike[str]

default:"None"

Optional audio prompt file for Magpie zero-shot voice cloning. NVIDIA recommends a 16-bit mono WAV prompt, sample rate 22.05 kHz or higher, and duration 3 to 10 seconds. Access to NVIDIA’s hosted zero-shot models requires approval through NVIDIA Riva TTS Zero-Shot Models.

audio_prompt_encoding

AudioEncoding

default:"AudioEncoding.ENCODING_UNSPECIFIED"

Audio encoding for zero_shot_audio_prompt_file. Use this when the server expects a specific prompt encoding for Magpie zero-shot voice cloning.

encoding

AudioEncoding

default:"AudioEncoding.LINEAR_PCM"

Output audio encoding format. Defaults to AudioEncoding.LINEAR_PCM.

params

InputParams

default:"None"

deprecated

Runtime-configurable synthesis settings. See Settings below. Deprecated in v0.0.105. Use settings=NvidiaTTSService.Settings(...) instead.

settings

NvidiaTTSService.Settings

default:"None"

Runtime-configurable settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using NvidiaTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `None` | Model identifier. (Inherited.) |
| `voice` | `str` | `None` | Voice identifier. (Inherited.) |
| `language` | `Language \| str` | `None` | Language for synthesis. (Inherited.) |
| `quality` | `int` | `NOT_GIVEN` | Audio quality setting (0-100). For Magpie zero-shot, NVIDIA expects values in the range 1 to 40. |
| `synthesis_mode` | `NvidiaTTSSynthesisMode` | `NvidiaTTSSynthesisMode.PER_SENTENCE` | Whether to synthesize one sentence per request or stitch multiple sentences in one stream. |
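
The quality ranges in the table can be checked before a value is passed to Settings. The following is a minimal sketch, not part of Pipecat: the helper name validate_quality and its zero_shot flag are hypothetical, and the ranges are the ones documented above.

```python
from typing import Optional


def validate_quality(quality: Optional[int], zero_shot: bool = False) -> Optional[int]:
    """Return quality unchanged if it lies within the documented range.

    Hypothetical helper (not part of Pipecat). None means "leave the
    setting unset" and is passed through untouched.
    """
    if quality is None:
        return None
    # Documented ranges: 0-100 in general, 1-40 for Magpie zero-shot.
    low, high = (1, 40) if zero_shot else (0, 100)
    if not low <= quality <= high:
        raise ValueError(f"quality must be in [{low}, {high}], got {quality}")
    return quality
```

A checked value could then be passed as quality=validate_quality(40, zero_shot=True) when building the settings object.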

Usage

Basic Setup

import os

from pipecat.services.nvidia import NvidiaTTSService

# The API key is read from the environment (required for the cloud endpoint).
tts = NvidiaTTSService(
    api_key=os.getenv("NVIDIA_API_KEY"),
)

With Custom Voice and Quality

import os

from pipecat.services.nvidia import NvidiaTTSService
from pipecat.transcriptions.language import Language

tts = NvidiaTTSService(
    api_key=os.getenv("NVIDIA_API_KEY"),
    # The model and function ID are fixed at construction time.
    model_function_map={
        "function_id": "877104f7-e885-42b9-8de8-f6e4c6303969",
        "model_name": "magpie-tts-multilingual",
    },
    # Runtime-configurable settings (can be updated mid-conversation).
    settings=NvidiaTTSService.Settings(
        voice="Magpie-Multilingual.EN-US.Aria",
        language=Language.EN_US,
        quality=40,
    ),
)

The InputParams / params= pattern is deprecated as of v0.0.105. Use Settings / settings= instead. See the Service Settings guide for migration details.
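
The cloud and local connection options described under Configuration differ in three arguments: api_key, server, and use_ssl. The sketch below shows one way to pick between them. The helper nvidia_tts_kwargs is hypothetical (not part of Pipecat) and only builds a dictionary of constructor arguments; the server address and SSL values come from the parameter descriptions above.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Default cloud endpoint, as listed for the `server` parameter.
CLOUD_SERVER = "grpc.nvcf.nvidia.com:443"


def nvidia_tts_kwargs(
    env: Mapping[str, str], local_server: Optional[str] = None
) -> Dict[str, Any]:
    """Build connection kwargs for NvidiaTTSService (hypothetical helper)."""
    if local_server is not None:
        # Local deployment: no API key is needed and SSL is turned off.
        return {"server": local_server, "use_ssl": False}
    api_key = env.get("NVIDIA_API_KEY")
    if not api_key:
        raise RuntimeError("NVIDIA_API_KEY is required for the cloud endpoint")
    return {"api_key": api_key, "server": CLOUD_SERVER, "use_ssl": True}


# Example: arguments for a local NIM listening on localhost:50051.
local_kwargs = nvidia_tts_kwargs(os.environ, local_server="localhost:50051")
print(local_kwargs)
```

The resulting dictionary would be unpacked into the constructor, for example NvidiaTTSService(**local_kwargs).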

Notes

gRPC-based: NVIDIA Nemotron Speech uses gRPC (not HTTP or WebSocket) for communication with the TTS service.
Synthesis modes: The service supports two synthesis modes via the synthesis_mode setting:
- PER_SENTENCE (default): Opens a separate SynthesizeOnline call for each sentence. Compatible with all NVIDIA TTS NIMs, including Chatterbox, Magpie multilingual, and Magpie zero-shot.
- STITCHED: Reuses one SynthesizeOnline stream across multiple sentences within the same LLM response for improved multi-sentence synthesis quality. Only use with models that support cross-sentence stitching, such as Magpie multilingual and Magpie zero-shot v1.7.0 or later.
Zero-shot voice cloning: Magpie zero-shot models support voice cloning via the zero_shot_audio_prompt_file parameter. NVIDIA recommends a 16-bit mono WAV prompt (22.05 kHz or higher, 3-10 seconds duration). Access to hosted zero-shot models requires approval.
Runtime settings updates: Voice, language, quality, and synthesis mode can be updated mid-conversation with TTSUpdateSettingsFrame. New values take effect on the next synthesis turn, not for the current turn’s in-flight requests.
Model cannot be changed after initialization: The model and function ID must be set during construction via model_function_map. Calling set_model() after initialization will log a warning and have no effect.
SSL enabled by default: The service connects to NVIDIA’s cloud endpoint with SSL. Set use_ssl=False only for local or custom Nemotron Speech deployments.
Metrics generation: This service supports metric generation via can_generate_metrics(). Metrics are automatically stopped when an audio context is interrupted.
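
The zero-shot prompt recommendations above (16-bit mono WAV, 22.05 kHz or higher, 3 to 10 seconds) can be checked locally before a file is passed as zero_shot_audio_prompt_file. This is a minimal sketch using Python's standard wave module. The helper check_zero_shot_prompt is hypothetical and not part of Pipecat, and it handles only uncompressed PCM WAV files.

```python
import wave
from typing import List


def check_zero_shot_prompt(path: str) -> List[str]:
    """Return a list of problems with a WAV prompt file.

    An empty list means the file matches the recommendations.
    Hypothetical helper, not part of Pipecat.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("prompt should be mono")
        if wav.getsampwidth() != 2:  # 2 bytes per sample = 16-bit
            problems.append("prompt should be 16-bit")
        rate = wav.getframerate()
        if rate < 22050:
            problems.append("sample rate should be 22.05 kHz or higher")
        duration = wav.getnframes() / rate
        if not 3.0 <= duration <= 10.0:
            problems.append("duration should be 3 to 10 seconds")
    return problems
```

Returning a list instead of raising on the first failure lets a caller report every mismatch at once.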

NvidiaSageMakerHTTPTTSService

NVIDIA Magpie TTS service that calls a SageMaker HTTP endpoint for each text segment. Sends JSON to the endpoint’s /invocations path and streams raw PCM audio back.

Configuration

endpoint_name

str

required

Name of the deployed SageMaker endpoint.

region

str

default:"us-west-2"

AWS region where the endpoint is deployed.

sample_rate

int

default:"None"

Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.

settings

NvidiaSageMakerHTTPTTSService.Settings

default:"None"

Runtime-configurable settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using NvidiaSageMakerHTTPTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `magpie` | Model identifier. (Inherited.) |
| `voice` | `str` | `Magpie-Multilingual.EN-US.Aria` | Voice identifier. (Inherited.) |
| `language` | `Language \| str` | `en-US` | BCP-47 language code for synthesis. |

Usage

import os

from pipecat.services.nvidia.sagemaker.tts import NvidiaSageMakerHTTPTTSService

# Endpoint name and region are read from the environment.
tts = NvidiaSageMakerHTTPTTSService(
    endpoint_name=os.getenv("SAGEMAKER_MAGPIE_ENDPOINT_NAME"),
    region=os.getenv("AWS_REGION", "us-west-2"),
    settings=NvidiaSageMakerHTTPTTSService.Settings(
        voice="Magpie-Multilingual.EN-US.Aria",
        language="en-US",
    ),
)

Notes

AWS SageMaker deployment required: This service requires a deployed SageMaker endpoint running NVIDIA Magpie TTS NIM. See the deployment example for setup instructions.
AWS credentials: Requires AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables for SageMaker authentication.
Environment variables: The usage example reads the endpoint name from SAGEMAKER_MAGPIE_ENDPOINT_NAME and, optionally, the region from AWS_REGION.
HTTP-based: Each text segment triggers a new HTTP POST request to the SageMaker endpoint.
Metrics support: This service supports metrics generation (can_generate_metrics() returns True).
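
Because the endpoint returns raw PCM, saved audio has no header and most players will not open it directly. For debugging, the chunks can be wrapped in a WAV container. The sketch below uses only the standard library. The helper pcm_chunks_to_wav is hypothetical, and it assumes 16-bit mono samples, which the text above does not state; the sample rate must match the one the service was configured with.

```python
import io
import wave
from typing import Iterable


def pcm_chunks_to_wav(chunks: Iterable[bytes], sample_rate: int) -> bytes:
    """Wrap raw PCM chunks in a WAV container (hypothetical debugging helper).

    Assumes 16-bit mono samples; adjust if the endpoint emits another format.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # assumption: mono
        wav.setsampwidth(2)   # assumption: 16-bit samples
        wav.setframerate(sample_rate)
        for chunk in chunks:
            wav.writeframes(chunk)
    # The header is finalized when the wave writer closes; the buffer stays open.
    return buffer.getvalue()
```

The returned bytes can be written to a .wav file and opened in any audio player to inspect what the endpoint produced.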

NvidiaSageMakerTTSService

NVIDIA Magpie TTS service using SageMaker bidirectional streaming. Maintains a persistent HTTP/2 bidi-stream connection for the lifetime of the pipeline with full interruption support.

Configuration

endpoint_name

str

required

Name of the deployed SageMaker endpoint.

region

str

default:"us-west-2"

AWS region where the endpoint is deployed.

sample_rate

int

default:"None"

Audio sample rate in Hz. When None, uses the pipeline’s configured sample rate.

settings

NvidiaSageMakerTTSService.Settings

default:"None"

Runtime-configurable settings. See Settings below.

Settings

Runtime-configurable settings passed via the settings constructor argument using NvidiaSageMakerTTSService.Settings(...). These can be updated mid-conversation with TTSUpdateSettingsFrame. See Service Settings for details.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | `str` | `magpie` | Model identifier. (Inherited.) |
| `voice` | `str` | `Magpie-Multilingual.EN-US.Aria` | Voice identifier. (Inherited.) |
| `language` | `Language \| str` | `en-US` | BCP-47 language code for synthesis. |

Usage

import os

from pipecat.services.nvidia.sagemaker.tts import NvidiaSageMakerTTSService

# Endpoint name and region are read from the environment.
tts = NvidiaSageMakerTTSService(
    endpoint_name=os.getenv("SAGEMAKER_MAGPIE_ENDPOINT_NAME"),
    region=os.getenv("AWS_REGION", "us-west-2"),
    settings=NvidiaSageMakerTTSService.Settings(
        voice="Magpie-Multilingual.EN-US.Aria",
        language="en-US",
    ),
)

Notes

AWS SageMaker deployment required: This service requires a deployed SageMaker endpoint running NVIDIA Magpie TTS NIM. See the deployment example for setup instructions.
AWS credentials: Requires AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables for SageMaker authentication.
Environment variables: The usage example reads the endpoint name from SAGEMAKER_MAGPIE_ENDPOINT_NAME and, optionally, the region from AWS_REGION.
Persistent connection: Maintains a single HTTP/2 bidi-stream session for the pipeline’s lifetime, reconnecting automatically on error.
Interruption support: Extends InterruptibleTTSService for proper handling of user interruptions.
Metrics support: This service supports metrics generation (can_generate_metrics() returns True).
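
Since both SageMaker services rely on the environment variables named in the notes above, a startup check can surface a missing value before the pipeline starts. This is a minimal sketch: the helper missing_sagemaker_vars is hypothetical and not part of Pipecat, and it checks only that each variable is present and non-empty.

```python
import os
from typing import List, Mapping

# Variables named in the notes for the SageMaker services.
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "SAGEMAKER_MAGPIE_ENDPOINT_NAME",
)


def missing_sagemaker_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_sagemaker_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing the environment as a mapping keeps the check easy to exercise with a plain dictionary in tests.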
