Hume Octave Text-to-Speech API
REST API for synthesizing expressive speech using Octave. Supports streamed JSON/file and standard JSON/file responses, plus voice conversion endpoints.
openapi: 3.1.0
info:
title: Text-to-Speech (TTS)
version: 1.0.0
paths:
/v0/tts/stream/json:
post:
operationId: synthesize-json-streaming
summary: Text-to-Speech (Streamed JSON)
description: >-
Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated
dynamically. Optionally, additional context can be included to influence the speech's style and prosody.
The response is a stream of JSON objects including audio encoded in base64.
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: Successful Response
content:
text/event-stream:
schema:
$ref: '#/components/schemas/TtsOutput'
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/OctaveBodyArgsStream'
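# --- Example (illustrative; not part of the upstream spec) ---
# A minimal request body for POST /v0/tts/stream/json, assuming a saved custom
# voice named "My Narrator" (hypothetical). Send it with the X-Hume-Api-Key
# header; the response is a text/event-stream of TtsOutput objects whose
# `audio` fields carry base64-encoded chunks.
#
#   {
#     "utterances": [
#       {
#         "text": "Welcome back! Let's pick up where we left off.",
#         "voice": { "name": "My Narrator", "provider": "CUSTOM_VOICE" }
#       }
#     ],
#     "format": { "type": "mp3" },
#     "instant_mode": true
#   }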
/v0/tts/stream/file:
post:
operationId: synthesize-file-streaming
summary: Text-to-Speech (Streamed File)
description: >-
Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated
dynamically. Optionally, additional context can be included to influence the speech's style and prosody.
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: OK
content:
application/octet-stream:
schema:
type: string
format: binary
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/OctaveBodyArgsStream'
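# --- Example (illustrative; voice name is a placeholder) ---
# /v0/tts/stream/file accepts the same OctaveBodyArgsStream body as
# /v0/tts/stream/json; only the response differs: raw audio bytes are streamed
# back as application/octet-stream. With `strip_headers: true`, the
# concatenated chunks form a single playable file:
#
#   {
#     "utterances": [
#       { "text": "Chapter one.", "voice": { "name": "My Narrator" } }
#     ],
#     "format": { "type": "wav" },
#     "strip_headers": true
#   }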
/v0/tts:
post:
operationId: synthesize-json
summary: Text-to-Speech (JSON)
description: >-
Synthesizes one or more input texts into speech using the specified voice. If no voice is provided, a novel
voice will be generated dynamically. Optionally, additional context can be included to influence the speech's
style and prosody.
The response includes the base64-encoded audio and metadata in JSON format.
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: Successful Response
content:
application/json:
schema:
$ref: '#/components/schemas/OctaveResponse'
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/OctaveBodyArgs'
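# --- Example (illustrative; response layout is an assumption) ---
# POST /v0/tts synthesizes all utterances and returns JSON. Here a prior
# generation is passed as `context` to keep prosody consistent across
# requests; the generation ID is a placeholder.
#
#   Request:
#   {
#     "utterances": [ { "text": "And that concludes today's episode." } ],
#     "context": { "generation_id": "<previous-generation-id>" },
#     "num_generations": 1
#   }
#
#   Response (OctaveResponse, abridged sketch):
#   { "generations": [ { "generation_id": "<uuid>", "audio": "<base64>" } ] }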
/v0/tts/file:
post:
operationId: synthesize-file
summary: Text-to-Speech (File)
description: >-
Synthesizes one or more input texts into speech using the specified voice. If no voice is provided, a novel
voice will be generated dynamically. Optionally, additional context can be included to influence the speech's
style and prosody.
The response contains the generated audio file in the requested format.
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: OK
content:
application/octet-stream:
schema:
type: string
format: binary
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
application/json:
schema:
$ref: '#/components/schemas/OctaveBodyArgs'
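# --- Example (illustrative; voice name is a placeholder) ---
# POST /v0/tts/file takes the same OctaveBodyArgs body as /v0/tts but returns
# the finished audio as binary (application/octet-stream), ready to write
# straight to disk. Requesting WAV output from a Voice Library voice:
#
#   {
#     "utterances": [
#       { "text": "Hello, world.", "voice": { "name": "Ava Song", "provider": "HUME_AI" } }
#     ],
#     "format": { "type": "wav" }
#   }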
/v0/tts/voice_conversion/file:
post:
operationId: convert-voice-file
summary: Voice Conversion (Streamed File)
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: Successful Response
content:
application/octet-stream:
schema:
type: string
format: binary
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
multipart/form-data:
schema:
type: object
properties:
strip_headers:
type: boolean
description: >-
If enabled, the audio for all the chunks of a generation, once concatenated together, will
constitute a single audio file. Otherwise, if disabled, each chunk's audio will be its own audio
file, each with its own headers (if applicable).
audio:
type: string
format: binary
description: >-
Audio file containing speech to be converted to the target voice. Supported formats include `MP3`,
`WAV`, `M4A`, and `OGG`.
context:
oneOf:
- $ref: >-
#/components/schemas/V0TtsVoiceConversionFilePostRequestBodyContentMultipartFormDataSchemaContext
- type: 'null'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple
requests. These will not be converted to speech output.
voice:
$ref: '#/components/schemas/VoiceRef'
format:
$ref: '#/components/schemas/Format'
description: Specifies the output audio file format.
include_timestamp_types:
type: array
items:
$ref: '#/components/schemas/TimestampType'
description: >-
The set of timestamp types to include in the response. When used in multipart/form-data, specify
each value using bracket notation:
`include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`. Only supported for Octave 2
requests.
required:
- audio
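# --- Example (illustrative) ---
# Voice conversion takes multipart/form-data. Only `audio` is required; the
# bracket notation for include_timestamp_types is documented above, while the
# encoding shown for the nested `voice` object is an assumption.
#
#   audio=@speech.wav
#   voice[name]=My Narrator
#   include_timestamp_types[0]=word
#   strip_headers=true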
/v0/tts/voice_conversion/json:
post:
operationId: convert-voice-json
summary: Voice Conversion (Streamed JSON)
parameters:
- name: X-Hume-Api-Key
in: header
required: true
schema:
type: string
responses:
'200':
description: Successful Response
content:
text/event-stream:
schema:
$ref: '#/components/schemas/TtsOutput'
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/HTTPValidationError'
requestBody:
content:
multipart/form-data:
schema:
type: object
properties:
strip_headers:
type: boolean
description: >-
If enabled, the audio for all the chunks of a generation, once concatenated together, will
constitute a single audio file. Otherwise, if disabled, each chunk's audio will be its own audio
file, each with its own headers (if applicable).
audio:
type: string
format: binary
description: >-
Audio file containing speech to be converted to the target voice. Supported formats include `MP3`,
`WAV`, `M4A`, and `OGG`.
context:
oneOf:
- $ref: >-
#/components/schemas/V0TtsVoiceConversionJsonPostRequestBodyContentMultipartFormDataSchemaContext
- type: 'null'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple
requests. These will not be converted to speech output.
voice:
$ref: '#/components/schemas/VoiceRef'
format:
$ref: '#/components/schemas/Format'
description: Specifies the output audio file format.
include_timestamp_types:
type: array
items:
$ref: '#/components/schemas/TimestampType'
description: >-
The set of timestamp types to include in the response. When used in multipart/form-data, specify
each value using bracket notation:
`include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`. Only supported for Octave 2
requests.
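# --- Example (illustrative) ---
# The JSON variant accepts the same multipart fields as
# /v0/tts/voice_conversion/file but streams TtsOutput events
# (text/event-stream) instead of raw audio, which is useful when timestamps
# are requested alongside the converted audio:
#
#   audio=@speech.wav
#   include_timestamp_types[0]=word
#   include_timestamp_types[1]=phoneme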
servers:
- url: https://api.hume.ai
components:
schemas:
ContextGenerationId:
type: object
properties:
generation_id:
type: string
format: uuid4
description: >-
The ID of a prior TTS generation to use as context for generating consistent speech style and prosody across
multiple requests. Including context may increase audio generation times.
required:
- generation_id
title: ContextGenerationId
VoiceProvider:
type: string
enum:
- HUME_AI
- CUSTOM_VOICE
title: VoiceProvider
VoiceId:
type: object
properties:
id:
type: string
description: The unique ID associated with the **Voice**.
provider:
$ref: '#/components/schemas/VoiceProvider'
description: >-
Specifies the source provider associated with the chosen voice.
- **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library),
containing a variety of preset, shared voices.
- **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account.
If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's
**Voice Library**, you must explicitly set the provider to `HUME_AI`.
Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are
private and accessible only via requests authenticated with your API key.
required:
- id
title: VoiceId
VoiceName:
type: object
properties:
name:
type: string
description: The name of a **Voice**.
provider:
$ref: '#/components/schemas/VoiceProvider'
description: >-
Specifies the source provider associated with the chosen voice.
- **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library),
containing a variety of preset, shared voices.
- **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account.
If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's
**Voice Library**, you must explicitly set the provider to `HUME_AI`.
Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are
private and accessible only via requests authenticated with your API key.
required:
- name
title: VoiceName
VoiceRef:
oneOf:
- $ref: '#/components/schemas/VoiceId'
- $ref: '#/components/schemas/VoiceName'
title: VoiceRef
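# --- Example (illustrative; both values are placeholders) ---
# A VoiceRef is either a VoiceId or a VoiceName:
#
#   { "id": "<voice-uuid>", "provider": "CUSTOM_VOICE" }   # by ID
#   { "name": "Ava Song", "provider": "HUME_AI" }          # by name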
Utterance:
type: object
properties:
description:
type:
- string
- 'null'
description: >-
Natural language instructions describing how the synthesized speech should sound, including but not limited
to tone, intonation, pacing, and accent.
**This field behaves differently depending on whether a voice is specified**:
- **Voice specified**: the description will serve as acting directions for delivery. Keep directions
concise—100 characters or fewer—for best results. See our guide on [acting
instructions](/docs/text-to-speech-tts/acting-instructions).
- **Voice not specified**: the description will serve as a voice prompt for generating a voice. See our
[prompting guide](/docs/text-to-speech-tts/prompting) for design tips.
speed:
type: number
format: double
default: 1
description: >-
Speed multiplier for the synthesized speech. Values below 0.75 or above 1.5 may cause
instability in the generated output.
text:
type: string
description: The input text to be synthesized into speech.
trailing_silence:
type: number
format: double
default: 0
description: Duration of trailing silence (in seconds) to add to this utterance.
voice:
oneOf:
- $ref: '#/components/schemas/VoiceRef'
- type: 'null'
description: >-
The `name` or `id` associated with a **Voice** from the **Voice Library** to be used as the speaker for this
and all subsequent `utterances`, until the `voice` field is updated again.
See our [voices guide](/docs/text-to-speech-tts/voices) for more details on generating and specifying **Voices**.
required:
- text
title: Utterance
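# --- Example (illustrative; voice name is a placeholder) ---
# An Utterance pairing text with acting directions. Because a voice is
# specified, `description` is read as concise delivery directions rather than
# a voice-design prompt:
#
#   {
#     "text": "I can't believe it actually worked!",
#     "description": "excited, breathless, rising intonation",
#     "speed": 1.1,
#     "trailing_silence": 0.5,
#     "voice": { "name": "My Narrator" }
#   }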
ContextUtterances:
type: object
properties:
utterances:
type: array
items:
$ref: '#/components/schemas/Utterance'
required:
- utterances
title: ContextUtterances
OctaveBodyArgsStreamContext:
oneOf:
- $ref: '#/components/schemas/ContextGenerationId'
- $ref: '#/components/schemas/ContextUtterances'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple requests. These
will not be converted to speech output.
title: OctaveBodyArgsStreamContext
OctaveBodyArgsStreamFormat:
oneOf:
- type: object
properties:
type:
type: string
enum:
- mp3
description: 'Discriminator value: mp3'
required:
- type
description: Mp3Format variant
- type: object
properties:
type:
type: string
enum:
- pcm
description: 'Discriminator value: pcm'
required:
- type
description: PcmFormat variant
- type: object
properties:
type:
type: string
enum:
- wav
description: 'Discriminator value: wav'
required:
- type
description: WavFormat variant
discriminator:
propertyName: type
description: Specifies the output audio file format.
title: OctaveBodyArgsStreamFormat
TimestampType:
type: string
enum:
- word
- phoneme
title: TimestampType
OctaveVersion:
type: string
enum:
- '1'
- '2'
description: |-
Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume
automatically routes the request to the most appropriate model. Setting a specific version ensures stable and
repeatable behavior across requests.
Use `2` to opt into the latest Octave capabilities. When you specify version `2`, you must also provide a
`voice`. Requests that set `version: 2` without a voice will be rejected.
For a comparison of Octave versions, see the
[Octave versions](/docs/text-to-speech-tts/overview#octave-versions) section in the TTS overview.
title: OctaveVersion
OctaveBodyArgsStream:
type: object
properties:
context:
oneOf:
- $ref: '#/components/schemas/OctaveBodyArgsStreamContext'
- type: 'null'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple requests.
These will not be converted to speech output.
format:
$ref: '#/components/schemas/OctaveBodyArgsStreamFormat'
description: Specifies the output audio file format.
include_timestamp_types:
type: array
items:
$ref: '#/components/schemas/TimestampType'
description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
instant_mode:
type: boolean
default: true
description: >-
Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is
received. Recommended for real-time applications requiring immediate audio playback. For further details,
see our documentation on [instant
mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode).
- A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be
specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.
- Instant mode is only supported for streaming endpoints (e.g.,
[/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming),
[/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).
- Ensure only a single generation is requested
([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations)
must be `1` or omitted).
num_generations:
type: integer
default: 1
description: >-
Number of audio generations to produce from the input utterances.
Using `num_generations` enables faster processing than issuing multiple sequential requests. Additionally,
specifying `num_generations` allows prosody continuation across all generations without repeating context,
ensuring each generation sounds slightly different while maintaining contextual consistency.
split_utterances:
type: boolean
default: true
description: >-
Controls how audio output is segmented in the response.
- When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.
- When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances
and output snippets.
This setting affects how the `snippets` array is structured in the response, which may be important for
applications that need to track the relationship between input text and generated audio segments. When
setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
strip_headers:
type: boolean
default: false
description: >-
If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a
single audio file. Otherwise, if disabled, each chunk's audio will be its own audio file, each with its own
headers (if applicable).
utterances:
type: array
items:
$ref: '#/components/schemas/Utterance'
description: >-
A list of **Utterances** to be converted to speech output.
An **Utterance** is a unit of input for [Octave](/docs/text-to-speech-tts/overview), and includes input
`text`, an optional `description` to serve as the prompt for how the speech should be delivered, an optional
`voice` specification, and additional controls to guide delivery for `speed` and `trailing_silence`.
version:
$ref: '#/components/schemas/OctaveVersion'
description: >-
Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume
automatically routes the request to the most appropriate model. Setting a specific version ensures stable
and repeatable behavior across requests.
Use `2` to opt into the latest Octave capabilities. When you specify version `2`, you must also provide a
`voice`. Requests that set `version: 2` without a voice will be rejected.
For a comparison of Octave versions, see the [Octave
versions](/docs/text-to-speech-tts/overview#octave-versions) section in the TTS overview.
required:
- utterances
title: OctaveBodyArgsStream
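# --- Example (illustrative; voice name is a placeholder) ---
# A streaming body opting into Octave version 2 with instant mode. Note the
# constraints encoded above: both version 2 and instant mode require a voice,
# and num_generations must be 1 (or omitted) when instant mode is enabled.
#
#   {
#     "version": "2",
#     "instant_mode": true,
#     "include_timestamp_types": ["word"],
#     "utterances": [
#       { "text": "Breaking news, just in.", "voice": { "name": "My Narrator" } }
#     ]
#   }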
FormatType:
type: string
enum:
- mp3
- pcm
- wav
title: FormatType
MillisecondInterval:
type: object
properties:
begin:
type: integer
description: Start time of the interval in milliseconds.
end:
type: integer
description: End time of the interval in milliseconds.
required:
- begin
- end
title: MillisecondInterval
Timestamp-Output:
type: object
properties:
text:
type: string
description: The word or phoneme text that the timestamp corresponds to.
time:
$ref: '#/components/schemas/MillisecondInterval'
description: The start and end timestamps for the word or phoneme in milliseconds.
type:
$ref: '#/components/schemas/TimestampType'
required:
- text
- time
- type
title: Timestamp-Output
Snippet-Output:
type: object
properties:
audio:
type: string
description: The segmented audio output in the requested format, encoded as a base64 string.
generation_id:
type: string
format: uuid4
description: The generation ID this snippet corresponds to.
id:
type: string
format: uuid4
description: A unique ID associated with this **Snippet**.
text:
type: string
description: The text for this **Snippet**.
timestamps:
type: array
items:
$ref: '#/components/schemas/Timestamp-Output'
description: >-
A list of word or phoneme level timestamps for the generated audio. Timestamps are only returned for Octave
2 requests.
transcribed_text:
type:
- string
- 'null'
description: The transcribed text of the generated audio. It is only present if `instant_mode` is set to `false`.
utterance_index:
type:
- integer
- 'null'
description: The index of the utterance in the request this snippet corresponds to.
required:
- audio
- generation_id
- id
- text
- timestamps
- transcribed_text
- utterance_index
title: Snippet-Output
TtsOutput:
oneOf:
- type: object
properties:
type:
type: string
enum:
- audio
description: 'Discriminator value: audio'
audio:
type: string
description: The generated audio output chunk in the requested format.
audio_format:
$ref: '#/components/schemas/FormatType'
description: The generated audio output format.
chunk_index:
type: integer
description: The index of the audio chunk in the snippet.
generation_id:
type: string
format: uuid4
description: The generation ID of the parent snippet that this chunk corresponds to.
is_last_chunk:
type: boolean
description: Whether or not this is the last chunk streamed back from the decoder for one input snippet.
request_id:
type: string
description: ID of the initiating request.
snippet:
$ref: '#/components/schemas/Snippet-Output'
snippet_id:
type: string
format: uuid4
description: The ID of the parent snippet that this chunk corresponds to.
text:
type: string
description: The text of the parent snippet that this chunk corresponds to.
transcribed_text:
type:
- string
- 'null'
description: >-
The transcribed text of the generated audio of the parent snippet that this chunk corresponds to. It is
only present if `instant_mode` is set to `false`.
utterance_index:
type:
- integer
- 'null'
description: The index of the utterance in the request that the parent snippet of this chunk corresponds to.
required:
- type
- audio
- audio_format
- chunk_index
- generation_id
- is_last_chunk
- request_id
- snippet_id
- text
- transcribed_text
- utterance_index
description: SnippetAudioChunk variant
- type: object
properties:
type:
type: string
enum:
- timestamp
description: 'Discriminator value: timestamp'
generation_id:
type: string
format: uuid4
description: The generation ID of the parent snippet that this chunk corresponds to.
request_id:
type: string
description: ID of the initiating request.
snippet_id:
type: string
format: uuid4
description: The ID of the parent snippet that this chunk corresponds to.
timestamp:
$ref: '#/components/schemas/Timestamp-Output'
description: A word or phoneme level timestamp for the generated audio.
required:
- type
- generation_id
- request_id
- snippet_id
- timestamp
description: OctaveOutputTimestamp variant
discriminator:
propertyName: type
title: TtsOutput
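# --- Example (illustrative; events abridged, IDs are placeholders) ---
# A streaming response interleaves the two TtsOutput variants, discriminated
# by `type`:
#
#   data: {"type": "audio", "audio": "<base64>", "audio_format": "mp3",
#          "chunk_index": 0, "is_last_chunk": false, "generation_id": "<uuid>",
#          "request_id": "<id>", "snippet_id": "<uuid>", "text": "Hello.", ...}
#   data: {"type": "timestamp", "timestamp": {"text": "Hello", "time": {"begin": 0, "end": 310},
#          "type": "word"}, "generation_id": "<uuid>", "request_id": "<id>", "snippet_id": "<uuid>"}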
ValidationErrorLocItems:
oneOf:
- type: string
- type: integer
title: ValidationErrorLocItems
ValidationError:
type: object
properties:
loc:
type: array
items:
$ref: '#/components/schemas/ValidationErrorLocItems'
msg:
type: string
type:
type: string
required:
- loc
- msg
- type
title: ValidationError
HTTPValidationError:
type: object
properties:
detail:
type: array
items:
$ref: '#/components/schemas/ValidationError'
title: HTTPValidationError
OctaveBodyArgsContext:
oneOf:
- $ref: '#/components/schemas/ContextGenerationId'
- $ref: '#/components/schemas/ContextUtterances'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple requests. These
will not be converted to speech output.
title: OctaveBodyArgsContext
Format:
oneOf:
- type: object
properties:
type:
type: string
enum:
- mp3
description: 'Discriminator value: mp3'
required:
- type
description: Mp3Format variant
- type: object
properties:
type:
type: string
enum:
- pcm
description: 'Discriminator value: pcm'
required:
- type
description: PcmFormat variant
- type: object
properties:
type:
type: string
enum:
- wav
description: 'Discriminator value: wav'
required:
- type
description: WavFormat variant
discriminator:
propertyName: type
description: Specifies the output audio file format.
title: Format
OctaveBodyArgs:
type: object
properties:
context:
oneOf:
- $ref: '#/components/schemas/OctaveBodyArgsContext'
- type: 'null'
description: >-
Utterances to use as context for generating consistent speech style and prosody across multiple requests.
These will not be converted to speech output.
format:
$ref: '#/components/schemas/Format'
description: Specifies the output audio file format.
include_timestamp_types:
type: array
items:
$ref: '#/components/schemas/TimestampType'
description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
num_generations:
type: integer
default: 1
description: >-
Number of audio generations to produce from the input utterances.
Using `num_generations` enables faster processing than issuing multiple sequential requests. Additionally,
specifying `num_generations` allows prosody continuation across all generations without repeating context,
ensuring each generation sounds slightly different while maintaining contextual consistency.
split_utterances:
type: boolean
default: true
description: >-
Controls how audio output is segmented in the response.
- When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.
- When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances
and output snippets.
This setting affects how the `snippets` array is structured in the response, which may be important for
applications that need to track the relationship between input text and generated audio segments. When
setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
strip_headers:
type: boolean
default: false
description: >-
If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a
single audio file. Otherwise, if disabled, each chunk's audio will be its own audio file, each with its own
headers (if ap
# --- truncated at 32 KB (36 KB total) ---
# Full source: https://raw.githubusercontent.com/api-evangelist/hume-ai/refs/heads/main/openapi/hume-ai-tts-openapi.yml