Hume Octave Text-to-Speech API

REST API for synthesizing expressive speech with Octave. Supports streaming responses (server-sent JSON events or a chunked audio file) and standard responses (JSON or a complete audio file), plus voice-conversion endpoints.
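As a quick orientation to the schemas defined below, here is a minimal sketch of an `OctaveBodyArgs` request body for `POST /v0/tts`. It only constructs the JSON payload (no network call); the voice name is illustrative, and a real request must authenticate via the `X-Hume-Api-Key` header:

```python
import json

def build_tts_request(text, voice_name=None, description=None, fmt="mp3"):
    """Build a request body for POST /v0/tts per the OctaveBodyArgs schema."""
    utterance = {"text": text}
    if description is not None:
        # With a voice set, this acts as acting directions; without one,
        # it becomes a prompt for dynamically generating a voice.
        utterance["description"] = description
    if voice_name is not None:
        # Voice Library presets require provider HUME_AI; omitting the
        # provider defaults to CUSTOM_VOICE (your saved voices).
        utterance["voice"] = {"name": voice_name, "provider": "HUME_AI"}
    return {
        "utterances": [utterance],
        "format": {"type": fmt},  # one of: mp3 | pcm | wav
        "num_generations": 1,
    }

body = build_tts_request("Hello from Octave.", voice_name="Example Voice")
print(json.dumps(body, indent=2))
```

The same payload shape (minus `num_generations`, plus `instant_mode` and `strip_headers`) applies to the streaming endpoints via `OctaveBodyArgsStream`.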

OpenAPI Specification

hume-ai-tts-openapi.yml
openapi: 3.1.0
info:
  title: Text-to-Speech (TTS)
  version: 1.0.0
paths:
  /v0/tts/stream/json:
    post:
      operationId: synthesize-json-streaming
      summary: Text-to-Speech (Streamed JSON)
      description: >-
        Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated
        dynamically. Optionally, additional context can be included to influence the speech's style and prosody. 


        The response is a stream of JSON objects including audio encoded in base64.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            text/event-stream:
              schema:
                $ref: '#/components/schemas/TtsOutput'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgsStream'
  /v0/tts/stream/file:
    post:
      operationId: synthesize-file-streaming
      summary: Text-to-Speech (Streamed File)
      description: >-
        Streams synthesized speech using the specified voice. If no voice is provided, a novel voice will be generated
        dynamically. Optionally, additional context can be included to influence the speech's style and prosody.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: OK
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgsStream'
  /v0/tts:
    post:
      operationId: synthesize-json
      summary: Text-to-Speech (JSON)
      description: >-
        Synthesizes one or more input texts into speech using the specified voice. If no voice is provided, a novel
        voice will be generated dynamically. Optionally, additional context can be included to influence the speech's
        style and prosody.


        The response includes the base64-encoded audio and metadata in JSON format.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/OctaveResponse'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgs'
  /v0/tts/file:
    post:
      operationId: synthesize-file
      summary: Text-to-Speech (File)
      description: >-
        Synthesizes one or more input texts into speech using the specified voice. If no voice is provided, a novel
        voice will be generated dynamically. Optionally, additional context can be included to influence the speech's
        style and prosody. 


        The response contains the generated audio file in the requested format.
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: OK
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/OctaveBodyArgs'
  /v0/tts/voice_conversion/file:
    post:
      operationId: convert-voice-file
      summary: Voice Conversion (Streamed File)
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            application/octet-stream:
              schema:
                type: string
                format: binary
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                strip_headers:
                  type: boolean
                  description: >-
                    If enabled, the audio for all the chunks of a generation, once concatenated together, will
                    constitute a single audio file. Otherwise, if disabled, each chunk's audio will be its own audio
                    file, each with its own headers (if applicable).
                audio:
                  type: string
                  format: binary
                  description: >-
                    Audio file containing speech to be converted to the target voice. Supported formats include `MP3`,
                    `WAV`, `M4A`, and `OGG`.
                context:
                  oneOf:
                    - $ref: >-
                        #/components/schemas/V0TtsVoiceConversionFilePostRequestBodyContentMultipartFormDataSchemaContext
                    - type: 'null'
                  description: >-
                    Utterances to use as context for generating consistent speech style and prosody across multiple
                    requests. These will not be converted to speech output.
                voice:
                  $ref: '#/components/schemas/VoiceRef'
                format:
                  $ref: '#/components/schemas/Format'
                  description: Specifies the output audio file format.
                include_timestamp_types:
                  type: array
                  items:
                    $ref: '#/components/schemas/TimestampType'
                  description: >-
                    The set of timestamp types to include in the response. When used in multipart/form-data, specify
                    each value using bracket notation:
                    `include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`. Only supported for Octave 2
                    requests.
              required:
                - audio
  /v0/tts/voice_conversion/json:
    post:
      operationId: convert-voice-json
      summary: Voice Conversion (Streamed JSON)
      tags:
        - ''
      parameters:
        - name: X-Hume-Api-Key
          in: header
          required: true
          schema:
            type: string
      responses:
        '200':
          description: Successful Response
          content:
            text/event-stream:
              schema:
                $ref: '#/components/schemas/TtsOutput'
        '422':
          description: Validation Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
      requestBody:
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                strip_headers:
                  type: boolean
                  description: >-
                    If enabled, the audio for all the chunks of a generation, once concatenated together, will
                    constitute a single audio file. Otherwise, if disabled, each chunk's audio will be its own audio
                    file, each with its own headers (if applicable).
                audio:
                  type: string
                  format: binary
                  description: >-
                    Audio file containing speech to be converted to the target voice. Supported formats include `MP3`,
                    `WAV`, `M4A`, and `OGG`.
                context:
                  oneOf:
                    - $ref: >-
                        #/components/schemas/V0TtsVoiceConversionJsonPostRequestBodyContentMultipartFormDataSchemaContext
                    - type: 'null'
                  description: >-
                    Utterances to use as context for generating consistent speech style and prosody across multiple
                    requests. These will not be converted to speech output.
                voice:
                  $ref: '#/components/schemas/VoiceRef'
                format:
                  $ref: '#/components/schemas/Format'
                  description: Specifies the output audio file format.
                include_timestamp_types:
                  type: array
                  items:
                    $ref: '#/components/schemas/TimestampType'
                  description: >-
                    The set of timestamp types to include in the response. When used in multipart/form-data, specify
                    each value using bracket notation:
                    `include_timestamp_types[0]=word&include_timestamp_types[1]=phoneme`. Only supported for Octave 2
                    requests.
servers:
  - url: https://api.hume.ai
components:
  schemas:
    ContextGenerationId:
      type: object
      properties:
        generation_id:
          type: string
          format: uuid4
          description: >-
            The ID of a prior TTS generation to use as context for generating consistent speech style and prosody across
            multiple requests. Including context may increase audio generation times.
      required:
        - generation_id
      title: ContextGenerationId
    VoiceProvider:
      type: string
      enum:
        - HUME_AI
        - CUSTOM_VOICE
      title: VoiceProvider
    VoiceId:
      type: object
      properties:
        id:
          type: string
          description: The unique ID associated with the **Voice**.
        provider:
          $ref: '#/components/schemas/VoiceProvider'
          description: >-
            Specifies the source provider associated with the chosen voice.


            - **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library),
            containing a variety of preset, shared voices.

            - **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account. 


            If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's
            **Voice Library**, you must explicitly set the provider to `HUME_AI`.


            Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are
            private and accessible only via requests authenticated with your API key.
      required:
        - id
      title: VoiceId
    VoiceName:
      type: object
      properties:
        name:
          type: string
          description: The name of a **Voice**.
        provider:
          $ref: '#/components/schemas/VoiceProvider'
          description: >-
            Specifies the source provider associated with the chosen voice.


            - **`HUME_AI`**: Select voices from Hume's [Voice Library](https://app.hume.ai/tts/voice-library),
            containing a variety of preset, shared voices.

            - **`CUSTOM_VOICE`**: Select from voices you've personally generated and saved in your account. 


            If no provider is explicitly set, the default provider is `CUSTOM_VOICE`. When using voices from Hume's
            **Voice Library**, you must explicitly set the provider to `HUME_AI`.


            Preset voices from Hume's **Voice Library** are accessible by all users. In contrast, your custom voices are
            private and accessible only via requests authenticated with your API key.
      required:
        - name
      title: VoiceName
    VoiceRef:
      oneOf:
        - $ref: '#/components/schemas/VoiceId'
        - $ref: '#/components/schemas/VoiceName'
      title: VoiceRef
    Utterance:
      type: object
      properties:
        description:
          type:
            - string
            - 'null'
          description: >-
            Natural language instructions describing how the synthesized speech should sound, including but not limited
            to tone, intonation, pacing, and accent.


            **This field behaves differently depending on whether a voice is specified**:

            - **Voice specified**: the description will serve as acting directions for delivery. Keep directions
            concise—100 characters or fewer—for best results. See our guide on [acting
            instructions](/docs/text-to-speech-tts/acting-instructions).

            - **Voice not specified**: the description will serve as a voice prompt for generating a voice. See our
            [prompting guide](/docs/text-to-speech-tts/prompting) for design tips.
        speed:
          type: number
          format: double
          default: 1
          description: >-
            Speed multiplier for the synthesized speech. Values below 0.75 or above 1.5 may cause instability in the
            generated output.
        text:
          type: string
          description: The input text to be synthesized into speech.
        trailing_silence:
          type: number
          format: double
          default: 0
          description: Duration of trailing silence (in seconds) to add to this utterance.
        voice:
          oneOf:
            - $ref: '#/components/schemas/VoiceRef'
            - type: 'null'
          description: >-
            The `name` or `id` associated with a **Voice** from the **Voice Library** to be used as the speaker for this
            and all subsequent `utterances`, until the `voice` field is updated again.

             See our [voices guide](/docs/text-to-speech-tts/voices) for more details on generating and specifying **Voices**.
      required:
        - text
      title: Utterance
    ContextUtterances:
      type: object
      properties:
        utterances:
          type: array
          items:
            $ref: '#/components/schemas/Utterance'
      required:
        - utterances
      title: ContextUtterances
    OctaveBodyArgsStreamContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style and prosody across multiple requests. These
        will not be converted to speech output.
      title: OctaveBodyArgsStreamContext
    OctaveBodyArgsStreamFormat:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - mp3
              description: 'Discriminator value: mp3'
          required:
            - type
          description: Mp3Format variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - pcm
              description: 'Discriminator value: pcm'
          required:
            - type
          description: PcmFormat variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - wav
              description: 'Discriminator value: wav'
          required:
            - type
          description: WavFormat variant
      discriminator:
        propertyName: type
      description: Specifies the output audio file format.
      title: OctaveBodyArgsStreamFormat
    TimestampType:
      type: string
      enum:
        - word
        - phoneme
      title: TimestampType
    OctaveVersion:
      type: string
      enum:
        - '1'
        - '2'
      description: |-
        Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume
        automatically routes the request to the most appropriate model. Setting a specific version ensures stable and
        repeatable behavior across requests.

        Use `2` to opt into the latest Octave capabilities. When you specify version `2`, you must also provide a
        `voice`. Requests that set `version: 2` without a voice will be rejected.

        For a comparison of Octave versions, see the
        [Octave versions](/docs/text-to-speech-tts/overview#octave-versions) section in the TTS overview.
      title: OctaveVersion
    OctaveBodyArgsStream:
      type: object
      properties:
        context:
          oneOf:
            - $ref: '#/components/schemas/OctaveBodyArgsStreamContext'
            - type: 'null'
          description: >-
            Utterances to use as context for generating consistent speech style and prosody across multiple requests.
            These will not be converted to speech output.
        format:
          $ref: '#/components/schemas/OctaveBodyArgsStreamFormat'
          description: Specifies the output audio file format.
        include_timestamp_types:
          type: array
          items:
            $ref: '#/components/schemas/TimestampType'
          description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
        instant_mode:
          type: boolean
          default: true
          description: >-
            Enables ultra-low latency streaming, significantly reducing the time until the first audio chunk is
            received. Recommended for real-time applications requiring immediate audio playback. For further details,
            see our documentation on [instant
            mode](/docs/text-to-speech-tts/overview#ultra-low-latency-streaming-instant-mode). 

            - A [voice](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.utterances.voice) must be
            specified when instant mode is enabled. Dynamic voice generation is not supported with this mode.

            - Instant mode is only supported for streaming endpoints (e.g.,
            [/v0/tts/stream/json](/reference/text-to-speech-tts/synthesize-json-streaming),
            [/v0/tts/stream/file](/reference/text-to-speech-tts/synthesize-file-streaming)).

            - Ensure only a single generation is requested
            ([num_generations](/reference/text-to-speech-tts/synthesize-json-streaming#request.body.num_generations)
            must be `1` or omitted).
        num_generations:
          type: integer
          default: 1
          description: >-
            Number of audio generations to produce from the input utterances.


            Using `num_generations` enables faster processing than issuing multiple sequential requests. Additionally,
            specifying `num_generations` allows prosody continuation across all generations without repeating context,
            ensuring each generation sounds slightly different while maintaining contextual consistency.
        split_utterances:
          type: boolean
          default: true
          description: >-
            Controls how audio output is segmented in the response.


            - When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.


            - When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances
            and output snippets. 


            This setting affects how the `snippets` array is structured in the response, which may be important for
            applications that need to track the relationship between input text and generated audio segments. When
            setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
        strip_headers:
          type: boolean
          default: false
          description: >-
            If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a
            single audio file. Otherwise, if disabled, each chunk's audio will be its own audio file, each with its own
            headers (if applicable).
        utterances:
          type: array
          items:
            $ref: '#/components/schemas/Utterance'
          description: >-
            A list of **Utterances** to be converted to speech output.


            An **Utterance** is a unit of input for [Octave](/docs/text-to-speech-tts/overview), and includes input
            `text`, an optional `description` to serve as the prompt for how the speech should be delivered, an optional
            `voice` specification, and additional controls to guide delivery for `speed` and `trailing_silence`.
        version:
          $ref: '#/components/schemas/OctaveVersion'
          description: >-
            Selects the Octave model version used to synthesize speech for this request. If you omit this field, Hume
            automatically routes the request to the most appropriate model. Setting a specific version ensures stable
            and repeatable behavior across requests.


            Use `2` to opt into the latest Octave capabilities. When you specify version `2`, you must also provide a
            `voice`. Requests that set `version: 2` without a voice will be rejected.


            For a comparison of Octave versions, see the [Octave
            versions](/docs/text-to-speech-tts/overview#octave-versions) section in the TTS overview.
      required:
        - utterances
      title: OctaveBodyArgsStream
    FormatType:
      type: string
      enum:
        - mp3
        - pcm
        - wav
      title: FormatType
    MillisecondInterval:
      type: object
      properties:
        begin:
          type: integer
          description: Start time of the interval in milliseconds.
        end:
          type: integer
          description: End time of the interval in milliseconds.
      required:
        - begin
        - end
      title: MillisecondInterval
    Timestamp-Output:
      type: object
      properties:
        text:
          type: string
          description: The word or phoneme text that the timestamp corresponds to.
        time:
          $ref: '#/components/schemas/MillisecondInterval'
          description: The start and end timestamps for the word or phoneme in milliseconds.
        type:
          $ref: '#/components/schemas/TimestampType'
      required:
        - text
        - time
        - type
      title: Timestamp-Output
    Snippet-Output:
      type: object
      properties:
        audio:
          type: string
          description: The segmented audio output in the requested format, encoded as a base64 string.
        generation_id:
          type: string
          format: uuid4
          description: The generation ID this snippet corresponds to.
        id:
          type: string
          format: uuid4
          description: A unique ID associated with this **Snippet**.
        text:
          type: string
          description: The text for this **Snippet**.
        timestamps:
          type: array
          items:
            $ref: '#/components/schemas/Timestamp-Output'
          description: >-
            A list of word or phoneme level timestamps for the generated audio. Timestamps are only returned for Octave
            2 requests.
        transcribed_text:
          type:
            - string
            - 'null'
          description: The transcribed text of the generated audio. It is only present if `instant_mode` is set to `false`.
        utterance_index:
          type:
            - integer
            - 'null'
          description: The index of the utterance in the request this snippet corresponds to.
      required:
        - audio
        - generation_id
        - id
        - text
        - timestamps
        - transcribed_text
        - utterance_index
      title: Snippet-Output
    TtsOutput:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - audio
              description: 'Discriminator value: audio'
            audio:
              type: string
              description: The generated audio output chunk in the requested format.
            audio_format:
              $ref: '#/components/schemas/FormatType'
              description: The generated audio output format.
            chunk_index:
              type: integer
              description: The index of the audio chunk in the snippet.
            generation_id:
              type: string
              format: uuid4
              description: The generation ID of the parent snippet that this chunk corresponds to.
            is_last_chunk:
              type: boolean
              description: Whether or not this is the last chunk streamed back from the decoder for one input snippet.
            request_id:
              type: string
              description: ID of the initiating request.
            snippet:
              $ref: '#/components/schemas/Snippet-Output'
            snippet_id:
              type: string
              format: uuid4
              description: The ID of the parent snippet that this chunk corresponds to.
            text:
              type: string
              description: The text of the parent snippet that this chunk corresponds to.
            transcribed_text:
              type:
                - string
                - 'null'
              description: >-
                The transcribed text of the generated audio of the parent snippet that this chunk corresponds to. It is
                only present if `instant_mode` is set to `false`.
            utterance_index:
              type:
                - integer
                - 'null'
              description: The index of the utterance in the request that the parent snippet of this chunk corresponds to.
          required:
            - type
            - audio
            - audio_format
            - chunk_index
            - generation_id
            - is_last_chunk
            - request_id
            - snippet_id
            - text
            - transcribed_text
            - utterance_index
          description: SnippetAudioChunk variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - timestamp
              description: 'Discriminator value: timestamp'
            generation_id:
              type: string
              format: uuid4
              description: The generation ID of the parent snippet that this chunk corresponds to.
            request_id:
              type: string
              description: ID of the initiating request.
            snippet_id:
              type: string
              format: uuid4
              description: The ID of the parent snippet that this chunk corresponds to.
            timestamp:
              $ref: '#/components/schemas/Timestamp-Output'
              description: A word or phoneme level timestamp for the generated audio.
          required:
            - type
            - generation_id
            - request_id
            - snippet_id
            - timestamp
          description: OctaveOutputTimestamp variant
      discriminator:
        propertyName: type
      title: TtsOutput
    ValidationErrorLocItems:
      oneOf:
        - type: string
        - type: integer
      title: ValidationErrorLocItems
    ValidationError:
      type: object
      properties:
        loc:
          type: array
          items:
            $ref: '#/components/schemas/ValidationErrorLocItems'
        msg:
          type: string
        type:
          type: string
      required:
        - loc
        - msg
        - type
      title: ValidationError
    HTTPValidationError:
      type: object
      properties:
        detail:
          type: array
          items:
            $ref: '#/components/schemas/ValidationError'
      title: HTTPValidationError
    OctaveBodyArgsContext:
      oneOf:
        - $ref: '#/components/schemas/ContextGenerationId'
        - $ref: '#/components/schemas/ContextUtterances'
      description: >-
        Utterances to use as context for generating consistent speech style and prosody across multiple requests. These
        will not be converted to speech output.
      title: OctaveBodyArgsContext
    Format:
      oneOf:
        - type: object
          properties:
            type:
              type: string
              enum:
                - mp3
              description: 'Discriminator value: mp3'
          required:
            - type
          description: Mp3Format variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - pcm
              description: 'Discriminator value: pcm'
          required:
            - type
          description: PcmFormat variant
        - type: object
          properties:
            type:
              type: string
              enum:
                - wav
              description: 'Discriminator value: wav'
          required:
            - type
          description: WavFormat variant
      discriminator:
        propertyName: type
      description: Specifies the output audio file format.
      title: Format
    OctaveBodyArgs:
      type: object
      properties:
        context:
          oneOf:
            - $ref: '#/components/schemas/OctaveBodyArgsContext'
            - type: 'null'
          description: >-
            Utterances to use as context for generating consistent speech style and prosody across multiple requests.
            These will not be converted to speech output.
        format:
          $ref: '#/components/schemas/Format'
          description: Specifies the output audio file format.
        include_timestamp_types:
          type: array
          items:
            $ref: '#/components/schemas/TimestampType'
          description: The set of timestamp types to include in the response. Only supported for Octave 2 requests.
        num_generations:
          type: integer
          default: 1
          description: >-
            Number of audio generations to produce from the input utterances.


            Using `num_generations` enables faster processing than issuing multiple sequential requests. Additionally,
            specifying `num_generations` allows prosody continuation across all generations without repeating context,
            ensuring each generation sounds slightly different while maintaining contextual consistency.
        split_utterances:
          type: boolean
          default: true
          description: >-
            Controls how audio output is segmented in the response.


            - When **enabled** (`true`), input utterances are automatically split into natural-sounding speech segments.


            - When **disabled** (`false`), the response maintains a strict one-to-one mapping between input utterances
            and output snippets. 


            This setting affects how the `snippets` array is structured in the response, which may be important for
            applications that need to track the relationship between input text and generated audio segments. When
            setting to `false`, avoid including utterances with long `text`, as this can result in distorted output.
        strip_headers:
          type: boolean
          default: false
          description: >-
            If enabled, the audio for all the chunks of a generation, once concatenated together, will constitute a
            single audio file. Otherwise, if disabled, each chunk's audio will be its own audio file, each with its own
            headers (if ap

# --- truncated at 32 KB (36 KB total) ---
# Full source: https://raw.githubusercontent.com/api-evangelist/hume-ai/refs/heads/main/openapi/hume-ai-tts-openapi.yml