Lambda Inference API

Lambda Inference API is an OpenAI-compatible REST gateway at https://api.lambda.ai/v1 that serves hosted open-source language models (Llama, DeepSeek, Hermes, Qwen, and others) behind the standard OpenAI Chat Completions surface. Chat completion responses can be streamed as HTTP Server-Sent Events (SSE) by setting "stream": true in the POST /chat/completions request body; the SSE stream emits chat.completion.chunk events and is terminated by a data: [DONE] sentinel. As of 2026-05-29, Lambda has announced that the Inference API is winding down in favor of customer self-hosted deployments on Lambda GPU instances.

AsyncAPI Specification

lambda-labs-asyncapi.yml
asyncapi: '2.6.0'
id: 'urn:ai:lambda:inference:v1:chat-completions:sse'
info:
  title: Lambda Inference API Chat Completions Streaming (HTTP + SSE)
  version: '1.0.0'
  description: |
    AsyncAPI 2.6 description of the Lambda (formerly Lambda Labs) **Inference
    API** chat completion streaming surface.

    The Lambda Inference API is an OpenAI-compatible REST gateway hosted at
    `https://api.lambda.ai/v1`. Chat completions are issued by
    `POST /chat/completions` with a JSON body that follows the OpenAI Chat
    Completions schema. When the request body sets `stream: true`, the server
    responds with `Content-Type: text/event-stream` and emits a sequence of
    Server-Sent Events whose `data:` payloads each carry one
    `chat.completion.chunk` JSON object, followed by a final `data: [DONE]`
    sentinel that marks end of stream. This SSE behavior is inherited from
    the OpenAI Chat Completions contract that Lambda advertises full
    compatibility with.

    SSE is a one-way, server-to-client HTTP streaming channel; it is
    **not** WebSocket. Lambda does not publish a WebSocket, MQTT, AMQP,
    Kafka, or webhook surface for inference. This AsyncAPI document models
    only the streamed events emitted on the SSE response. The request body
    fields (model, messages, temperature, max_tokens, tools, etc.) belong
    to the synchronous request side and are out of scope here; the parent
    REST surface is cataloged separately.

    Status note (verified 2026-05-29 on https://lambda.ai/inference): Lambda
    has announced that the Inference API is winding down in favor of customer
    self-hosted deployments on Lambda GPU instances. No end-of-life date was
    published at the time of authoring; the SSE contract described here
    remains active while the service is operational.

    Only fields and behaviors that Lambda explicitly advertises as
    OpenAI-compatible are modeled. Provider-proprietary metadata (e.g.
    Groq-style `x_groq`, Together-style `usage` sidecars) is intentionally
    not invented for Lambda; if Lambda later publishes proprietary stream
    extensions, they should be added here against a primary source.
  contact:
    name: API Evangelist
    url: https://apievangelist.com
  license:
    name: API documentation - Lambda Terms of Service
    url: https://lambda.ai/legal/terms-of-service
  x-transport-notes:
    transport: HTTP Server-Sent Events (SSE)
    protocol: https
    direction: server-to-client (one-way)
    mediaType: text/event-stream
    triggeredBy: 'POST https://api.lambda.ai/v1/chat/completions with request body { "stream": true }'
    terminator: 'data: [DONE]'
    notWebSocket: true
    openAiCompatible: true
    sources:
      - https://docs.lambda.ai/public-cloud/lambda-inference-api/
      - https://lambda.ai/inference
      - https://lambda.ai/blog/deepseek-r1-0528-on-lambda-inference-api
  x-status:
    lifecycle: winding-down
    sourceUrl: https://lambda.ai/inference
    quote: 'As the Inference API winds down, you can continue deploying and scaling models seamlessly on NVIDIA GPU instances.'
defaultContentType: text/event-stream
servers:
  inference:
    url: api.lambda.ai/v1
    protocol: https
    description: |
      Lambda Inference API OpenAI-compatible REST base. Chat completion
      streaming is delivered as HTTP Server-Sent Events over this base when
      `stream: true` is set on the JSON request body. AsyncAPI 2.6 does not
      define a dedicated SSE protocol identifier; `https` is used here and
      the SSE transport is documented in `info.x-transport-notes` and on
      each channel.
    security:
      - bearerAuth: []
channels:
  /chat/completions:
    description: |
      Chat completion SSE stream. The client opens this channel by issuing
      `POST /chat/completions` with `Content-Type: application/json`,
      `Authorization: Bearer <LAMBDA_API_KEY>`, and a JSON body containing
      `stream: true`. The server responds with
      `Content-Type: text/event-stream` and emits a sequence of `data:`
      lines, each carrying one JSON-serialized `chat.completion.chunk`
      object, followed by a final `data: [DONE]` line that terminates the
      stream.
    bindings:
      http:
        type: request
        method: POST
        bindingVersion: '0.3.0'
      x-sse:
        mediaType: text/event-stream
        eventField: 'data'
        terminator: '[DONE]'
    subscribe:
      operationId: streamChatCompletionChunks
      summary: Subscribe to streamed chat completion chunks (SSE).
      description: |
        After `POST /chat/completions` is issued with `stream: true`, the
        server emits an ordered sequence of SSE `data:` events. Each
        `data:` line either carries a JSON-serialized `ChatCompletionChunk`
        or the literal sentinel `[DONE]` marking end of stream. The
        chunk shape is OpenAI-compatible per Lambda's advertised contract.
      bindings:
        http:
          type: response
          bindingVersion: '0.3.0'
      message:
        oneOf:
          - $ref: '#/components/messages/ChatCompletionChunk'
          - $ref: '#/components/messages/StreamDone'
components:
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer
      bearerFormat: 'Lambda API key'
      description: |
        Lambda Inference API bearer token. Set the
        `Authorization: Bearer <LAMBDA_API_KEY>` header on the
        `POST /chat/completions` request that opens the SSE stream. Keys
        are minted in the Lambda Cloud console.
  messages:
    ChatCompletionChunk:
      name: ChatCompletionChunk
      title: Streamed chat completion chunk
      summary: |
        A single SSE `data:` event carrying one JSON `chat.completion.chunk`
        object. Many of these are emitted per request, in order.
      contentType: application/json
      description: |
        Sent as `data: {json}\n\n` on the SSE stream. The JSON object's
        `object` field is the literal string `chat.completion.chunk`, per
        the OpenAI Chat Completions streaming contract that Lambda
        advertises full compatibility with.
      payload:
        $ref: '#/components/schemas/ChatCompletionChunk'
      examples:
        - name: openingChunk
          summary: First chunk - establishes assistant role
          payload:
            id: chatcmpl-lambda-abc123
            object: chat.completion.chunk
            created: 1748524800
            model: deepseek-r1-0528
            choices:
              - index: 0
                delta:
                  role: assistant
                  content: ''
                finish_reason: null
        - name: contentChunk
          summary: Token delta
          payload:
            id: chatcmpl-lambda-abc123
            object: chat.completion.chunk
            created: 1748524800
            model: deepseek-r1-0528
            choices:
              - index: 0
                delta:
                  content: 'Hello'
                finish_reason: null
        - name: finalChunk
          summary: Final chunk - finish_reason set
          payload:
            id: chatcmpl-lambda-abc123
            object: chat.completion.chunk
            created: 1748524800
            model: deepseek-r1-0528
            choices:
              - index: 0
                delta: {}
                finish_reason: stop
    StreamDone:
      name: StreamDone
      title: Stream terminator
      summary: |
        The literal SSE event `data: [DONE]` that marks end of stream. Not
        JSON; the payload is the string `[DONE]`.
      contentType: text/plain
      description: |
        Per the OpenAI Chat Completions streaming contract that Lambda
        advertises full compatibility with, the SSE stream is terminated
        by a `data: [DONE]` sentinel. Clients must stop reading the stream
        when this sentinel is observed.
      payload:
        $ref: '#/components/schemas/StreamDoneSentinel'
      examples:
        - name: done
          summary: End-of-stream sentinel
          payload: '[DONE]'
  schemas:
    StreamDoneSentinel:
      type: string
      enum:
        - '[DONE]'
      description: |
        End-of-stream sentinel. The full SSE line is `data: [DONE]`. The
        payload value modeled here is the string literal `[DONE]`.
    ChatCompletionChunk:
      type: object
      description: |
        Represents a streamed chunk of a chat completion response. Shape
        follows the OpenAI `chat.completion.chunk` schema that Lambda
        advertises compatibility with.
      required:
        - choices
        - created
        - id
        - model
        - object
      properties:
        id:
          type: string
          description: |
            A unique identifier for the chat completion. Each chunk in a
            given stream shares the same id.
        choices:
          type: array
          description: |
            A list of chat completion choices. Contains more than one
            element if the request set `n` greater than 1.
          items:
            $ref: '#/components/schemas/ChatCompletionChunkChoice'
        created:
          type: integer
          description: |
            Unix timestamp (seconds) of when the chat completion was
            created. Each chunk in a given stream shares the same timestamp.
        model:
          type: string
          description: |
            The model used to generate the completion (e.g.
            `deepseek-r1-0528`). The value echoes the `model` field of the
            request. Lambda's catalog of currently-served models is
            published at https://docs.lambda.ai/public-cloud/lambda-inference-api/.
        object:
          type: string
          enum:
            - chat.completion.chunk
          description: The object type, which is always `chat.completion.chunk`.
        system_fingerprint:
          type: string
          description: |
            Optional fingerprint of the backend configuration the model
            runs with, per the OpenAI Chat Completions streaming
            contract. Presence depends on the served model.
    ChatCompletionChunkChoice:
      type: object
      required:
        - delta
        - index
      properties:
        index:
          type: integer
          description: The index of the choice in the list of choices.
        delta:
          $ref: '#/components/schemas/ChatCompletionStreamResponseDelta'
        finish_reason:
          type:
            - string
            - 'null'
          enum:
            - stop
            - length
            - tool_calls
            - content_filter
            - function_call
            - null
          description: |
            Reason the model stopped generating tokens, per the OpenAI
            Chat Completions streaming contract. Null on every chunk
            except the last chunk of a given choice.
        logprobs:
          type:
            - object
            - 'null'
          description: |
            Log probability information for the choice, present only when
            the request enabled logprobs and the served model supports
            them.
    ChatCompletionStreamResponseDelta:
      type: object
      description: |
        A chat completion delta generated by a streamed model response.
        Fields are the OpenAI-compatible subset Lambda advertises.
      properties:
        role:
          type: string
          enum:
            - system
            - user
            - assistant
            - tool
          description: |
            Role of the author of this message. Typically only emitted on
            the first chunk of a given choice.
        content:
          type:
            - string
            - 'null'
          description: |
            Token slice for this chunk. May be an empty string on the
            opening chunk, and may be absent (an empty `delta`) on the
            final chunk that carries `finish_reason`.
        tool_calls:
          type: array
          description: |
            Streaming tool-call fragments, present only when the request
            and served model engage tool calling. Each item carries a
            delta of a single tool call indexed by `index`. The full
            argument string for a given tool call is assembled by
            concatenating `function.arguments` across successive chunks
            with the same `index`.
          items:
            $ref: '#/components/schemas/ChatCompletionMessageToolCallChunk'
    ChatCompletionMessageToolCallChunk:
      type: object
      required:
        - index
      properties:
        index:
          type: integer
          description: |
            Index of the tool call within the choice's tool_calls array.
        id:
          type: string
          description: The id of the tool call, emitted on the first chunk for that index.
        type:
          type: string
          enum:
            - function
          description: |
            The type of the tool. Only `function` is part of the
            OpenAI-compatible streaming contract Lambda advertises.
        function:
          type: object
          properties:
            name:
              type: string
              description: |
                The name of the function to call, typically emitted on
                the first chunk for that tool-call index.
            arguments:
              type: string
              description: |
                JSON-encoded arguments fragment. The full argument string
                is assembled by concatenating `function.arguments` across
                successive chunks with the same `index`. Intermediate
                states may be invalid JSON; validate after assembly.