Scalable Inference Serving

vLLM OpenAI-Compatible API

vLLM is a high-throughput and memory-efficient inference engine for LLMs, implementing PagedAttention for efficient KV cache management. vLLM exposes an OpenAI-compatible REST API allowing seamless migration from OpenAI endpoints. In 2026, vLLM integrates with KServe via LLMInferenceService and llm-d for production-grade distributed LLM inference. Powers major LLM deployments at scale.

Documentation GitHub

Documentation

📖

Documentation

https://docs.vllm.ai/en/stable/

📖

APIReference

https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html

Other Resources

🔗

GitHub

https://github.com/vllm-project/vllm

🔗

ChangeLog

https://github.com/vllm-project/vllm/releases

API entry from apis.yml

name: vLLM OpenAI-Compatible API
description: vLLM is a high-throughput and memory-efficient inference engine for LLMs, implementing PagedAttention
  for efficient KV cache management. vLLM exposes an OpenAI-compatible REST API allowing seamless migration
  from OpenAI endpoints. In 2026, vLLM integrates with KServe via LLMInferenceService and llm-d for production-grade
  distributed LLM inference. Powers major LLM deployments at scale.
image: https://docs.vllm.ai/en/stable/_static/logo/vllm-logo-text-light.png
humanUrl: https://docs.vllm.ai/
baseUrl: https://vllm.example.com/v1
tags:
- GPU
- Inference
- KV Cache
- LLM
- Model Serving
- Open Source
- OpenAI-Compatible
properties:
- type: Documentation
  url: https://docs.vllm.ai/en/stable/
- type: GitHub
  url: https://github.com/vllm-project/vllm
- type: APIReference
  url: https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html
- type: ChangeLog
  url: https://github.com/vllm-project/vllm/releases
contact:
- type: GitHub Issues
  url: https://github.com/vllm-project/vllm/issues
- type: Slack
  url: https://vllm-dev.slack.com/