DeepSeek R1 is a powerful open-weight AI model optimized for reasoning and coding tasks. Running it efficiently requires a high-performance inference engine, and vLLM v1 is one of the best choices due to its optimized memory management, fast execution, and seamless integration with Hugging Face models.
In this guide, we’ll walk through the process of installing and running DeepSeek R1 locally using vLLM v1 to achieve high-speed inference on consumer or enterprise GPUs.
vLLM is a fast and easy-to-use library for LLM inference and serving. It is a cutting-edge inference engine designed to:
Maximize throughput while reducing memory overhead
Support PagedAttention for efficient memory management
Support reduced-precision inference (FP16, BF16) and quantized weights (e.g., INT4) to run models on lower VRAM (see the sketch after this list)
Provide a simple OpenAI-compatible API server for serving models
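To give a feel for how these knobs surface in vLLM's Python API, here is a minimal sketch. The small facebook/opt-125m model is used purely as a stand-in, and the dtype and gpu_memory_utilization values are illustrative, not recommendations:

from vllm import LLM, SamplingParams

# Illustrative only: a tiny stand-in model loaded with reduced-precision weights.
# PagedAttention and the KV-cache budget are managed by the engine;
# gpu_memory_utilization caps how much VRAM it may claim.
llm = LLM(
    model="facebook/opt-125m",       # stand-in model for this sketch
    dtype="bfloat16",                # reduced-precision inference
    gpu_memory_utilization=0.90,     # fraction of VRAM for weights + KV cache
)

outputs = llm.generate(
    ["Briefly explain what PagedAttention does."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)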
vLLM V1 is a major upgrade to vLLM’s core architecture. It already achieves state-of-the-art performance and is set to gain further optimizations. With vLLM V1, you can run DeepSeek R1 efficiently, even on GPUs with limited memory.
Before getting started, ensure you have the following:
A Linux-based OS (Ubuntu 20.04+ recommended)
Python 3.9+ installed (recent vLLM releases no longer support Python 3.8)
NVIDIA driver 525+, CUDA 11.8+ (for GPU acceleration)
A GPU with at least 16GB VRAM (e.g., RTX A4000, A5000, A6000, 2xRTX 4090, A100)
Next, we will use a physical server equipped with dual RTX 4090 cards to run DeepSeek-R1 (see our GPU server recommendations for comparable configurations). Note that the exact process may vary depending on the specific DeepSeek R1 model you are using.
To keep things clean, create and activate a Python virtual environment:
python3 -m venv deepseek_env
source deepseek_env/bin/activate
For example, on Ubuntu 22.04, install vLLM along with its dependencies:
pip install vllm --upgrade
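Once the install finishes, you can run a quick sanity check from Python to confirm that vLLM is importable and that both GPUs are visible. This uses torch, which is installed as a vLLM dependency; the exact version printed will depend on when you install:

import torch
import vllm

# Print the installed vLLM version and the CUDA devices PyTorch can see.
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())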
vLLM V1 can be enabled seamlessly: simply set the VLLM_USE_V1=1 environment variable, with no changes to the existing API:
export VLLM_USE_V1=1
Then use vLLM’s Python API or the OpenAI-compatible server (vllm serve <model-name>) exactly as before; no changes to your existing code are required. A minimal example of the Python API is shown below.
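Here is a minimal sketch of the offline Python API with the 14B distill model on two GPUs. The sampling settings are illustrative, and VLLM_USE_V1 is set in-process only as a convenience in case you have not already exported it in your shell:

import os

# Ensure V1 is enabled before vLLM is imported (no-op if already exported in the shell).
os.environ.setdefault("VLLM_USE_V1", "1")

from vllm import LLM, SamplingParams

# Load the 14B distill model across two GPUs with a 4096-token context window.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    tensor_parallel_size=2,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(
    ["Explain the difference between BF16 and FP16 in one paragraph."],
    sampling,
)
print(outputs[0].outputs[0].text)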
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments. You can start the server using Python:
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --max-model-len 4096 --port 8000 --tensor-parallel-size 2
This command will fetch the DeepSeek R1 model from Hugging Face and store it locally on your machine. Depending on your internet speed, this may take a few minutes. You can run DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Llama-8B, and smaller models on a 2× RTX 4090 GPU server. The output is:
[output]
INFO 02-08 05:27:51 __init__.py:190] Automatically detected platform cuda.
INFO 02-08 05:27:53 api_server.py:120] vLLM API server version 0.7.2
……
INFO 02-08 05:29:09 core.py:91] init engine (profile, create kv cache, warmup model) took 60.66 seconds
INFO 02-08 05:29:09 launcher.py:21] Available routes are:
INFO 02-08 05:29:09 launcher.py:29] Route: /openapi.json, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /docs, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /redoc, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /health, Methods: GET
INFO 02-08 05:29:09 launcher.py:29] Route: /generate, Methods: POST
INFO: Started server process [169693]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
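Once the log shows that Uvicorn is running, you can confirm the server is healthy from Python. This is a minimal check against the /health route shown in the log; it assumes the requests package is available (install it with pip if needed):

import requests

# The /health endpoint returns HTTP 200 once the engine is ready to serve requests.
response = requests.get("http://localhost:8000/health")
print("Server healthy:", response.status_code == 200)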
Alternatively, you can run the following command to start the vLLM server with the DeepSeek-R1-Distill-Qwen-14B model:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B" --max_model 4096 --port 8000 --tensor-parallel-size 2
The server hosts one model at a time and implements endpoints such as list models, create chat completion, and create completion.
Note:
--max-model-len: Model context length. If unspecified, it is automatically derived from the model config.
--tensor-parallel-size, -tp: Number of tensor parallel replicas. Default: 1.
Now you have a REST API running locally at http://localhost:8000. This server can be queried in the same format as the OpenAI API. For example, to list the models:
$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B","object":"model","created":1738995551,"owned_by":"vllm","root":"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-1ca751a7c2b14b67971555c249c610c2","object":"model_permission","created":1738995551,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"……
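The same query can be issued from Python with the openai client. This is a small sketch; the "EMPTY" API key is a placeholder that works when no --api-key is configured on the server:

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# List the model(s) currently hosted by the server.
for model in client.models.list():
    print(model.id)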
You can pass the --api-key argument or set the VLLM_API_KEY environment variable to make the server check for an API key in the request header.
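If you do enable an API key, pass the same value from the client so it is sent as a Bearer token. A brief sketch, where "your-secret-key" is a made-up placeholder:

from openai import OpenAI

# The key must match the value the server was started with, e.g. vllm serve ... --api-key <your-key>.
client = OpenAI(
    api_key="your-secret-key",          # placeholder; replace with the server's --api-key value
    base_url="http://localhost:8000/v1",
)
print([m.id for m in client.models.list()])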
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
You can visit http://localhost:8000/docs to view the API interfaces supported by vLLM.
You can use the create chat completion endpoint to interact with the model:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
Alternatively, you can use the openai Python package:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)
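The OpenAI-compatible server also supports streamed responses through the same client. The following sketch prints tokens as they arrive, which is useful for chat UIs; the prompt is only an example:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream the reply chunk-by-chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()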
Running DeepSeek R1 locally with vLLM v1 enables efficient, high-performance inference for AI applications. Whether you're developing AI-powered tools or experimenting with reasoning models, this setup ensures speed and flexibility. For those interested in self-hosted AI, explore our GPU server leasing solutions!