DeepSeek R1 is a powerful open-weight AI model optimized for reasoning and coding tasks. Running it efficiently requires a high-performance inference engine, and vLLM v1 is one of the best choices due to its optimized memory management, fast execution, and seamless integration with Hugging Face models.
In this guide, we’ll walk through the process of installing and running DeepSeek R1 locally using vLLM v1 to achieve high-speed inference on consumer or enterprise GPUs.
vLLM is a fast and easy-to-use library for LLM inference and serving. It is a cutting-edge inference engine designed to:
Maximize throughput while reducing memory overhead
Support PagedAttention for efficient memory management
Support reduced-precision inference (FP16, BF16) and quantized weights (e.g., INT4) to run models on lower VRAM (see the sketch after this list)
Provide a simple OpenAI-compatible API server for serving models
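To give a feel for how these knobs surface in vLLM's Python API, here is a minimal sketch. The small facebook/opt-125m model is used purely as a stand-in, and the dtype and gpu_memory_utilization values are illustrative, not recommendations:

from vllm import LLM, SamplingParams

# Illustrative only: a tiny stand-in model loaded with reduced-precision weights.
# PagedAttention and the KV-cache budget are managed by the engine;
# gpu_memory_utilization caps how much VRAM it may claim.
llm = LLM(
    model="facebook/opt-125m",       # stand-in model for this sketch
    dtype="bfloat16",                # reduced-precision inference
    gpu_memory_utilization=0.90,     # fraction of VRAM for weights + KV cache
)

outputs = llm.generate(
    ["Briefly explain what PagedAttention does."],
    SamplingParams(temperature=0.7, max_tokens=64),
)
print(outputs[0].outputs[0].text)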
vLLM V1 is a major upgrade to vLLM’s core architecture. It already achieves state-of-the-art performance and is set to gain further optimizations. With vLLM V1, you can run DeepSeek R1 efficiently, even on GPUs with limited memory.
Before getting started, ensure you have the following:
A Linux-based OS (Ubuntu 20.04+ recommended)
Python 3.9+ installed (recent vLLM releases no longer support Python 3.8)
NVIDIA driver 525+, CUDA 11.8+ (for GPU acceleration)
A GPU with at least 16GB VRAM (e.g., RTX A4000, A5000, A6000, 2xRTX 4090, A100)
Next, we will use a physical server equipped with dual RTX 4090 cards to run DeepSeek-R1 (see our GPU server recommendations for comparable configurations). Note that the exact process may vary depending on the specific DeepSeek R1 model you are using.
To keep things clean, create and activate a Python virtual environment:
python3 -m venv deepseek_env
source deepseek_env/bin/activate
For example, on Ubuntu 22.04, install vLLM along with its dependencies:
pip install vllm --upgrade
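Once the install finishes, you can run a quick sanity check from Python to confirm that vLLM is importable and that both GPUs are visible. This uses torch, which is installed as a vLLM dependency; the exact version printed will depend on when you install:

import torch
import vllm

# Print the installed vLLM version and the CUDA devices PyTorch can see.
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())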
vLLM V1 can be enabled seamlessly: simply set the VLLM_USE_V1=1 environment variable, with no changes to the existing API:
export VLLM_USE_V1=1
Then use vLLM’s Python API or the OpenAI-compatible server (vllm serve <model-name>) exactly as before; no changes to your existing code are required. A minimal example of the Python API is shown below.
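Here is a minimal sketch of the offline Python API with the 14B distill model on two GPUs. The sampling settings are illustrative, and VLLM_USE_V1 is set in-process only as a convenience in case you have not already exported it in your shell:

import os

# Ensure V1 is enabled before vLLM is imported (no-op if already exported in the shell).
os.environ.setdefault("VLLM_USE_V1", "1")

from vllm import LLM, SamplingParams

# Load the 14B distill model across two GPUs with a 4096-token context window.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    tensor_parallel_size=2,
    max_model_len=4096,
)

sampling = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(
    ["Explain the difference between BF16 and FP16 in one paragraph."],
    sampling,
)
print(outputs[0].outputs[0].text)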
vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications that use the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments. You can start the server using Python:
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-14B --max-model-len 4096 --port 8000 --tensor-parallel-size 2
This command will fetch the DeepSeek R1 model from Hugging Face and store it locally on your machine. Depending on your internet speed, this may take a few minutes. You can run DeepSeek-R1-Distill-Qwen-14B, DeepSeek-R1-Distill-Llama-8B, and smaller models on a 2× RTX 4090 GPU server. The output is:
[output]
INFO 02-08 05:27:51 __init__.py:190] Automatically detected platform cuda.
INFO 02-08 05:27:53 api_server.py:120] vLLM API server version 0.7.2
……
INFO 02-08 05:29:09 core.py:91] init engine (profile, create kv cache, warmup model) took 60.66 seconds
INFO 02-08 05:29:09 launcher.py:21] Available routes are:
INFO 02-08 05:29:09 launcher.py:29] Route: /openapi.json, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /docs, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /redoc, Methods: GET, HEAD
INFO 02-08 05:29:09 launcher.py:29] Route: /health, Methods: GET
INFO 02-08 05:29:09 launcher.py:29] Route: /generate, Methods: POST
INFO: Started server process [169693]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
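Once the log shows that Uvicorn is running, you can confirm the server is healthy from Python. This is a minimal check against the /health route shown in the log; it assumes the requests package is available (install it with pip if needed):

import requests

# The /health endpoint returns HTTP 200 once the engine is ready to serve requests.
response = requests.get("http://localhost:8000/health")
print("Server healthy:", response.status_code == 200)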
Alternatively, you can run the following command to start the vLLM server with the DeepSeek-R1-Distill-Qwen-14B model:
vllm serve "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B" --max_model 4096 --port 8000 --tensor-parallel-size 2
The server hosts one model at a time and implements endpoints such as list models, create chat completion, and create completion.
Note:
--max-model-len: Model context length. If unspecified, it is automatically derived from the model config.
--tensor-parallel-size, -tp: Number of tensor parallel replicas. Default: 1.
Now you have a REST API running locally at http://localhost:8000. This server can be queried in the same format as the OpenAI API. For example, to list the models:
$ curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B","object":"model","created":1738995551,"owned_by":"vllm","root":"deepseek-ai/DeepSeek-R1-Distill-Qwen-14B","parent":null,"max_model_len":4096,"permission":[{"id":"modelperm-1ca751a7c2b14b67971555c249c610c2","object":"model_permission","created":1738995551,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"……
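The same query can be issued from Python with the openai client. This is a small sketch; the "EMPTY" API key is a placeholder that works when no --api-key is configured on the server:

from openai import OpenAI

# Point the client at the local vLLM server instead of api.openai.com.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# List the model(s) currently hosted by the server.
for model in client.models.list():
    print(model.id)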
You can pass the --api-key argument or set the VLLM_API_KEY environment variable to make the server check for an API key in the request header.
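If you do enable an API key, pass the same value from the client so it is sent as a Bearer token. A brief sketch, where "your-secret-key" is a made-up placeholder:

from openai import OpenAI

# The key must match the value the server was started with, e.g. vllm serve ... --api-key <your-key>.
client = OpenAI(
    api_key="your-secret-key",          # placeholder; replace with the server's --api-key value
    base_url="http://localhost:8000/v1",
)
print([m.id for m in client.models.list()])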
vLLM is designed to also support the OpenAI Chat Completions API. The chat interface is a more dynamic, interactive way to communicate with the model, allowing back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that require context or more detailed explanations.
You can visit http://localhost:8000/docs to view the API interfaces supported by vLLM.
You can use the create chat completion endpoint to interact with the model:
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
Alternatively, you can use the openai Python package:
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)
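The OpenAI-compatible server also supports streamed responses through the same client. The following sketch prints tokens as they arrive, which is useful for chat UIs; the prompt is only an example:

from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream the reply chunk-by-chunk instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()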
Running DeepSeek R1 locally with vLLM v1 enables efficient, high-performance inference for AI applications. Whether you're developing AI-powered tools or experimenting with reasoning models, this setup ensures speed and flexibility. For those interested in self-hosted AI, explore our GPU server leasing solutions!