DeepSeek Open Weights: Self-Hosting & Fine-Tuning Guide
Master DeepSeek V4 self-hosting with open weights from HuggingFace. When to self-host vs use the API, vLLM deployment, quantization (GGUF/AWQ/GPTQ), and LoRA/QLoRA fine-tuning workflows.
DeepSeek V4 Pro and Flash are fully open-weight — available on HuggingFace with a permissive license. The same models that power the API can run on your own hardware with full control over inference parameters, quantization settings, and fine-tuning. For organizations with data sovereignty requirements, predictable workloads, or custom domain adaptation needs, self-hosting is viable.
But V4 Pro is 1.6T total parameters (49B active via Mixture of Experts). This is not a weekend Raspberry Pi project. The pages below cover the practical realities of self-hosting — from GPU requirements to fine-tuning workflows.
Self-Hosting vs API: Decision Framework
Choose the API when:
- Monthly token volume is under 50M (break-even is ~50M tokens/month for Pro on dedicated GPUs)
- You need burst capacity (API handles spikes without idle GPU costs)
- You don't want to manage infrastructure
- You're prototyping or have unpredictable workloads
Choose self-hosting when:
- Data sovereignty requires on-premise processing
- Monthly volume is high and predictable (>100M tokens/month)
- You need custom fine-tuning with proprietary data
- Latency requirements can't accept API network overhead
- You need guaranteed capacity (no rate limits)
GPU Requirements
V4 Pro (1.6T total / 49B active MoE)
| Precision | VRAM Required | Recommended GPUs |
|---|---|---|
| FP16 | ~100GB | 2× A100 80GB, 4× A6000 48GB |
| INT8 (AWQ) | ~50GB | 1× A100 80GB, 2× A6000 48GB |
| INT4 (GPTQ) | ~25GB | 1× A6000 48GB, 1× RTX 4090 24GB |
| Q4_K_M (GGUF) | ~28GB | 1× A6000 48GB, 1× RTX 4090 24GB |
V4 Flash (284B total / 13B active MoE)
| Precision | VRAM Required | Recommended GPUs |
|---|---|---|
| FP16 | ~28GB | 1× A6000 48GB, 1× RTX 4090 24GB |
| INT8 (AWQ) | ~14GB | 1× RTX 4090 24GB, 1× RTX 3090 24GB |
| INT4 (GPTQ) | ~7GB | 1× RTX 3060 12GB, Apple M2 Max 32GB |
| Q4_K_M (GGUF) | ~8GB | 1× RTX 4060 Ti 16GB |
Deployment with vLLM
vLLM is the recommended inference engine for DeepSeek:
pip install vllm
# Serve V4 Flash (AWQ quantized)
vllm serve deepseek-ai/DeepSeek-V4-Flash-AWQ \
--max-model-len 131072 \
--gpu-memory-utilization 0.95
# Serve V4 Pro (FP16 on 2× A100)
vllm serve deepseek-ai/DeepSeek-V4-Pro \
--tensor-parallel-size 2 \
--max-model-len 131072
vLLM Configuration Tips
# Optimize for throughput
--max-num-seqs 256 # Concurrent sequences
--max-model-len 131072 # Limit context to manage VRAM
--gpu-memory-utilization 0.95 # Use nearly all GPU memory
--enable-prefix-caching # Enable prefix caching (similar to API)
--enable-chunked-prefill # Better throughput for long prompts
# Optimize for latency
--max-num-seqs 8 # Fewer concurrent sequences
--max-num-batched-tokens 8192 # Smaller batches
Quantization Options
AWQ (Recommended for GPU inference)
Best quality-to-performance ratio for GPU deployment:
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "deepseek-ai/DeepSeek-V4-Flash"
quant_path = "DeepSeek-V4-Flash-AWQ"
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.quantize(tokenizer, quant_config={"zero_point": True, "q_group_size": 128})
model.save_quantized(quant_path)
GGUF (Recommended for local/CPU inference)
Best for running on consumer hardware or CPU-only:
# Convert to GGUF
python convert-hf-to-gguf.py DeepSeek-V4-Flash --outtype q4_k_m
# Run with llama.cpp
./llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 512 \
--ctx-size 32768
GPTQ (High compression)
Maximum compression for constrained VRAM:
from auto_gptq import AutoGPTQForCausalLM
model = AutoGPTQForCausalLM.from_pretrained(
"deepseek-ai/DeepSeek-V4-Flash",
quantize_config={"bits": 4, "group_size": 128}
)
Fine-Tuning
LoRA (Recommended for most use cases)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
model = AutoModelForCausalLM.from_pretrained(
"deepseek-ai/DeepSeek-V4-Flash",
load_in_8bit=True, # Quantize for memory efficiency
device_map="auto"
)
lora_config = LoraConfig(
r=16, # LoRA rank
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
training_args = TrainingArguments(
output_dir="./deepseek-lora",
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
)
trainer.train()
model.save_pretrained("./deepseek-lora-final")
QLoRA (For consumer GPUs)
4-bit quantization + LoRA for fine-tuning on a single 24GB GPU:
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"deepseek-ai/DeepSeek-V4-Flash",
quantization_config=bnb_config,
device_map="auto"
)
# Then apply LoRA as above
Prompting Differences: Self-Hosted vs API
| Aspect | API | Self-Hosted |
|---|---|---|
| Thinking mode | Native support | Depends on inference engine |
| Context caching | Automatic, disk-based | Engine-dependent (vLLM: enable-prefix-caching) |
| Temperature control | Always available | Always available |
| System prompt | Fully supported | Fully supported |
| Tool calls | API handles | Must implement tool loop yourself |
| Quality | Full precision | Quality loss from quantization possible |
Self-Hosted Prompting Best Practices
-
Test quantized models thoroughly — INT4 can subtly degrade reasoning quality. Always benchmark quantized vs FP16 on your specific task.
-
Context caching isn't automatic — vLLM's
--enable-prefix-cachingprovides similar behavior, but other engines may not. -
Thinking mode may require engine support — Not all inference engines support DeepSeek's thinking mode. Test with a simple reasoning task first.
-
Fine-tuned models need prompting adjustments — A fine-tuned model may respond differently to the same prompts. Re-optimize your prompts after fine-tuning.
-
Temperature behaves differently — Quantized models can have altered temperature response curves. Start with lower temperatures.
Note:
Pro Move: For production self-hosting, run V4 Flash (AWQ quantized) on 2× A6000 GPUs. This gives you near-API quality at a fixed monthly cost. Reserve V4 Pro for tasks where the quality difference justifies 2-4× the GPU cost.
Note:
Quantization quality warning: INT4 quantization of MoE models can disproportionately affect the routing mechanism. Test your quantized model on tasks that require the model to route between different expert domains. If quality degrades unexpectedly, step up to INT8.
Related Pages
- Flash vs Pro — Self-hosting economics change depending on which model you deploy.
- Cost Optimization Patterns — Compare API costs vs self-hosting GPU costs to determine break-even.
Related Articles
Poetry Writing with ChatGPT: Master Poetic Forms
Master the art of crafting poetry using ChatGPT prompts. Learn to create sonnets, haiku, free verse, and experimental poetry with effective prompt templates.
Chibi & Kawaii SREF Codes for Midjourney
Discover cute chibi and kawaii SREF codes for Midjourney. Create adorable anime characters with exaggerated proportions, soft pastel aesthetics, and playful features.
Autonomous Coding Strategies: Guardrails & Task Decomposition for Claude Code
Give Claude Code the right guardrails for autonomous work. Master task decomposition, scope limiting, error recovery patterns, and human review checkpoints for reliable agentic coding.