DeepSeek Open Weights: Self-Hosting & Fine-Tuning Guide

Master DeepSeek V4 self-hosting with open weights from HuggingFace. When to self-host vs use the API, vLLM deployment, quantization (GGUF/AWQ/GPTQ), and LoRA/QLoRA fine-tuning workflows.

June 11, 2026
DeepSeekOpen SourceSelf-HostingvLLMFine-TuningGGUF

DeepSeek V4 Pro and Flash are fully open-weight — available on HuggingFace with a permissive license. The same models that power the API can run on your own hardware with full control over inference parameters, quantization settings, and fine-tuning. For organizations with data sovereignty requirements, predictable workloads, or custom domain adaptation needs, self-hosting is viable.

But V4 Pro is 1.6T total parameters (49B active via Mixture of Experts). This is not a weekend Raspberry Pi project. The pages below cover the practical realities of self-hosting — from GPU requirements to fine-tuning workflows.

Self-Hosting vs API: Decision Framework

Choose the API when:

  • Monthly token volume is under 50M (break-even is ~50M tokens/month for Pro on dedicated GPUs)
  • You need burst capacity (API handles spikes without idle GPU costs)
  • You don't want to manage infrastructure
  • You're prototyping or have unpredictable workloads

Choose self-hosting when:

  • Data sovereignty requires on-premise processing
  • Monthly volume is high and predictable (>100M tokens/month)
  • You need custom fine-tuning with proprietary data
  • Latency requirements can't accept API network overhead
  • You need guaranteed capacity (no rate limits)

GPU Requirements

V4 Pro (1.6T total / 49B active MoE)

PrecisionVRAM RequiredRecommended GPUs
FP16~100GB2× A100 80GB, 4× A6000 48GB
INT8 (AWQ)~50GB1× A100 80GB, 2× A6000 48GB
INT4 (GPTQ)~25GB1× A6000 48GB, 1× RTX 4090 24GB
Q4_K_M (GGUF)~28GB1× A6000 48GB, 1× RTX 4090 24GB

V4 Flash (284B total / 13B active MoE)

PrecisionVRAM RequiredRecommended GPUs
FP16~28GB1× A6000 48GB, 1× RTX 4090 24GB
INT8 (AWQ)~14GB1× RTX 4090 24GB, 1× RTX 3090 24GB
INT4 (GPTQ)~7GB1× RTX 3060 12GB, Apple M2 Max 32GB
Q4_K_M (GGUF)~8GB1× RTX 4060 Ti 16GB

Deployment with vLLM

vLLM is the recommended inference engine for DeepSeek:

pip install vllm

# Serve V4 Flash (AWQ quantized)
vllm serve deepseek-ai/DeepSeek-V4-Flash-AWQ \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95

# Serve V4 Pro (FP16 on 2× A100)
vllm serve deepseek-ai/DeepSeek-V4-Pro \
    --tensor-parallel-size 2 \
    --max-model-len 131072

vLLM Configuration Tips

# Optimize for throughput
--max-num-seqs 256           # Concurrent sequences
--max-model-len 131072       # Limit context to manage VRAM
--gpu-memory-utilization 0.95 # Use nearly all GPU memory
--enable-prefix-caching       # Enable prefix caching (similar to API)
--enable-chunked-prefill      # Better throughput for long prompts

# Optimize for latency
--max-num-seqs 8             # Fewer concurrent sequences
--max-num-batched-tokens 8192 # Smaller batches

Quantization Options

Best quality-to-performance ratio for GPU deployment:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "deepseek-ai/DeepSeek-V4-Flash"
quant_path = "DeepSeek-V4-Flash-AWQ"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config={"zero_point": True, "q_group_size": 128})
model.save_quantized(quant_path)

Best for running on consumer hardware or CPU-only:

# Convert to GGUF
python convert-hf-to-gguf.py DeepSeek-V4-Flash --outtype q4_k_m

# Run with llama.cpp
./llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
    -p "Explain quantum computing" \
    -n 512 \
    --ctx-size 32768

GPTQ (High compression)

Maximum compression for constrained VRAM:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    quantize_config={"bits": 4, "group_size": 128}
)

Fine-Tuning

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    load_in_8bit=True,  # Quantize for memory efficiency
    device_map="auto"
)

lora_config = LoraConfig(
    r=16,                   # LoRA rank
    lora_alpha=32,          # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./deepseek-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
model.save_pretrained("./deepseek-lora-final")

QLoRA (For consumer GPUs)

4-bit quantization + LoRA for fine-tuning on a single 24GB GPU:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA as above

Prompting Differences: Self-Hosted vs API

AspectAPISelf-Hosted
Thinking modeNative supportDepends on inference engine
Context cachingAutomatic, disk-basedEngine-dependent (vLLM: enable-prefix-caching)
Temperature controlAlways availableAlways available
System promptFully supportedFully supported
Tool callsAPI handlesMust implement tool loop yourself
QualityFull precisionQuality loss from quantization possible

Self-Hosted Prompting Best Practices

  1. Test quantized models thoroughly — INT4 can subtly degrade reasoning quality. Always benchmark quantized vs FP16 on your specific task.

  2. Context caching isn't automatic — vLLM's --enable-prefix-caching provides similar behavior, but other engines may not.

  3. Thinking mode may require engine support — Not all inference engines support DeepSeek's thinking mode. Test with a simple reasoning task first.

  4. Fine-tuned models need prompting adjustments — A fine-tuned model may respond differently to the same prompts. Re-optimize your prompts after fine-tuning.

  5. Temperature behaves differently — Quantized models can have altered temperature response curves. Start with lower temperatures.

Note:

Pro Move: For production self-hosting, run V4 Flash (AWQ quantized) on 2× A6000 GPUs. This gives you near-API quality at a fixed monthly cost. Reserve V4 Pro for tasks where the quality difference justifies 2-4× the GPU cost.

Note:

Quantization quality warning: INT4 quantization of MoE models can disproportionately affect the routing mechanism. Test your quantized model on tasks that require the model to route between different expert domains. If quality degrades unexpectedly, step up to INT8.