DeepSeek V4 Pro and Flash are fully open-weight — available on HuggingFace with a permissive license. The same models that power the API can run on your own hardware with full control over inference parameters, quantization settings, and fine-tuning. For organizations with data sovereignty requirements, predictable workloads, or custom domain adaptation needs, self-hosting is viable.

But V4 Pro is 1.6T total parameters (49B active via Mixture of Experts). This is not a weekend Raspberry Pi project. The pages below cover the practical realities of self-hosting — from GPU requirements to fine-tuning workflows.

Self-Hosting vs API: Decision Framework

Choose the API when:

Monthly token volume is under 50M (break-even is ~50M tokens/month for Pro on dedicated GPUs)
You need burst capacity (API handles spikes without idle GPU costs)
You don't want to manage infrastructure
You're prototyping or have unpredictable workloads

Choose self-hosting when:

Data sovereignty requires on-premise processing
Monthly volume is high and predictable (>100M tokens/month)
You need custom fine-tuning with proprietary data
Latency requirements can't accept API network overhead
You need guaranteed capacity (no rate limits)

GPU Requirements

V4 Pro (1.6T total / 49B active MoE)

Precision	VRAM Required	Recommended GPUs
FP16	~100GB	2× A100 80GB, 4× A6000 48GB
INT8 (AWQ)	~50GB	1× A100 80GB, 2× A6000 48GB
INT4 (GPTQ)	~25GB	1× A6000 48GB, 1× RTX 4090 24GB
Q4_K_M (GGUF)	~28GB	1× A6000 48GB, 1× RTX 4090 24GB

V4 Flash (284B total / 13B active MoE)

Precision	VRAM Required	Recommended GPUs
FP16	~28GB	1× A6000 48GB, 1× RTX 4090 24GB
INT8 (AWQ)	~14GB	1× RTX 4090 24GB, 1× RTX 3090 24GB
INT4 (GPTQ)	~7GB	1× RTX 3060 12GB, Apple M2 Max 32GB
Q4_K_M (GGUF)	~8GB	1× RTX 4060 Ti 16GB

Deployment with vLLM

vLLM is the recommended inference engine for DeepSeek:

pip install vllm

# Serve V4 Flash (AWQ quantized)
vllm serve deepseek-ai/DeepSeek-V4-Flash-AWQ \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95

# Serve V4 Pro (FP16 on 2× A100)
vllm serve deepseek-ai/DeepSeek-V4-Pro \
    --tensor-parallel-size 2 \
    --max-model-len 131072

vLLM Configuration Tips

# Optimize for throughput
--max-num-seqs 256           # Concurrent sequences
--max-model-len 131072       # Limit context to manage VRAM
--gpu-memory-utilization 0.95 # Use nearly all GPU memory
--enable-prefix-caching       # Enable prefix caching (similar to API)
--enable-chunked-prefill      # Better throughput for long prompts

# Optimize for latency
--max-num-seqs 8             # Fewer concurrent sequences
--max-num-batched-tokens 8192 # Smaller batches

Quantization Options

AWQ (Recommended for GPU inference)

Best quality-to-performance ratio for GPU deployment:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "deepseek-ai/DeepSeek-V4-Flash"
quant_path = "DeepSeek-V4-Flash-AWQ"

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

model.quantize(tokenizer, quant_config={"zero_point": True, "q_group_size": 128})
model.save_quantized(quant_path)

GGUF (Recommended for local/CPU inference)

Best for running on consumer hardware or CPU-only:

# Convert to GGUF
python convert-hf-to-gguf.py DeepSeek-V4-Flash --outtype q4_k_m

# Run with llama.cpp
./llama-cli -m DeepSeek-V4-Flash-Q4_K_M.gguf \
    -p "Explain quantum computing" \
    -n 512 \
    --ctx-size 32768

GPTQ (High compression)

Maximum compression for constrained VRAM:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    quantize_config={"bits": 4, "group_size": 128}
)

Fine-Tuning

LoRA (Recommended for most use cases)

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    load_in_8bit=True,  # Quantize for memory efficiency
    device_map="auto"
)

lora_config = LoraConfig(
    r=16,                   # LoRA rank
    lora_alpha=32,          # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="./deepseek-lora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
model.save_pretrained("./deepseek-lora-final")

QLoRA (For consumer GPUs)

4-bit quantization + LoRA for fine-tuning on a single 24GB GPU:

from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-V4-Flash",
    quantization_config=bnb_config,
    device_map="auto"
)
# Then apply LoRA as above

Prompting Differences: Self-Hosted vs API

Aspect	API	Self-Hosted
Thinking mode	Native support	Depends on inference engine
Context caching	Automatic, disk-based	Engine-dependent (vLLM: enable-prefix-caching)
Temperature control	Always available	Always available
System prompt	Fully supported	Fully supported
Tool calls	API handles	Must implement tool loop yourself
Quality	Full precision	Quality loss from quantization possible

Self-Hosted Prompting Best Practices

Test quantized models thoroughly — INT4 can subtly degrade reasoning quality. Always benchmark quantized vs FP16 on your specific task.
Context caching isn't automatic — vLLM's --enable-prefix-caching provides similar behavior, but other engines may not.
Thinking mode may require engine support — Not all inference engines support DeepSeek's thinking mode. Test with a simple reasoning task first.
Fine-tuned models need prompting adjustments — A fine-tuned model may respond differently to the same prompts. Re-optimize your prompts after fine-tuning.
Temperature behaves differently — Quantized models can have altered temperature response curves. Start with lower temperatures.

Note:

Pro Move: For production self-hosting, run V4 Flash (AWQ quantized) on 2× A6000 GPUs. This gives you near-API quality at a fixed monthly cost. Reserve V4 Pro for tasks where the quality difference justifies 2-4× the GPU cost.

Note:

Quantization quality warning: INT4 quantization of MoE models can disproportionately affect the routing mechanism. Test your quantized model on tasks that require the model to route between different expert domains. If quality degrades unexpectedly, step up to INT8.

Flash vs Pro — Self-hosting economics change depending on which model you deploy.
Cost Optimization Patterns — Compare API costs vs self-hosting GPU costs to determine break-even.

DeepSeek Open Weights: Self-Hosting & Fine-Tuning Guide

Self-Hosting vs API: Decision Framework

Choose the API when:

Choose self-hosting when:

GPU Requirements

V4 Pro (1.6T total / 49B active MoE)

V4 Flash (284B total / 13B active MoE)

Deployment with vLLM

vLLM Configuration Tips

Quantization Options

AWQ (Recommended for GPU inference)

GGUF (Recommended for local/CPU inference)

GPTQ (High compression)

Fine-Tuning

LoRA (Recommended for most use cases)

QLoRA (For consumer GPUs)

Prompting Differences: Self-Hosted vs API

Self-Hosted Prompting Best Practices

Related Articles

Create Fantasy Characters in Midjourney - Complete Guide

Furniture & Decor Prompts: Custom Design

Product Mockup Prompts: E-commerce Photography

On this page

DeepSeek Open Weights: Self-Hosting & Fine-Tuning Guide

Self-Hosting vs API: Decision Framework

Choose the API when:

Choose self-hosting when:

GPU Requirements

V4 Pro (1.6T total / 49B active MoE)

V4 Flash (284B total / 13B active MoE)

Deployment with vLLM

vLLM Configuration Tips

Quantization Options

AWQ (Recommended for GPU inference)

GGUF (Recommended for local/CPU inference)

GPTQ (High compression)

Fine-Tuning

LoRA (Recommended for most use cases)

QLoRA (For consumer GPUs)

Prompting Differences: Self-Hosted vs API

Self-Hosted Prompting Best Practices

Related Pages

Related Articles

Create Fantasy Characters in Midjourney - Complete Guide

Furniture & Decor Prompts: Custom Design

Product Mockup Prompts: E-commerce Photography

On this page