Soft Prompting: Trainable Embeddings as Prompts

Replace text prompts with learned continuous vectors. Understand prompt tuning, prefix tuning, and p-tuning for open models where fine-tuning is impractical.

June 10, 2026
Tags: soft-prompting, prompt-tuning, prefix-tuning, peft, prompt-engineering

Hard Prompts vs Soft Prompts

| Aspect | Hard Prompt (Text) | Soft Prompt (Embeddings) |
|---|---|---|
| What it is | Human-written text instructions | Learned continuous vectors |
| How it's created | Written and iterated by humans | Trained via backpropagation |
| Interpretable? | Yes: you can read it | No: vector salad |
| Requires training? | No | Yes: needs labeled data |
| Model modification? | No: inference only | No: model weights frozen |
| Provider support | All providers | Open models only (Llama, Mistral) |

Soft prompting (Lester et al. 2021) trades interpretability for efficiency: train a tiny set of prompt embeddings while keeping the model frozen. Useful when you have labeled data but fine-tuning a billion-parameter model is impractical.

How It Works

Instead of prepending text, prepend learnable embedding vectors:

Hard prompt:
"Classify the sentiment: [input text]"

Soft prompt:
[vect_1] [vect_2] [vect_3] ... [vect_N] [input embedding]
                                    ↑
                              Trained via backprop, model frozen

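During training, only the prompt vectors are updated. The model processes the prompt vectors concatenated with the input embeddings along the sequence dimension (a concatenation, not an element-wise sum). The loss backpropagates through the frozen model, but the optimizer updates only the prompt vectors.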
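To make the mechanics concrete, here is a minimal pure-Python sketch of that training loop. Everything in it is a toy assumption rather than a real language model: the "frozen model" is just mean-pooling followed by a fixed linear layer and a sigmoid, and the names `FROZEN_W`, `forward`, and `train_step` are made up for illustration. The point is only to show the prompt vectors being concatenated in front of the input embeddings and receiving all of the updates.

```python
import math
import random

random.seed(0)
DIM = 4          # embedding width of the toy model
N_PROMPT = 2     # number of trainable soft-prompt vectors

# Frozen "model": mean-pool the sequence, then a fixed linear layer + sigmoid.
FROZEN_W = [0.5, -0.3, 0.8, 0.1]

def forward(prompt, input_embeds):
    """Run the frozen model on [prompt vectors ; input embeddings]."""
    seq = prompt + input_embeds  # concatenate along the sequence axis
    pooled = [sum(tok[j] for tok in seq) / len(seq) for j in range(DIM)]
    logit = sum(w * x for w, x in zip(FROZEN_W, pooled))
    return 1.0 / (1.0 + math.exp(-logit))

def train_step(prompt, input_embeds, label, lr=0.5):
    """One SGD step on binary cross-entropy; only `prompt` is modified."""
    p = forward(prompt, input_embeds)
    seq_len = len(prompt) + len(input_embeds)
    # dLoss/dlogit = p - label, and dlogit/dprompt[i][j] = FROZEN_W[j] / seq_len
    for vec in prompt:
        for j in range(DIM):
            vec[j] -= lr * (p - label) * FROZEN_W[j] / seq_len
    return -math.log(p if label == 1 else 1.0 - p)  # loss before the update

prompt = [[random.uniform(-0.1, 0.1) for _ in range(DIM)] for _ in range(N_PROMPT)]
example = [[0.2, 0.1, -0.4, 0.3], [0.0, 0.5, -0.2, 0.1]]  # stand-in input embeddings
frozen_before = list(FROZEN_W)

losses = [train_step(prompt, example, label=1) for _ in range(200)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("frozen weights unchanged:", FROZEN_W == frozen_before)
```

In a real setup the gradient would come from autograd through a full transformer, but the division of labor is the same: the model's weights are read, never written.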

Variants

| Method | What It Tunes | Where It Goes | Key Paper |
|---|---|---|---|
| Prompt Tuning | Virtual token embeddings at the input layer only | Prepended to the input embeddings | Lester et al. 2021 |
| Prefix Tuning | Prefix vectors at every transformer layer | Prepended to keys/values at each layer | Li & Liang 2021 |
| P-Tuning v2 | Deep prompt tokens at every layer | Continuous prompts throughout model depth | Liu et al. 2022 |
| LoRA | Low-rank adapter matrices (not technically soft prompting) | Injected into attention layers | Hu et al. 2022 |
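The practical difference between the shallow and deep variants shows up in how many vectors they learn. The sketch below is a rough count under stated assumptions: a Llama 2 7B-like shape (hidden size 4096, 32 layers, keys and values as wide as the hidden size), and no reparameterization network, which some prefix-tuning setups add during training. `ModelShape` and both functions are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass

@dataclass
class ModelShape:
    hidden_size: int
    num_layers: int

def prompt_tuning_params(shape: ModelShape, num_virtual_tokens: int) -> int:
    # One trainable vector per virtual token, at the input layer only.
    return num_virtual_tokens * shape.hidden_size

def deep_prompt_params(shape: ModelShape, num_virtual_tokens: int) -> int:
    # One key vector and one value vector per virtual token, at every layer.
    # Assumes keys/values have width hidden_size (no grouped-query attention).
    return num_virtual_tokens * shape.num_layers * 2 * shape.hidden_size

llama2_7b = ModelShape(hidden_size=4096, num_layers=32)  # assumed config values
shallow = prompt_tuning_params(llama2_7b, 20)
deep = deep_prompt_params(llama2_7b, 20)
print(f"prompt tuning: {shallow:,}  deep prompts: {deep:,}  ratio: {deep // shallow}x")
```

Under these assumptions the deep variants learn 2 × num_layers times as many parameters for the same number of virtual tokens, which is still a small fraction of the model.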

Parameter Efficiency

A soft prompt is tiny compared to the model:

| Component | Parameters (Llama 2 7B) |
|---|---|
| Full model | 7 billion |
| Full fine-tuning | 7 billion (all updated) |
| LoRA | ~8 million |
| Soft prompt (100 tokens) | ~409,600 |
| Soft prompt (20 tokens) | ~81,920 |

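Because gradients and optimizer state are kept only for the prompt vectors, a soft prompt can often be trained on a single GPU in far less time than full fine-tuning. Each step still runs a forward and backward pass through the full frozen model, so the saving is mostly in memory and in the number of parameters to update.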
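The table's numbers can be reproduced with a few lines of arithmetic. The sketch below assumes a hidden size of 4096 and 32 layers for Llama 2 7B. The article does not say which LoRA configuration gives "~8 million", so rank 16 on two square attention projections per layer is a guess that happens to land close to that figure. The function names are illustrative.

```python
HIDDEN_SIZE = 4096            # assumed Llama 2 7B embedding width
NUM_LAYERS = 32               # assumed Llama 2 7B depth
TOTAL_PARAMS = 7_000_000_000  # rounded figure used in the table

def soft_prompt_params(num_tokens: int) -> int:
    # One hidden-size vector per virtual token.
    return num_tokens * HIDDEN_SIZE

def lora_params(rank: int, modules_per_layer: int = 2) -> int:
    # Each adapted square projection adds A (rank x d) and B (d x rank).
    return NUM_LAYERS * modules_per_layer * 2 * HIDDEN_SIZE * rank

rows = {
    "Soft prompt (20 tokens)": soft_prompt_params(20),
    "Soft prompt (100 tokens)": soft_prompt_params(100),
    "LoRA (assumed r=16, two projections per layer)": lora_params(16),
}
for name, n in rows.items():
    size_kib = n * 4 / 1024  # fp32 storage at 4 bytes per parameter
    print(f"{name}: {n:,} params, {n / TOTAL_PARAMS:.5%} of the model, {size_kib:,.0f} KiB")
```

At 4 bytes per parameter, a 20-token prompt comes to 320 KiB, which is why a trained soft prompt can be stored and shipped as a very small file.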

When Soft Prompting Makes Sense

Use soft prompting when:

  • You have labeled task data (roughly 100 to 1,000+ examples)
  • You need to run the same task repeatedly (classification, extraction at scale)
  • You're using open-weight models (Llama, Mistral, Qwen) where you control inference
  • You want to leave the model weights untouched (unlike fine-tuning, there is no risk of overwriting existing capabilities)

Don't use soft prompting when:

  • You're using OpenAI, Anthropic, or Google APIs (they don't expose embedding injection)
  • You have no training data (soft prompts must be trained)
  • The task changes frequently (retraining overhead defeats the purpose)
  • You need interpretable prompts (soft prompts are opaque vectors)

Implementation with HuggingFace PEFT

from peft import PromptTuningConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

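model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# The Llama 2 tokenizer has no pad token by default; reuse EOS so batches can be padded
tokenizer.pad_token = tokenizer.eos_token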

# Define soft prompt config
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,         # 20 learnable prompt tokens
    prompt_tuning_init="TEXT",     # Initialize from text
    prompt_tuning_init_text="Classify the sentiment of this review:",
    tokenizer_name_or_path="meta-llama/Llama-2-7b-hf",
)

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# Expect roughly: trainable params: 81,920 (20 tokens x 4096 dims) out of ~6.74B total, about 0.0012%
# (the exact wording of this line varies between PEFT versions)

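# Train with the standard Trainer.
# `dataset` is assumed to be an already-tokenized dataset whose rows contain
# input_ids, attention_mask, and labels, padded to a common length.
from transformers import Trainer, TrainingArguments
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./soft-prompt",
        num_train_epochs=10,
        # Assumption: a starting point to tune. Prompt tuning usually needs a
        # much higher learning rate than the Trainer default of 5e-5.
        learning_rate=3e-2,
    ),
    train_dataset=dataset,
)
trainer.train()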

# Save just the soft prompt — tiny file
model.save_pretrained("./my-soft-prompt")
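The training snippet uses a `dataset` variable without showing how it is built. One possible way to prepare each row for a causal LM is sketched below: concatenate the input text and the label, and mask the text positions in `labels` so the loss is computed only on the label tokens. `build_example` and `toy_encode` are hypothetical helpers, and the toy tokenizer exists only so the sketch runs without downloading a model.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # label value that PyTorch cross-entropy ignores by default

def build_example(
    encode: Callable[[str], List[int]],
    text: str,
    label: str,
    eos_id: int,
    pad_id: int,
    max_len: int,
) -> Dict[str, List[int]]:
    """Tokenize one (text, label) pair so loss is computed only on the label tokens."""
    text_ids = encode(text)
    label_ids = encode(label) + [eos_id]
    # Truncation cuts from the right, so a very long text can push the label out;
    # choose max_len with that in mind.
    input_ids = (text_ids + label_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(text_ids) + label_ids)[:max_len]
    attention_mask = [1] * len(input_ids)
    pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * pad,
        "attention_mask": attention_mask + [0] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }

# Stand-in tokenizer for illustration: one id per whitespace-separated word.
vocab: Dict[str, int] = {}

def toy_encode(s: str) -> List[int]:
    return [vocab.setdefault(w, len(vocab) + 2) for w in s.split()]

row = build_example(toy_encode, "great phone", "positive", eos_id=1, pad_id=0, max_len=6)
print(row)
```

With a real tokenizer, `encode` could be something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`, with `eos_id` and `pad_id` taken from `tokenizer.eos_token_id` and `tokenizer.pad_token_id`. The resulting rows can then be wrapped in whatever dataset class you pass to the Trainer.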

Loading and Using a Trained Soft Prompt

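from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = PeftModel.from_pretrained(base_model, "./my-soft-prompt")
model.eval()

# Inference: PEFT prepends the soft prompt to the input embeddings automatically
inputs = tokenizer("This product exceeded my expectations.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
# Prints the input text followed by the generated continuation,
# which should be the sentiment label if training went well
print(tokenizer.decode(outputs[0], skip_special_tokens=True))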

Limitations

No API support. OpenAI, Anthropic, and Google do not expose model internals for embedding injection. Soft prompting is only viable with self-hosted open models.

Not interpretable. You can't read a soft prompt to understand what it learned. The tradeoff for efficiency is opacity.

Task-specific. Each soft prompt is trained for one task. Changing tasks means training a new one. You can swap prompt files quickly, but you can't generalize across tasks.

Requires training data. Soft prompts need labeled examples. If you have zero training data, stick with hard prompt engineering.

Training instability. Short prompts can be sensitive to initialization and to hyperparameters, especially the learning rate. Starting from prompt_tuning_init="TEXT" rather than random initialization usually gives a more stable starting point.
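Text initialization raises a practical question: what happens when the init text does not tokenize to exactly `num_virtual_tokens` tokens? The sketch below shows one reasonable policy, truncating long text and repeating short text. To my understanding this is close to what PEFT does, but treat that as an assumption and check the library source for the version you use. `init_token_ids` is a hypothetical helper.

```python
import math
from typing import List

def init_token_ids(text_ids: List[int], num_virtual_tokens: int) -> List[int]:
    """Fit tokenized init text to exactly `num_virtual_tokens` ids by truncating or repeating."""
    if not text_ids:
        raise ValueError("init text must tokenize to at least one token")
    reps = math.ceil(num_virtual_tokens / len(text_ids))
    return (text_ids * reps)[:num_virtual_tokens]

short = init_token_ids([11, 12, 13], 8)        # 3 tokens stretched to 8
long = init_token_ids(list(range(30)), 20)     # 30 tokens cut to 20
print(short)
print(long)
```

The soft prompt would then start as copies of the model's embedding rows for those ids, so training begins from vectors the model already knows how to interpret.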