Transformers documentation
Parameter-efficient fine-tuning
Parameter-efficient fine-tuning (PEFT) methods only fine-tune a small number of extra model parameters (adapters) on top of a pretrained model. Because only adapter parameters are updated, the optimizer tracks far fewer gradients and states, reducing memory usage significantly. Adapters are lightweight, making them convenient to share, store, and load.
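A rough back-of-the-envelope calculation shows why the savings are large. The sizes below are hypothetical (one square attention projection matrix and a small LoRA rank), not taken from any particular model:

```python
# Hypothetical sizes: one d x d projection matrix, LoRA rank r
d, r = 4096, 8

full_params = d * d      # parameters updated by full fine-tuning
lora_params = 2 * r * d  # LoRA adds two low-rank factors: A (r x d) and B (d x r)

print(full_params)                  # 16777216
print(lora_params)                  # 65536
print(full_params // lora_params)   # 256x fewer trainable parameters
```

The optimizer state (gradients, momentum, variance) scales with the trainable parameter count, so this ratio translates directly into memory savings during training.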
Transformers integrates directly with the PEFT library through PeftAdapterMixin, added to all PreTrainedModel classes. You can load, add, train, switch, and delete adapters without wrapping your model in a separate PeftModel. All non-prompt-learning PEFT methods are supported (LoRA, IA3, AdaLoRA). Prompt-based methods like prompt tuning and prefix tuning require using the PEFT library directly.
Install PEFT to get started. The integration requires peft >= 0.18.0.
pip install -U peft
Add an adapter
Create a PEFT config, such as LoraConfig, and attach it to a model with add_adapter().
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
lora_config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
inference_mode=False,
r=8,
lora_alpha=32,
lora_dropout=0.1,
)
model.add_adapter(lora_config, adapter_name="my_adapter")

Fully fine-tuning specific layers
To train additional modules alongside an adapter, list them in modules_to_save. These modules are fully fine-tuned, so all of their parameters are updated. This is useful when certain layers need full updates, for example the language model head (lm_head) when adapting a causal LM for sequence classification.
lora_config = LoraConfig(
modules_to_save=["lm_head"],
...
)
model.add_adapter(lora_config)

Choosing which layers to adapt
For common architectures (Llama, Gemma, Qwen, etc.), PEFT has predefined default targets (like q_proj and v_proj), so you don’t need to specify target_modules. If you want to target different layers, or the model doesn’t have predefined targets, pass target_modules explicitly as a list of module names or a regex pattern.
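To get an intuition for how a list versus a regex selects modules, the sketch below approximates the matching over hypothetical module names with plain Python and re. This is only an illustration; PEFT's actual matching logic lives in the library:

```python
import re

# Hypothetical module names from one transformer block
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj",
]

def matches(target, name):
    """Approximate target_modules semantics: a list matches by module-name
    suffix, a string is treated as a regex over the full name."""
    if isinstance(target, list):
        return any(name == t or name.endswith("." + t) for t in target)
    return re.fullmatch(target, name) is not None

print([n for n in names if matches(["q_proj", "k_proj"], n)])
# ['model.layers.0.self_attn.q_proj', 'model.layers.0.self_attn.k_proj']
print([n for n in names if matches(r".*self_attn\.(q|v)_proj", n)])
# ['model.layers.0.self_attn.q_proj', 'model.layers.0.self_attn.v_proj']
```

A regex is handy when the same short name appears in places you don't want to adapt, since it lets you constrain the full path.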
lora_config = LoraConfig(
target_modules=["q_proj", "k_proj"],
...
)
model.add_adapter(lora_config)

Training
Pass the model with an attached adapter to Trainer and call train(). Trainer only updates the adapter parameters (those with requires_grad=True) because the base model is frozen.
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=4,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
)
trainer.train()

During training, Trainer checkpoints contain only the adapter weights (adapter_model.safetensors) and configuration (adapter_config.json), keeping checkpoints small. The base model isn’t included.
After training, save the final adapter with save_pretrained().
model.save_pretrained("./my_adapter")

Resuming from a checkpoint
Trainer automatically detects adapter checkpoints when resuming. Trainer scans the checkpoint directory for subdirectories containing adapter weights and reloads each adapter with the correct trainable state.
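The detection step can be sketched roughly like this. It is a simplified stdlib approximation of the behavior described above, not Trainer's actual implementation; it relies only on the adapter_config.json file that every saved adapter contains:

```python
import os

def find_adapter_dirs(checkpoint_dir):
    """Return subdirectories that look like saved PEFT adapters,
    identified by the presence of an adapter_config.json file."""
    adapter_dirs = []
    for entry in sorted(os.listdir(checkpoint_dir)):
        path = os.path.join(checkpoint_dir, entry)
        if os.path.isdir(path) and os.path.isfile(
            os.path.join(path, "adapter_config.json")
        ):
            adapter_dirs.append(path)
    return adapter_dirs
```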
trainer.train(resume_from_checkpoint="./output/checkpoint-1000")

Distributed training
PEFT adapters work with distributed training out of the box.
For ZeRO-3, Trainer passes exclude_frozen_parameters=True when saving checkpoints with a PEFT model. Frozen base model weights are skipped. Only the trainable adapter parameters are saved, reducing checkpoint size and save time.
For FSDP, Trainer updates the FSDP auto-wrap policy to correctly handle LoRA layers. For QLoRA (quantized base model + LoRA), Trainer also adjusts the mixed precision policy to match the quantization storage dtype.
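As a minimal sketch, a ZeRO-3 run might use a DeepSpeed config like the one below, passed to TrainingArguments(deepspeed=...) either as a dict or as a path to a JSON file with the same content. The specific values are illustrative, and "auto" entries are filled in by Trainer from TrainingArguments:

```python
# Illustrative DeepSpeed ZeRO-3 config for use with Trainer.
# "auto" values are resolved by Trainer from TrainingArguments.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```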
Loading an adapter
To load an adapter, the Hub repository or local directory must contain an adapter_config.json file and the adapter weights.
from_pretrained() automatically detects adapters. When it finds an adapter_config.json, it reads the base_model_name_or_path field to load the correct base model, then loads the adapter on top.
from transformers import AutoModelForCausalLM
# Automatically loads the base model and attaches the adapter
model = AutoModelForCausalLM.from_pretrained("klcsp/gemma7b-lora-alpaca-11-v1")

For large models, load a quantized version in 8-bit or 4-bit precision with bitsandbytes to save memory. Add device_map="auto" to distribute the model across available hardware.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
model = AutoModelForCausalLM.from_pretrained(
"klcsp/gemma7b-lora-alpaca-11-v1",
quantization_config=BitsAndBytesConfig(load_in_8bit=True),
device_map="auto",
)

Managing multiple adapters
A model can hold multiple adapters at once. Add adapters with unique names, and switch between them as needed.
from peft import LoraConfig
model.add_adapter(LoraConfig(r=8, lora_alpha=32), adapter_name="adapter_1")
model.add_adapter(LoraConfig(r=16, lora_alpha=64), adapter_name="adapter_2")

Use set_adapter() to activate a specific adapter. The other adapters are disabled but remain in memory.
model.set_adapter("adapter_2")

enable_adapters() enables all attached adapters, and disable_adapters() disables all of them.
# Disable all adapters for base model inference
model.disable_adapters()
# Re-enable all adapters
model.enable_adapters()

Use active_adapters() to see which adapters are currently active.
model.active_adapters()
# ["adapter_1"]

Remove adapters you no longer need with delete_adapter() to free memory.
model.delete_adapter("adapter_1")

Hotswapping adapters
Loading a new adapter each time you serve a request allocates new memory. If the model is compiled with torch.compile, each new adapter triggers recompilation. Hotswapping replaces adapter weights in-place, avoiding both issues. Only LoRA adapters are supported.
Pass hotswap=True when loading a LoRA adapter to swap its weights into an existing adapter slot. Set adapter_name to the name of the adapter to replace ("default" is the default adapter name).
model = AutoModel.from_pretrained(...)
# Load the first adapter normally
model.load_adapter(adapter_path_1)
# Generate outputs with adapter 1
...
# Hotswap the second adapter in-place
model.load_adapter(adapter_path_2, hotswap=True, adapter_name="default")
# Generate outputs with adapter 2

torch.compile
For compiled models, call enable_peft_hotswap() before loading the first adapter and before compiling.
model = AutoModel.from_pretrained(...)
max_rank = ... # highest rank among all LoRAs you'll load
model.enable_peft_hotswap(target_rank=max_rank)
model.load_adapter(adapter_path_1, adapter_name="default")
model = torch.compile(model, ...)
output_1 = model(...)
# Hotswap without recompilation
model.load_adapter(adapter_path_2, adapter_name="default")
output_2 = model(...)

The target_rank argument sets the maximum rank among all LoRA adapters you’ll load. If you have adapters with rank 8 and rank 16, pass target_rank=16. The default is 128.
After calling enable_peft_hotswap, all subsequent load_adapter calls hotswap by default. Pass hotswap=False explicitly to disable hotswapping.
Recompilation may still occur if the hotswapped adapter targets more layers than the initial adapter. Load the adapter that targets the most layers first to avoid recompilation.
To detect unexpected recompilation, wrap your code in a with torch._dynamo.config.patch(error_on_recompile=True) block. If recompilation occurs despite following the steps above, open an issue with PEFT and include a reproducible example.
Next steps
- The PEFT documentation covers the full range of PEFT methods and options.
- The PEFT hotswapping reference details limitations and edge cases.
- A blog post benchmarks how torch.compile with hotswapping improves runtime.