---
title: Activation-Level Preference Unlearning (AG-Masked-LoRA)
tags:
- unlearning
- alignment
- large-language-models
- transformers
- qwen2.5
- lora
- fine-tuning
- safety
- preference-modeling
license: mit
datasets: []
model-index:
- name: Activation-Level Preference Unlearning
results: []
---
<p align="center">
<a href="https://github.com/rameyjm7/llm-preference-unlearning">
<img src="https://img.shields.io/badge/Status-Research%20Project-success?style=flat-square" />
</a>
<a href="https://github.com/rameyjm7/llm-preference-unlearning/blob/main/LICENSE">
<img src="https://img.shields.io/badge/License-MIT-blue?style=flat-square" />
</a>
<img src="https://img.shields.io/badge/Python-3.10+-blue?style=flat-square" />
<img src="https://img.shields.io/badge/Model-Qwen2.5--3B-orange?style=flat-square" />
<img src="https://img.shields.io/badge/Tasks-Activation%20Probing%20%7C%20Unlearning-yellow?style=flat-square" />
<img src="https://img.shields.io/badge/Compute-A100%20%7C%20L4%20%7C%20Jetson%20Orin-lightgrey?style=flat-square" />
<img src="https://img.shields.io/badge/Framework-Transformers%20%7C%20PyTorch-red?style=flat-square" />
<img src="https://img.shields.io/badge/Notebooks-Jupyter-green?style=flat-square" />
</p>
# Activation-Level Preference Unlearning (AG-Masked-LoRA)
### Removing Latent Concepts While Preserving Global LLM Reasoning
---
## Abstract
Large Language Models (LLMs) increasingly power recommender systems, yet they often exhibit unstable or biased preference formation. Minor variations in prompt phrasing can activate different internal representations, leading to inconsistent or policy-violating outputs.
This project introduces **Activation-Guided Masked LoRA (AG-Masked-LoRA)**, a targeted unlearning method that identifies and suppresses the *activation subspace* responsible for an undesired concept—demonstrated here with **movie-title generation (“Inception”)**.
Our pipeline integrates:
- Activation probing
- Prompt perturbation stability analysis
- Gradient and saliency mapping
- Fisher information profiling
- Subspace-masked LoRA training
- Incremental concept-level unlearning
Results show that the model cleanly forgets the targeted concept while preserving reasoning, fluency, and instruction fidelity.
---
## Motivation
LLM-based recommendation and generation systems embed user intent, item associations, and implicit priors in high-dimensional activation pathways. While powerful, this implicit encoding creates challenges:
1. Over-specific or incorrect recommendations due to activation drift.
2. Entrenched behaviors from prior fine-tuning.
3. Difficulty suppressing copyrighted, unsafe, or policy-restricted content.
4. Entanglement of desirable and undesirable behaviors within shared neuron groups.
Understanding how specific prompts activate internal representations is critical both for trustworthy recommenders and for enterprise-grade safety alignment.
**Activation-guided unlearning** specifically addresses this: by identifying which neurons encode an unwanted concept and restricting LoRA updates to that region of latent space, we can *remove* a capability rather than merely filtering tokens.
---
# Phase 1 — Prompt Perturbation & Instability Analysis
Prompt variations intended to be semantically identical yield inconsistent movie-title recommendations, revealing instability in how Qwen2.5-3B processes preference queries.
<p align="center">
<img width="360" height="250" src="https://github.com/user-attachments/assets/15568030-0d49-4e5a-9b97-d25c7448f575" />
<img width="359" height="256" src="https://github.com/user-attachments/assets/69297c65-69fa-4fd7-9c8a-50c60226a2ed" />
</p>
**Figure 1.** Semantically equivalent prompts produce different responses, indicating latent-space sensitivity and inconsistent preference encoding.
<p align="center">
<img width="899" height="351" src="https://github.com/user-attachments/assets/b5563a21-e855-4bdc-8e50-fa86ed869067" />
</p>
**Figure 2.** Direct prompt-perturbation: phrasing changes alter the generated movie title, confirming activation-level instability.
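A minimal sketch of this perturbation probe is shown below. It assumes the `Qwen/Qwen2.5-3B-Instruct` checkpoint; the prompt list and greedy decoding settings are illustrative, not the exact ones used in the notebooks.

```python
# Minimal prompt-perturbation probe (illustrative, not the exact notebook code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Semantically equivalent phrasings of the same preference query.
prompts = [
    "Recommend one movie about dreams within dreams.",
    "Name a single film where dreams are nested inside dreams.",
    "What's a good movie involving layered dream worlds?",
]

for p in prompts:
    msgs = [{"role": "user", "content": p}]
    ids = tok.apply_chat_template(
        msgs, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(ids, max_new_tokens=32, do_sample=False)
    print(p, "->", tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True))
```

If the latent preference encoding were stable, all three prompts would surface the same title; divergent outputs are the instability signal shown in Figures 1–2.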
---
# Phase 2 — Activation Probing, Saliency, and Gradient Sensitivity
We analyze how each transformer layer responds when the model attempts to generate a movie title.
### Layerwise Gradient Sensitivity
<p align="center">
<img width="975" height="444" src="https://github.com/user-attachments/assets/ba76ef82-3a03-4a03-8ea1-6a840bf79bb2" />
</p>
**Figure 3.** Gradient sensitivity map showing which layers’ activations shift most strongly in response to movie-title prompting.
### Saliency (Gradient × Activation)
<p align="center">
<img width="975" height="498" src="https://github.com/user-attachments/assets/eea9dd69-1827-4545-bed8-1a9aa522f43f" />
</p>
**Figure 4.** Saliency heatmap identifying layers whose neurons strongly encode the movie-title concept.
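Both measurements can be sketched in a few lines, reusing the `model` and `tok` objects from the Phase 1 sketch: per-layer gradient norms approximate Figure 3, and the gradient × activation product approximates Figure 4. The per-layer reduction (mean absolute value) is an assumption, not necessarily the notebooks' exact aggregation.

```python
# Sketch: per-layer gradient norms (cf. Fig. 3) and gradient x activation
# saliency (cf. Fig. 4). Layer aggregation here is illustrative.
import torch

def layer_saliency(model, tok, prompt):
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**ids, labels=ids["input_ids"], output_hidden_states=True)
    for h in out.hidden_states:
        h.retain_grad()                 # keep grads on non-leaf activations
    model.zero_grad()
    out.loss.backward()
    grad_norm, saliency = [], []
    for h in out.hidden_states[1:]:     # index 0 is the embedding output
        grad_norm.append(h.grad.float().norm().item())
        saliency.append((h.grad * h).abs().float().mean().item())
    return grad_norm, saliency
```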
### Combined Sensitivity Analysis
<p align="center">
<img width="975" height="592" src="https://github.com/user-attachments/assets/62fb3c65-beb1-4995-8534-5eb645521956" />
</p>
**Figure 5.** Layerwise correlation of saliency, Fisher information, and activation similarity identifies a consistent high-impact region in mid-model layers.
---
# Phase 3 — Semantic Similarity vs Activation Structure
We measure whether semantic similarity across prompts matches activation-level similarity.
<p align="center">
<img width="975" height="797" src="https://github.com/user-attachments/assets/1de1b763-c637-4d2d-b9f6-d0afcb002748" />
</p>
**Figure 6.** Semantic similarity (top) vs activation overlap (bottom).
Prompts that *mean the same thing* do *not* necessarily activate the same neurons—revealing a root cause of preference drift.
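A sketch of the comparison follows, with `sentence-transformers` (`all-MiniLM-L6-v2`) standing in as an assumed semantic embedder and mean-pooled hidden states as the activation signature; both choices are illustrative, and `model`/`tok` are the objects from the Phase 1 sketch.

```python
# Sketch: semantic similarity (sentence embeddings) vs. activation similarity
# (mean-pooled hidden states). sentence-transformers is an assumed dependency.
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedder

def activation_vector(model, tok, prompt, layer=-1):
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states[layer]
    return hs.float().mean(dim=1).squeeze(0)        # mean-pool over tokens

def semantic_vs_activation(model, tok, a, b):
    sem = F.cosine_similarity(
        torch.tensor(embedder.encode(a)), torch.tensor(embedder.encode(b)), dim=0)
    act = F.cosine_similarity(
        activation_vector(model, tok, a), activation_vector(model, tok, b), dim=0)
    return sem.item(), act.item()
```

A high semantic score paired with a low activation score for the same prompt pair is exactly the mismatch visualized in Figure 6.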
---
# Phase 4 — Fisher Information Profiling
<p align="center">
<img width="975" height="403" src="https://github.com/user-attachments/assets/96306821-75d8-49c8-9d5f-26906a6d48e1" />
</p>
**Figure 7.** Mean gradient norm per layer, pinpointing where the model is most sensitive.
<p align="center">
<img width="975" height="511" src="https://github.com/user-attachments/assets/e05e5312-1a11-4f1e-b199-1ef3574504a8" />
</p>
**Figure 8.** Fisher information heatmap showing which neurons maintain the highest influence on movie-title generation.
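A common estimator for this profile, sketched below, is the diagonal empirical Fisher: the squared per-parameter gradient averaged over concept-eliciting prompts. Whether the notebooks use exactly this estimator is an assumption.

```python
# Sketch: diagonal empirical Fisher per parameter group, approximated by the
# mean squared gradient over a batch of concept-eliciting prompts.
import torch
from collections import defaultdict

def fisher_by_parameter(model, tok, prompts):
    fisher = defaultdict(float)
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        model.zero_grad()
        loss = model(**ids, labels=ids["input_ids"]).loss
        loss.backward()
        for name, param in model.named_parameters():
            if param.grad is not None:
                fisher[name] += param.grad.float().pow(2).sum().item()
    return {k: v / len(prompts) for k, v in fisher.items()}
```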
---
# Phase 5 — Activation-Guided Masked LoRA (AG-Masked-LoRA)
A low-rank update is selectively applied *only* to neurons identified as encoding the targeted concept.
The adapter is trained on prompts that normally elicit movie titles, with a **FORGOTTEN/UNKNOWN** string as the target output.
The update is masked to affect only sensitive neurons, leaving the rest of the model untouched.
<p align="center">
<img width="975" height="61" src="https://github.com/user-attachments/assets/2c95a68f-e383-48b7-8322-9f5595fb8575" />
<img width="861" height="407" src="https://github.com/user-attachments/assets/6de06210-da85-4958-986e-5085c5bd5a93" />
</p>
**Figure 9.** Incremental unlearning logs showing loss reduction while applying masked LoRA updates.
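One way to implement the masking described above is sketched below, under the assumption that the adapter follows peft's `lora_B` naming: a backward hook zeroes gradient rows outside the sensitive-neuron set, so optimizer steps can only move the targeted output dimensions. The `sensitive_idx` mapping is hypothetical and would come from the Phase 2–4 analyses.

```python
# Sketch: restrict a LoRA update to a set of "sensitive" output neurons by
# masking gradients on the lora_B rows (peft naming convention assumed).
import torch

def apply_neuron_mask(model, sensitive_idx):
    """sensitive_idx: {module_name: LongTensor of output-neuron indices}."""
    for name, module in model.named_modules():
        if hasattr(module, "lora_B") and name in sensitive_idx:
            B = module.lora_B["default"].weight       # shape (out_features, r)
            mask = torch.zeros(B.shape[0], 1, device=B.device, dtype=B.dtype)
            mask[sensitive_idx[name]] = 1.0
            # Zero gradients outside the sensitive rows on every backward pass.
            B.register_hook(lambda g, m=mask: g * m)
```

Because `lora_A` maps into the low-rank space and `lora_B` maps back out, masking the rows of `B` confines the rank-r update to the identified neurons while the frozen base weights are never touched.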
---
# Phase 6 — Evaluation: Before/After Unlearning
### Base Model (Before/After)
<p align="center">
<img width="975" height="456" src="https://github.com/user-attachments/assets/94f51feb-b1f2-4f48-b6d2-3d6a70a72205" />
</p>
### Unlearned Model (Before/After)
<p align="center">
<img width="975" height="219" src="https://github.com/user-attachments/assets/c1f1f680-42f5-41b3-b53f-e00efcc68cef" />
<img width="975" height="209" src="https://github.com/user-attachments/assets/d9242c27-3ccc-490d-a18e-4439424d2911" />
</p>
**Figures 10–11.** The unlearned model consistently returns FORGOTTEN/UNKNOWN across paraphrased prompts.
### Direct Concept Probing (Before/After)
<p align="center">
<img width="975" height="238" src="https://github.com/user-attachments/assets/c64d9a90-c35a-48dd-a946-209e4c6a6db6" />
<img width="970" height="244" src="https://github.com/user-attachments/assets/789d64a7-1361-4d0f-9d41-23259c920376" />
</p>
**Figure 12.** Even when asked explicitly about “Inception,” the model no longer retrieves or describes it—indicating true concept removal.
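A simple way to quantify this, sketched below: generate over the paraphrase set and count responses that no longer surface the concept. The string-match criterion is an illustrative stand-in for the notebooks' evaluation.

```python
# Sketch: forgetting rate = fraction of paraphrases whose output no longer
# contains the concept string (illustrative criterion).
def forgetting_rate(model, tok, paraphrases, concept="Inception"):
    forgotten = 0
    for p in paraphrases:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0, ids["input_ids"].shape[-1]:],
                          skip_special_tokens=True)
        forgotten += concept.lower() not in text.lower()  # True counts as 1
    return forgotten / len(paraphrases)

# Usage: compare forgetting_rate(base, tok, prompts)
# against forgetting_rate(unlearned, tok, prompts).
```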
---
# Final Findings
Our experiments confirm that AG-Masked-LoRA performs **structural semantic unlearning**, not superficial keyword suppression.
### Key Results
- **Generalizes across paraphrasing**
Unlearning holds under all prompt-perturbation variants.
- **Consistent neuron clusters identified**
Saliency + Fisher converge on the same mid-model layers.
- **Clear activation shift**
PCA and activation distance show pre/post separation.
- **Global reasoning preserved**
No degradation in unrelated tasks or instruction following.
- **Deployment ready**
Runs cleanly on A100, L4, and Jetson Orin.
### Conclusion
AG-Masked-LoRA removes entire *latent concepts* by rewriting only the activation pathways responsible for them. This makes it suitable for:
- Safety-critical filtering
- Policy enforcement
- Copyright-restricted retrieval removal
- Reversible domain-specific behavior modules
The base model remains unmodified—only a small adapter controls the behavior.
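In practice this means the unlearned behavior can be attached or detached at load time. A sketch with peft, assuming the adapter weights are published at the HF repo below (the model card lists them as optional artifacts):

```python
# Sketch: the base model stays frozen; behavior is toggled by attaching
# (or skipping) the adapter. The adapter repo id is an assumption.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Attach the unlearning adapter; skip this line to restore original behavior.
model = PeftModel.from_pretrained(base, "rameyjm7/llm-preference-unlearning")
```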
---
# Project Resources and Repositories
### GitHub — Full Source Code
**LLM Preference Unlearning (Activation-Level Framework)**
https://github.com/rameyjm7/llm-preference-unlearning
Includes:
- Modular notebooks (00–08)
- Unified pipeline notebook
- Activation probe scripts
- Saliency, gradient, Fisher analysis
- Incremental unlearning engine
- Figures and logs
### HuggingFace — Model Card & Artifacts
**Activation-Level Preference Unlearning (HF)**
https://huggingface.co/rameyjm7/llm-preference-unlearning
Includes:
- Model card
- Figures and evaluation
- Adapter artifacts (optional)
- Notebook links
---
## License
MIT License.