KL3M 170M, 6th Gen Model, 37K Checkpoint
A 170M-parameter language model trained on legal agreements using the Muon optimizer with spectral clamping.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 181.7M (170M non-embedding)
- Training Steps: 37,000
- Tokens Processed: 14.55 billion
- Sequence Length: 4,096 tokens
- Precision: BF16
- Optimizer: Muon with spectral regularization (max condition: 2000)
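These figures are mutually consistent: 37,000 steps × 96 sequences per step (the effective batch size listed under Training Configuration) × 4,096 tokens per sequence ≈ 14.55 billion tokens.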
Model Architecture
- Hidden Size: 576
- Layers: 30
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
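For orientation, the hyperparameters above map onto a standard transformers Llama configuration roughly as follows (a sketch; the authoritative config.json ships with the checkpoint and should be preferred):

from transformers import LlamaConfig

# Approximate reconstruction of the architecture from the figures listed above
config = LlamaConfig(
    hidden_size=576,
    num_hidden_layers=30,
    num_attention_heads=9,
    num_key_value_heads=3,         # GQA: 3 KV heads shared across 9 query heads
    intermediate_size=1536,
    vocab_size=131072,
    rope_theta=100_000.0,
    max_position_embeddings=4096,  # matches the 4,096-token training sequence length
)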
Training Configuration
Dataset
- Source: alea-institute/kl3m-data-sample-002-shuffled
- Type: Legal agreements (EDGAR filings, contracts)
- Format: Streaming, shuffled
Optimizer (Muon)
- Muon Learning Rate: 2.19e-4 (depth-scaled from 3e-4)
- Auxiliary Learning Rate: 3e-4
- Muon Weight Decay: 1e-4
- Auxiliary Weight Decay: 1e-3
- Muon Momentum: 0.95
- Batch Size: 6 per device
- Gradient Accumulation: 16 steps (effective batch: 96)
- Warmup Steps: 1,000
- LR Scheduler: Cosine with min ratio 0.1
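Muon applies momentum SGD to the gradients of 2-D weight matrices and then orthogonalizes each update with a Newton-Schulz iteration before the learning-rate step; parameters Muon does not handle (embeddings and other non-matrix weights) fall back to the auxiliary optimizer listed above. A minimal sketch of the orthogonalization step, following the public reference implementation of Muon rather than this project's training code:

import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration: drives the singular values of the
    # momentum-averaged gradient G toward 1, yielding a near-orthogonal update.
    # Coefficients follow the public Muon reference implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # ensure the top singular value is <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)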
Regularization
- Spectral Clamping (see the sketch after this list):
- Enabled on q_proj, o_proj, and lm_head
- Max condition number: 2000
- Sigma floor: 1e-4
- Applied every 10 steps (every 960 samples)
- Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
- Label Smoothing: 0.01
- Entropy Regularization:
- Entropy bonus weight: 0.005
- Entropy target: 6.5 bits (weight: 0.005)
- Activation norm weight: 0.001
- Loss chunk size: 1024 tokens
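A minimal sketch of what the spectral clamping configured above might look like (illustrative only; the actual implementation lives in the training code): after an SVD of a targeted weight, small singular values are raised so that the condition number σ_max/σ_min stays at or below 2,000 and no value falls under the 1e-4 floor.

import torch

@torch.no_grad()
def spectral_clamp(weight: torch.Tensor, max_cond: float = 2000.0,
                   sigma_floor: float = 1e-4) -> torch.Tensor:
    # Illustrative: clamp the singular spectrum of a 2-D weight (here q_proj,
    # o_proj, lm_head) so its condition number stays below max_cond,
    # with an absolute floor on the smallest singular value.
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    min_allowed = torch.clamp(S.max() / max_cond, min=sigma_floor)
    S = torch.clamp(S, min=min_allowed)
    return ((U * S) @ Vh).to(weight.dtype)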
Training Infrastructure
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled (non-reentrant)
- Flash Attention: Auto-enabled
- TF32 Mode: Auto
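These flags correspond to standard PyTorch/transformers switches; a hypothetical sketch of how they could be set (not the original training harness):

import torch
from transformers import AutoModelForCausalLM

torch.backends.cuda.matmul.allow_tf32 = True   # TF32 matmul mode
torch.backends.cudnn.allow_tf32 = True

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-37000",
    torch_dtype=torch.bfloat16,                # BF16 weights/activations
    # attention backend (e.g. FlashAttention via SDPA) left to the library default
)
# Non-reentrant gradient checkpointing, as listed above
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)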
Spectral Health (Step 37K)
Analysis of weight-matrix conditioning shows healthy, well-conditioned spectra throughout the network:
Condition Numbers
- Attention Layers:
- Median: 349.79 → EXCELLENT
- Mean: 784.46
- P95: 2000.15 (at spectral clamp ceiling)
- Max: 2000.26
- MLP Layers:
- Median: 4.70 → EXCELLENT
- Mean: 4.86
- Max: 8.40
- LM Head: 261.88 → GOOD
Singular Values
- Smallest σ_min: 5.42e-4 (well above the σ_floor of 1e-4)
- Top 5 smallest:
- layers.0.self_attn.o_proj: 5.42e-4
- layers.5.self_attn.o_proj: 5.44e-4
- layers.9.self_attn.o_proj: 6.07e-4
- layers.22.self_attn.o_proj: 6.09e-4
- layers.4.self_attn.o_proj: 6.15e-4
Key Finding: Many attention projection layers (q_proj, o_proj) are actively hitting the spectral clamp ceiling of 2000, indicating the regularization is working as intended to prevent ill-conditioning.
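A minimal sketch of how this kind of spectral audit can be reproduced from the released checkpoint (illustrative; not the exact analysis script behind the numbers above):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-37000", torch_dtype=torch.float32
)

# Report condition number and smallest singular value for every 2-D weight matrix
for name, param in model.named_parameters():
    if param.ndim != 2 or "embed_tokens" in name:
        continue
    S = torch.linalg.svdvals(param)
    print(f"{name}: cond={(S.max() / S.min()).item():.2f}, "
          f"sigma_min={S.min().item():.2e}")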
Training Dynamics (Steps 33K-37K)
- Loss (100-step avg): 2.081 → 2.146 (+0.066)
- Note: The slight increase suggests the model encountered harder examples or a shift in the data distribution
- Gradient Norm: Median 1.90, P95 4.29
- Gradient Clipping Rate: 3.6% (well-controlled)
- Learning Rate: Minimal decay (0.000219 → 0.000218)
Generation Quality
At this checkpoint the model generates coherent, fluent legal text without repetition loops, and is suited to legal/contractual content generation and analysis.
Usage
from transformers import pipeline

# Create a text-generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-37000",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text (do_sample=True is required for temperature/top_p to take effect)
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
print(outputs[0]["generated_text"])
Model Card Authors
Alea Institute
Citation
For technical details, see the paper: https://arxiv.org/abs/2504.07854
@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and spectral clamping}
}
License
Apache 2.0