KL3M 170M, 6th Gen Model, 37K Checkpoint

A 170M parameter language model trained on legal agreements using the Muon optimizer with spectral clamping.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 181.7M (170M non-embedding)
  • Training Steps: 37,000
  • Tokens Processed: 14.55 billion
  • Sequence Length: 4,096 tokens
  • Precision: BF16
  • Optimizer: Muon with spectral regularization (max condition: 2000)

Model Architecture

  • Hidden Size: 576
  • Layers: 30
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
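As a sanity check, the listed dimensions can be combined into a parameter count. The sketch below assumes tied input/output embeddings and two RMSNorm weight vectors per layer (standard Llama conventions, not confirmed by the card); under those assumptions it reproduces the stated 181.7M total.

```python
# Parameter-count sketch for the stated architecture (hypothetical breakdown,
# assuming tied embeddings and Llama-style bias-free projections).
hidden, layers, heads, kv_heads = 576, 30, 9, 3
inter, vocab = 1536, 131072
head_dim = hidden // heads            # 64
kv_dim = kv_heads * head_dim          # 192 (GQA: 3 KV heads)

embed = vocab * hidden                            # token embeddings (tied lm_head)
attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o full-rank, k/v reduced by GQA
mlp = 3 * hidden * inter                          # gate, up, down projections
norms = 2 * hidden                                # two RMSNorm weights per layer

total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
print(f"{total / 1e6:.1f}M parameters")  # -> 181.7M parameters
```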

Training Configuration

Dataset

  • Source: alea-institute/kl3m-data-sample-002-shuffled
  • Type: Legal agreements (EDGAR filings, contracts)
  • Format: Streaming, shuffled

Optimizer (Muon)

  • Muon Learning Rate: 2.19e-4 (depth-scaled from 3e-4)
  • Auxiliary Learning Rate: 3e-4
  • Muon Weight Decay: 1e-4
  • Auxiliary Weight Decay: 1e-3
  • Muon Momentum: 0.95
  • Batch Size: 6 per device
  • Gradient Accumulation: 16 steps (effective batch: 96)
  • Warmup Steps: 1,000
  • LR Scheduler: Cosine with min ratio 0.1
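The schedule above can be sketched as linear warmup followed by cosine decay to 10% of peak. Note two assumptions: the total-step horizon is not stated (the near-flat LR at step 37K implies training continues well past this checkpoint, so `total` below is a placeholder), and while the peak 2.19e-4 happens to match a sqrt depth-scaling of 3e-4 by sqrt(16/30), that exact rule is a guess.

```python
import math

# Hypothetical cosine-with-warmup LR schedule matching the stated settings:
# 1,000 warmup steps, min ratio 0.1, peak 2.19e-4. `total` is a placeholder.
def lr_at(step, peak=2.19e-4, warmup=1_000, total=500_000, min_ratio=0.1):
    if step < warmup:                              # linear warmup from 0
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)  # 0 -> 1 over the decay phase
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * (min_ratio + (1.0 - min_ratio) * cosine)

print(f"warmup end: {lr_at(1_000):.2e}, final: {lr_at(500_000):.2e}")
# -> warmup end: 2.19e-04, final: 2.19e-05
```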

Regularization

  • Spectral Clamping:
    • Enabled on q_proj, o_proj, and lm_head
    • Max condition number: 2000
    • Sigma floor: 1e-4
    • Applied every 10 steps (every 960 samples)
  • Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0)
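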
  • Label Smoothing: 0.01
  • Entropy Regularization:
    • Entropy bonus weight: 0.005
    • Entropy target: 6.5 bits (weight: 0.005)
    • Activation norm weight: 0.001
    • Loss chunk size: 1024 tokens
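The spectral clamping described above can be sketched as an SVD-based projection: raise any singular value that would push the condition number past 2000 or fall below the 1e-4 floor. How this is wired into the actual training loop is an assumption; here it is applied directly to a weight matrix.

```python
import numpy as np

# Sketch of spectral clamping with the stated settings (max condition 2000,
# sigma floor 1e-4). Applied standalone here; the real integration is assumed.
def spectral_clamp(W, max_cond=2000.0, sigma_floor=1e-4):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)  # s is sorted descending
    floor = max(s[0] / max_cond, sigma_floor)  # tightest allowed lower bound
    s = np.maximum(s, floor)                   # raise tiny singular values
    return (U * s) @ Vt                        # reassemble the matrix

rng = np.random.default_rng(0)
W = rng.normal(size=(576, 576)) * np.geomspace(1.0, 1e-8, 576)  # ill-conditioned
Wc = spectral_clamp(W)
s = np.linalg.svd(Wc, compute_uv=False)
print(f"condition number after clamp: {s[0] / s[-1]:.1f}")  # ~2000 (the ceiling)
```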

Training Infrastructure

  • Mixed Precision: BF16
  • Gradient Checkpointing: Enabled (non-reentrant)
  • Flash Attention: Auto-enabled
  • TF32 Mode: Auto

Spectral Health (Step 37K)

Analysis of weight matrix conditioning shows excellent manifold quality:

Condition Numbers

  • Attention Layers:
    • Median: 349.79 ✓ EXCELLENT
    • Mean: 784.46
    • P95: 2000.15 (at spectral clamp ceiling)
    • Max: 2000.26
  • MLP Layers:
    • Median: 4.70 ✓ EXCELLENT
    • Mean: 4.86
    • Max: 8.40
  • LM Head: 261.88 ✓ GOOD

Singular Values

  • Smallest σ_min: 5.42e-4 (well above σ_floor of 1e-4)
  • Top 5 smallest:
    • layers.0.self_attn.o_proj: 5.42e-4
    • layers.5.self_attn.o_proj: 5.44e-4
    • layers.9.self_attn.o_proj: 6.07e-4
    • layers.22.self_attn.o_proj: 6.09e-4
    • layers.4.self_attn.o_proj: 6.15e-4

Key Finding: Many attention projection layers (q_proj, o_proj) are actively hitting the spectral clamp ceiling of 2000, indicating the regularization is working as intended to prevent ill-conditioning.
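The audit behind these numbers can be reproduced with a short loop: compute each matrix's condition number (σ_max / σ_min) and summarize with median, mean, and p95. In the sketch below, random matrices stand in for real checkpoint weights and the layer names are illustrative.

```python
import numpy as np

# Hypothetical spectral-health audit: per-matrix condition numbers plus
# summary statistics, as reported in the table above.
rng = np.random.default_rng(1)
weights = {f"layers.{i}.self_attn.o_proj": rng.normal(size=(128, 128))
           for i in range(4)}  # stand-ins for checkpoint weight matrices

conds = np.array([
    (s := np.linalg.svd(W, compute_uv=False))[0] / s[-1]  # sigma_max / sigma_min
    for W in weights.values()
])
print(f"median {np.median(conds):.1f}  mean {conds.mean():.1f}  "
      f"p95 {np.percentile(conds, 95):.1f}")
```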

Training Dynamics (Steps 33K-37K)

  • Loss (100-step avg): 2.081 → 2.146 (+0.066)
    • Note: Slight increase suggests model encountered harder examples or dataset shift
  • Gradient Norm: Median 1.90, P95 4.29
  • Gradient Clipping Rate: 3.6% (well-controlled)
  • Learning Rate: Minimal decay (0.000219 → 0.000218)
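The 3.6% clipping rate pairs with the adaptive gradient clipping listed under Regularization. One plausible reading of "β=0.9, coeff=2.0" is clipping the global gradient norm whenever it exceeds coeff times an exponential moving average of recent norms; this interpretation is an assumption, and the training code may differ.

```python
# Hypothetical adaptive gradient clipping: clip when the current gradient norm
# exceeds coeff * EMA(past norms), with EMA decay beta = 0.9.
def adaptive_clip(grad_norms, beta=0.9, coeff=2.0):
    ema, clipped, scales = grad_norms[0], 0, []
    for g in grad_norms:
        limit = coeff * ema
        if g > limit:                      # spike: rescale gradient to the limit
            scales.append(limit / g)
            clipped += 1
            g = limit
        else:
            scales.append(1.0)
        ema = beta * ema + (1 - beta) * g  # update running norm estimate
    return scales, clipped / len(grad_norms)

# A single spike among typical ~1.9-norm gradients gets clipped.
scales, rate = adaptive_clip([1.9, 2.0, 1.8, 9.0, 2.1, 1.9, 2.0, 1.8])
print(f"clip rate: {rate:.1%}")  # -> clip rate: 12.5%
```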

Generation Quality

The checkpoint generates coherent, fluent legal text with no observed repetition issues, making it suitable for legal and contractual content generation and analysis.

Usage

```python
from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-37000",
    torch_dtype="auto",
    device_map="auto",
)

# Generate text
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

print(outputs[0]["generated_text"])
```

Model Card Authors

Alea Institute

Citation

For technical details, see the paper: https://arxiv.org/abs/2504.07854

```bibtex
@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and spectral clamping}
}
```

License

Apache 2.0
