# A100 Large Scale Training Guide

This guide provides configurations and instructions for running full-fledged experiments with multiple passes over the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.

## Available Configurations

### 1. A100 Large Batch Configuration

**File**: `config/train_smollm3_openhermes_fr_a100_large.py`

**Key Features**:
- **Effective Batch Size**: 128 (8 × 16 gradient accumulation)
- **Training Duration**: ~1.3 passes (8,000 steps)
- **Learning Rate**: 5e-6 (optimized for large batches)
- **Mixed Precision**: bf16 (A100 optimized)
- **Sequence Length**: 8192 tokens
- **Memory Optimizations**: Gradient checkpointing disabled to maximize A100 throughput

**Estimated Training Time**: ~6-8 hours on A100

### 2. Multiple Passes Configuration

**File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`

**Key Features**:
- **Effective Batch Size**: 120 (6 × 20 gradient accumulation)
- **Training Duration**: ~4 passes (25,000 steps)
- **Learning Rate**: 3e-6 (conservative for long training)
- **Warmup Steps**: 2,000 (longer warmup for stability)
- **Checkpoint Strategy**: More frequent saves (every 2,000 steps)

**Estimated Training Time**: ~20-24 hours on A100

## Training Commands

### Quick Start - Large Batch Experiment

```bash
python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --experiment-name "smollm3_openhermes_fr_large_batch" \
    --output-dir ./outputs/large_batch
```

### Multiple Passes Experiment

```bash
python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --experiment-name "smollm3_openhermes_fr_multiple_passes" \
    --output-dir ./outputs/multiple_passes
```

### Dry Run (Check Configuration)

```bash
python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_large.py \
    --dry-run
```

### Resume Training

```bash
python run_a100_large_experiment.py \
    --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
    --resume ./outputs/multiple_passes/checkpoint-10000 \
    --output-dir ./outputs/multiple_passes
```

## Configuration Details

### Memory Usage Optimization
- **Gradient Checkpointing**: Disabled for A100 efficiency
- **Flash Attention**: Enabled for memory efficiency
- **bf16 Mixed Precision**: Better suited to A100 than fp16
- **Gradient Clipping**: 1.0 for stability
- **Group by Length**: Enabled for better batching

### Data Loading Optimization
- **Num Workers**: 8 for faster data loading
- **Pin Memory**: Enabled for efficient GPU transfers
- **Prefetch Factor**: 2 for pipeline optimization

### Training Stability
- **Conservative Learning Rate**: Lower LR for large effective batch sizes
- **Longer Warmup**: More warmup steps for stability
- **Higher Beta2**: 0.999 for AdamW stability
- **Gradient Clipping**: Prevents gradient explosion
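Taken together, these settings map roughly onto standard Hugging Face `TrainingArguments`. The sketch below is an illustrative equivalent of the multiple-passes settings, using only values quoted in this guide; the actual schema used by `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` may differ.

```python
# Rough TrainingArguments equivalent of the multiple-passes settings above.
# Values come from this guide; the repo's own config schema may differ.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/multiple_passes",
    per_device_train_batch_size=6,    # effective batch = 6 * 20 = 120
    gradient_accumulation_steps=20,
    max_steps=25_000,
    learning_rate=3e-6,
    warmup_steps=2_000,
    adam_beta2=0.999,                 # higher beta2 for AdamW stability
    max_grad_norm=1.0,                # gradient clipping
    bf16=True,                        # A100-friendly mixed precision
    gradient_checkpointing=False,     # disabled: A100 80GB has headroom
    group_by_length=True,             # batch similar sequence lengths together
    dataloader_num_workers=8,
    dataloader_pin_memory=True,
    dataloader_prefetch_factor=2,
    logging_steps=25,
    save_steps=2_000,
)

# Flash attention is enabled at model load time rather than in TrainingArguments, e.g.:
# AutoModelForCausalLM.from_pretrained(model_id, attn_implementation="flash_attention_2")
```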
## Expected Results

### Large Batch Configuration (1.3 passes)
- **Training Steps**: 8,000
- **Effective Batch Size**: 128
- **Steps per Epoch**: ~6,250
- **Epochs**: ~1.3
- **Expected Loss**: Should converge to ~1.5-2.0

### Multiple Passes Configuration (4 passes)
- **Training Steps**: 25,000
- **Effective Batch Size**: 120
- **Steps per Epoch**: ~6,667
- **Epochs**: ~3.75
- **Expected Loss**: Should converge to ~1.2-1.5

## Monitoring and Logging

### Trackio Integration

Both configurations include Trackio monitoring:
- **Metrics Logging**: Every 25-50 steps
- **Artifact Logging**: Model checkpoints
- **Config Logging**: Training configuration

### Checkpoint Strategy
- **Large Batch**: Save every 1,000 steps (8 checkpoints)
- **Multiple Passes**: Save every 2,000 steps (~12 checkpoints)
- **Best Model**: Automatically load the best model at the end of training

## Hardware Requirements

### Minimum Requirements
- **GPU**: A100 80GB (or multiple A100s)
- **RAM**: 64GB+ system RAM
- **Storage**: 100GB+ for checkpoints and logs
- **Network**: Fast internet connection for dataset download

### Recommended Setup
- **GPU**: 2-4x A100 80GB
- **RAM**: 128GB+ system RAM
- **Storage**: 500GB+ NVMe SSD
- **Network**: 10Gbps+ connection

## Troubleshooting

### Out of Memory (OOM)

If you encounter OOM errors:
1. Reduce `batch_size` from 8 to 6 or 4
2. Increase `gradient_accumulation_steps` to maintain the effective batch size
3. Reduce `max_seq_length` from 8192 to 4096

### Slow Training

If training is too slow:
1. Increase `dataloader_num_workers` to 12-16
2. Ensure you're using bf16 mixed precision
3. Check that gradient checkpointing is disabled
4. Verify that flash attention is enabled

### Convergence Issues

If the loss doesn't converge:
1. Reduce the learning rate by 2x
2. Increase warmup steps
3. Check gradient norms in the logs
4. Verify dataset quality

## Customization

### For Different Dataset Sizes

Adjust `max_iters` based on your dataset size:

```python
# For 1M datapoints with effective batch size 120
steps_per_epoch = 1000000 // 120  # ~8,333 steps
max_iters = steps_per_epoch * desired_epochs
```

### For Different GPU Memory

Adjust batch size and gradient accumulation:

```python
# For a 40GB A100
batch_size = 4
gradient_accumulation_steps = 32  # Effective batch size = 128

# For a 24GB GPU
batch_size = 2
gradient_accumulation_steps = 64  # Effective batch size = 128
```

## Performance Tips

1. **Use bf16**: Better than fp16 on A100
2. **Disable Gradient Checkpointing**: A100 has enough memory
3. **Use Flash Attention**: Memory-efficient attention
4. **Group by Length**: Better batching efficiency
5. **Pin Memory**: Faster GPU transfers
6. **Multiple Workers**: Faster data loading

## Expected Timeline

- **Large Batch**: 6-8 hours for 1.3 passes
- **Multiple Passes**: 20-24 hours for 4 passes
- **Full Dataset (5+ passes)**: 30+ hours

## Next Steps

After training completes:
1. Evaluate on the validation set
2. Test generation quality
3. Push to the Hugging Face Hub
4. Deploy for inference

For deployment instructions, see `DEPLOYMENT_GUIDE.md`.
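Before pushing to the Hub or deploying, a quick manual generation check can catch obvious problems. The snippet below is a minimal sketch, assuming the checkpoint directory (here a hypothetical `checkpoint-24000`) contains both tokenizer and model files and loads with the standard `transformers` Auto classes:

```python
# Minimal post-training sanity check: load a fine-tuned checkpoint and
# generate a short French reply. The checkpoint path is hypothetical;
# point it at whichever checkpoint you want to inspect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "./outputs/multiple_passes/checkpoint-24000"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explique la photosynthèse en deux phrases."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```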