InfraMind-DAPO: Infrastructure-as-Code Model with Direct Advantage Policy Optimization
InfraMind-DAPO is a 0.5B-parameter language model fine-tuned for Infrastructure-as-Code (IaC) generation with DAPO (Direct Advantage Policy Optimization), an advanced reinforcement learning technique that builds on GRPO.
Model Description
| Attribute | Value |
|---|---|
| Base Model | inframind-0.5b-grpo |
| Original Base | Qwen/Qwen2.5-0.5B-Instruct |
| Parameters | 500M |
| Training Method | DAPO (Direct Advantage Policy Optimization) |
| Domain | Infrastructure-as-Code |
| License | MIT |
Training Pipeline
Qwen2.5-0.5B-Instruct → GRPO Training (Stage 1) → inframind-grpo → DAPO Training (Stage 2, this model) → inframind-dapo
This model is the second stage of InfraMind training: it starts from the GRPO-trained checkpoint and applies DAPO's innovations to further refine generation quality.
What is DAPO?
Direct Advantage Policy Optimization (DAPO) is an advanced RL algorithm that improves upon GRPO with four key innovations (a short code sketch follows the table):
| Innovation | Description | Benefit |
|---|---|---|
| Clip-Higher | Asymmetric clipping (ε_low=0.2, ε_high=0.28) | Allows high-advantage tokens to be reinforced more strongly |
| Dynamic Sampling | Skip batches with uniform rewards | Prevents entropy collapse, maintains exploration |
| Token-Level Loss | Per-token policy gradient | Finer-grained credit assignment |
| Overlong Punishment | Soft length penalty | Prevents verbose, repetitive outputs |
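To make Clip-Higher and the token-level loss concrete, here is a minimal PyTorch sketch; the tensor names and shapes are assumptions for illustration, not the actual training code:

```python
import torch

def dapo_surrogate_loss(logp_new, logp_old, advantages, mask,
                        eps_low=0.2, eps_high=0.28):
    """Illustrative DAPO-style clipped surrogate (not the released trainer code)."""
    # Per-token importance ratio between the current and old policy
    ratio = torch.exp(logp_new - logp_old)
    # Clip-Higher: the wider upper bound (1 + eps_high) lets high-advantage
    # tokens be reinforced more strongly than a symmetric clip would allow
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    # Token-level loss: average over all valid tokens in the batch rather than
    # per sequence, giving finer-grained credit assignment
    return -(surrogate * mask).sum() / mask.sum()
```

Here `logp_new`, `logp_old`, `advantages`, and `mask` would be per-token tensors of shape `(batch, sequence_length)`, with `mask` marking completion tokens.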
Why DAPO After GRPO?
| Stage | Method | Purpose |
|---|---|---|
| Stage 1 | GRPO | Establish IaC generation capability from base model |
| Stage 2 | DAPO | Refine with advanced techniques for quality improvement |
Evaluation Results
| Model | Training Method | Accuracy | Pass Threshold |
|---|---|---|---|
| inframind-grpo | GRPO | 97.3% | 0.6 |
| inframind-dapo | DAPO | 96.4% | 0.6 |
| Base (Qwen2.5-0.5B) | None | ~30% | 0.6 |
Evaluated on InfraMind-Bench (110 held-out test samples) across the categories below (a sketch of the pass-threshold scoring follows the list):
- Terraform (AWS, GCP, Azure)
- Kubernetes (Deployments, Services, Ingress)
- Docker (Dockerfile, docker-compose)
- CI/CD (GitHub Actions, GitLab CI)
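Accuracy here is presumably the fraction of held-out samples whose composite reward reaches the 0.6 pass threshold; a minimal sketch of that bookkeeping, with a hypothetical `score_completion` reward helper:

```python
def benchmark_accuracy(samples, score_completion, threshold=0.6):
    """Fraction of benchmark samples whose reward clears the pass threshold.

    `score_completion` is a placeholder for the composite IaC reward
    (syntax + correctness + format) described later in this card.
    """
    passed = sum(1 for sample in samples if score_completion(sample) >= threshold)
    return passed / len(samples)
```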
Quick Start
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load DAPO model
model = AutoModelForCausalLM.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
# Generate Terraform
prompt = """### Instruction:
Create Terraform for AWS EC2 instance
### Input:
t3.micro instance type
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
**inputs,
max_new_tokens=512,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
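The model expects the Alpaca-style prompt shown above; a small helper (hypothetical, not shipped with the model) keeps the template consistent across requests:

```python
def build_prompt(instruction: str, context: str = "") -> str:
    """Assemble the Alpaca-style prompt format the model was trained on."""
    prompt = f"### Instruction:\n{instruction}\n"
    if context:
        prompt += f"### Input:\n{context}\n"
    prompt += "### Response:\n"
    return prompt

prompt = build_prompt("Create Terraform for AWS EC2 instance", "t3.micro instance type")
```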
Example Output
resource "aws_instance" "web" {
ami = "ami-0c55b159cbfafe1f0"
instance_type = "t3.micro"
tags = {
Name = "web-server"
}
}
Supported IaC Categories
| Category | Examples | Coverage |
|---|---|---|
| Terraform | EC2, S3, VPC, RDS, EKS, Lambda, IAM | AWS, GCP, Azure |
| Kubernetes | Deployment, Service, Ingress, ConfigMap, RBAC | All K8s resources |
| Docker | Dockerfile, docker-compose | Multi-stage builds |
| CI/CD | GitHub Actions, GitLab CI, Jenkins | Workflows, pipelines |
| Ansible | Playbooks, roles | Server configuration |
| Helm | Charts, values.yaml | K8s package management |
Training Details
DAPO Configuration
Training:
  epochs: 2
  batch_size: 16            # effective batch size
  learning_rate: 5e-6
  beta: 0.0                 # KL coefficient - pure DAPO, no KL penalty
  generations_per_prompt: 8

DAPO Innovations:
  clip_higher:
    epsilon_low: 0.2
    epsilon_high: 0.28
  dynamic_sampling: true
  token_level_loss: true
  overlong_punishment:
    enabled: true
    soft_penalty: true

LoRA:
  r: 16
  alpha: 32
  target_modules: [q_proj, k_proj, v_proj, o_proj]
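The `dynamic_sampling` and `overlong_punishment` switches above correspond roughly to logic like the following sketch; the function names and length limits are illustrative assumptions, not the actual trainer internals:

```python
def keep_prompt_group(rewards, eps=1e-6):
    """Dynamic sampling: skip a prompt whose sampled completions all earn
    (near-)identical rewards -- the group advantage is then zero, so the
    batch is refilled with more informative prompts."""
    return max(rewards) - min(rewards) > eps

def overlong_penalty(completion_len, max_len=512, buffer=64, max_penalty=0.5):
    """Soft overlong punishment: no penalty below a buffer zone, then a
    linearly increasing penalty as the completion approaches the hard limit."""
    if completion_len <= max_len - buffer:
        return 0.0
    over = completion_len - (max_len - buffer)
    return min(max_penalty, max_penalty * over / buffer)
```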
Reward Function
Domain-specific reward for IaC quality:
Reward = α × Syntax + β × Correctness + γ × Format
Where:
- Syntax (α=0.4): Valid resource declarations
- Correctness (β=0.3): Correct resource types
- Format (γ=0.3): Proper structure
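A hedged sketch of how that weighted reward could be composed; the three scoring helpers are placeholders for the actual syntax, correctness, and format checks, which are not published in this card:

```python
def iac_reward(completion, syntax_score, correctness_score, format_score,
               alpha=0.4, beta=0.3, gamma=0.3):
    """Weighted IaC reward: alpha*Syntax + beta*Correctness + gamma*Format."""
    return (alpha * syntax_score(completion)
            + beta * correctness_score(completion)
            + gamma * format_score(completion))
```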
GRPO vs DAPO Comparison
| Aspect | GRPO | DAPO |
|---|---|---|
| KL Penalty | β=0.04 | β=0.0 (none) |
| Clipping | Symmetric | Asymmetric (Clip-Higher) |
| Loss Granularity | Sequence-level | Token-level |
| Sampling | All batches | Dynamic (skip uniform) |
| Length Control | None | Overlong punishment |
Hardware Requirements
| Deployment | Memory | GPU |
|---|---|---|
| Training | 16GB+ | A100/A10G |
| Inference | 2GB | Optional |
| Edge (Raspberry Pi 5) | 4GB | None |
The 0.5B model is small enough to run on edge devices, making it suitable for:
- Air-gapped environments
- Local development
- CI/CD pipelines
- IoT/Edge infrastructure
Limitations
- IaC-specific: Optimized for infrastructure tasks, not general conversation
- English only: Training data is in English
- No execution: Generates code, does not execute or validate against real infrastructure
- Version-sensitive: Generated code may use older API versions
- Security: Always review generated code for security best practices
Out-of-Scope Uses
- Legal or medical advice
- General-purpose chatbot
- Executing infrastructure changes without human review
- Production deployment without validation
Intended Use
Primary Use Cases
- Generating Terraform configurations
- Creating Kubernetes manifests
- Writing Dockerfiles and docker-compose
- Building CI/CD pipelines
- Infrastructure automation scripting
Users
- DevOps engineers
- Platform engineers
- SREs
- Cloud architects
- Infrastructure developers
Training Data
InfraMind-Bench: 2000+ IaC tasks in Alpaca format
| Category | Tasks |
|---|---|
| Terraform | 500+ |
| Kubernetes | 400+ |
| Docker | 300+ |
| CI/CD | 300+ |
| Ansible | 200+ |
| Helm | 150+ |
| Monitoring | 150+ |
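For reference, an Alpaca-format record in a dataset like this typically looks like the following; the field values are illustrative, not an actual benchmark entry:

```python
sample = {
    "instruction": "Create Terraform for AWS EC2 instance",
    "input": "t3.micro instance type",
    "output": 'resource "aws_instance" "web" {\n  instance_type = "t3.micro"\n}',
}
```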
Ethical Considerations
- Model may generate insecure configurations if not prompted for security
- Generated infrastructure code should always be reviewed before deployment
- Model does not have access to real infrastructure or credentials
- Users are responsible for validating generated code against their security policies
Citation
@misc{rallabandi2024inframind,
title={InfraMind: Fine-tuning Small Language Models for Infrastructure-as-Code Generation with Reinforcement Learning},
author={Rallabandi, Sai Kiran},
year={2024},
publisher={HuggingFace},
url={https://huggingface.co/srallabandi0225/inframind-0.5b-dapo}
}
Links
- GitHub: github.com/saikiranrallabandi/inframind
- GRPO Model: srallabandi0225/inframind-0.5b-grpo
- DAPO Model: srallabandi0225/inframind-0.5b-dapo
Acknowledgments
- Qwen Team for the base model
- DeepSeek for GRPO
- NVIDIA NeMo for DAPO reference
- TRL for training infrastructure
Model Card Contact
Author: Sai Kiran Rallabandi
GitHub: @saikiranrallabandi