InfraMind-DAPO: Infrastructure-as-Code Model with Direct Advantage Policy Optimization

InfraMind-DAPO is a 0.5B-parameter language model fine-tuned for Infrastructure-as-Code (IaC) generation using DAPO (Direct Advantage Policy Optimization), an advanced reinforcement learning technique that builds upon GRPO (Group Relative Policy Optimization).

Model Description

| Attribute | Value |
|---|---|
| Base Model | inframind-0.5b-grpo |
| Original Base | Qwen/Qwen2.5-0.5B-Instruct |
| Parameters | 500M |
| Training Method | DAPO (Direct Advantage Policy Optimization) |
| Domain | Infrastructure-as-Code |
| License | MIT |

Training Pipeline

Qwen2.5-0.5B-Instruct → GRPO Training (Stage 1) → inframind-grpo → DAPO Training (Stage 2, this model) → inframind-dapo

This model is the second stage of InfraMind training, starting from the GRPO-trained checkpoint and applying DAPO innovations for enhanced learning.

What is DAPO?

Direct Advantage Policy Optimization (DAPO) is an advanced RL algorithm that improves upon GRPO with four key innovations:

| Innovation | Description | Benefit |
|---|---|---|
| Clip-Higher | Asymmetric clipping (ε_low=0.2, ε_high=0.28) | Allows high-advantage tokens to be reinforced more strongly |
| Dynamic Sampling | Skip batches with uniform rewards | Prevents entropy collapse, maintains exploration |
| Token-Level Loss | Per-token policy gradient | Finer-grained credit assignment |
| Overlong Punishment | Soft length penalty | Prevents verbose, repetitive outputs |
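
To make Clip-Higher and the token-level loss concrete, here is a minimal PyTorch sketch of the asymmetric clipped objective. It is not the training code used for this model; the function name, tensor shapes, and masking convention are illustrative assumptions.

```python
import torch

def dapo_token_loss(logp_new, logp_old, advantages, mask,
                    eps_low=0.2, eps_high=0.28):
    # logp_new/logp_old: per-token log-probs (batch, seq) from the current
    # and behavior policies; advantages/mask: per-token advantages and a
    # 0/1 mask over response tokens (shapes are assumptions for this sketch).
    ratio = torch.exp(logp_new - logp_old)
    # Clip-Higher: asymmetric bounds let positive-advantage tokens move further up
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.min(ratio * advantages, clipped * advantages)
    # Token-level loss: average over all valid tokens rather than per sequence
    return (per_token * mask).sum() / mask.sum()
```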

Why DAPO After GRPO?

| Stage | Method | Purpose |
|---|---|---|
| Stage 1 | GRPO | Establish IaC generation capability from the base model |
| Stage 2 | DAPO | Refine with advanced techniques for quality improvement |

Evaluation Results

| Model | Training Method | Accuracy | Pass Threshold |
|---|---|---|---|
| inframind-grpo | GRPO | 97.3% | 0.6 |
| inframind-dapo | DAPO | 96.4% | 0.6 |
| Base (Qwen2.5-0.5B) | None | ~30% | 0.6 |

Evaluated on InfraMind-Bench (110 held-out test samples) across:

  • Terraform (AWS, GCP, Azure)
  • Kubernetes (Deployments, Services, Ingress)
  • Docker (Dockerfile, docker-compose)
  • CI/CD (GitHub Actions, GitLab CI)

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load DAPO model
model = AutoModelForCausalLM.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")

# Generate Terraform
prompt = """### Instruction:
Create Terraform for AWS EC2 instance
### Input:
t3.micro instance type
### Response:
"""

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example Output

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.micro"

  tags = {
    Name = "web-server"
  }
}

Supported IaC Categories

| Category | Examples | Coverage |
|---|---|---|
| Terraform | EC2, S3, VPC, RDS, EKS, Lambda, IAM | AWS, GCP, Azure |
| Kubernetes | Deployment, Service, Ingress, ConfigMap, RBAC | All K8s resources |
| Docker | Dockerfile, docker-compose | Multi-stage builds |
| CI/CD | GitHub Actions, GitLab CI, Jenkins | Workflows, pipelines |
| Ansible | Playbooks, roles | Server configuration |
| Helm | Charts, values.yaml | K8s package management |

Training Details

DAPO Configuration

Training:
  epochs: 2
  batch_size: 16 (effective)
  learning_rate: 5e-6
  beta (KL): 0.0  # Pure DAPO - no KL penalty
  generations_per_prompt: 8

DAPO Innovations:
  clip_higher:
    epsilon_low: 0.2
    epsilon_high: 0.28
  dynamic_sampling: true
  token_level_loss: true
  overlong_punishment:
    enabled: true
    soft_penalty: true

LoRA:
  r: 16
  alpha: 32
  target_modules: [q_proj, k_proj, v_proj, o_proj]
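
For reference, the LoRA settings above map roughly onto a PEFT LoraConfig like the sketch below; the dropout value and task_type are assumptions not listed in the card.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                        # rank, as listed above
    lora_alpha=32,               # alpha, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,           # assumption: not specified in the card
    task_type="CAUSAL_LM",       # assumption: standard for causal LMs
)
```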

Reward Function

Domain-specific reward for IaC quality:

Reward = α × Syntax + β × Correctness + γ × Format

Where:
- Syntax (α = 0.4): valid resource declarations
- Correctness (β = 0.3): correct resource types
- Format (γ = 0.3): proper structure
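
A minimal sketch of how such a weighted reward could be scored is shown below; the individual checks are hypothetical stand-ins for the actual syntax, correctness, and format scorers.

```python
import re

def iac_reward(completion: str, expected_resource: str) -> float:
    # Syntax (0.4): presence of a Terraform-style resource declaration
    syntax = 1.0 if re.search(r'resource\s+"\w+"\s+"\w+"', completion) else 0.0
    # Correctness (0.3): the expected resource type appears in the output
    correctness = 1.0 if expected_resource in completion else 0.0
    # Format (0.3): crude structural check via balanced braces
    fmt = 1.0 if completion.count("{") == completion.count("}") else 0.0
    return 0.4 * syntax + 0.3 * correctness + 0.3 * fmt
```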

GRPO vs DAPO Comparison

| Aspect | GRPO | DAPO |
|---|---|---|
| KL Penalty | β=0.04 | β=0.0 (none) |
| Clipping | Symmetric | Asymmetric (Clip-Higher) |
| Loss Granularity | Sequence-level | Token-level |
| Sampling | All batches | Dynamic (skip uniform) |
| Length Control | None | Overlong punishment |
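
As an illustration of the sampling difference, a DAPO-style dynamic-sampling filter might look like the following sketch (function name and threshold are hypothetical):

```python
def keep_group(rewards, min_spread=1e-6):
    # Skip a prompt's generation group when all rewards are (near-)identical:
    # the group-relative advantages would be ~0 and carry no learning signal.
    return (max(rewards) - min(rewards)) > min_spread

# With 8 generations per prompt, a group where every completion scores 1.0
# (or 0.0) would be dropped and replaced by sampling additional prompts.
```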

Hardware Requirements

| Deployment | Memory | GPU |
|---|---|---|
| Training | 16GB+ | A100/A10G |
| Inference | 2GB | Optional |
| Edge (Raspberry Pi 5) | 4GB | None |

The 0.5B model is small enough to run on edge devices, making it suitable for:

  • Air-gapped environments
  • Local development
  • CI/CD pipelines
  • IoT/Edge infrastructure
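
As a rough illustration of the edge-inference path (not an officially tested setup), the model can be loaded for CPU-only generation along these lines:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# CPU-only load for air-gapped or edge use; roughly 2 GB of weights in float32
model = AutoModelForCausalLM.from_pretrained(
    "srallabandi0225/inframind-0.5b-dapo", torch_dtype=torch.float32
).to("cpu").eval()
tokenizer = AutoTokenizer.from_pretrained("srallabandi0225/inframind-0.5b-dapo")
```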

Limitations

  • IaC-specific: Optimized for infrastructure tasks, not general conversation
  • English only: Training data is in English
  • No execution: Generates code, does not execute or validate against real infrastructure
  • Version-sensitive: Generated code may use older API versions
  • Security: Always review generated code for security best practices

Out-of-Scope Uses

  • Legal or medical advice
  • General-purpose chatbot
  • Executing infrastructure changes without human review
  • Production deployment without validation

Intended Use

Primary Use Cases

  • Generating Terraform configurations
  • Creating Kubernetes manifests
  • Writing Dockerfiles and docker-compose
  • Building CI/CD pipelines
  • Infrastructure automation scripting

Users

  • DevOps engineers
  • Platform engineers
  • SREs
  • Cloud architects
  • Infrastructure developers

Training Data

InfraMind-Bench: 2000+ IaC tasks in Alpaca format

| Category | Tasks |
|---|---|
| Terraform | 500+ |
| Kubernetes | 400+ |
| Docker | 300+ |
| CI/CD | 300+ |
| Ansible | 200+ |
| Helm | 150+ |
| Monitoring | 150+ |
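
For reference, a single Alpaca-format record would look like the following sketch (field values are made up; only the schema reflects the card):

```python
sample = {
    "instruction": "Create Terraform for an AWS S3 bucket",
    "input": "Bucket name: app-logs, versioning enabled",
    "output": 'resource "aws_s3_bucket" "app_logs" { ... }',
}
```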

Ethical Considerations

  • Model may generate insecure configurations if not prompted for security
  • Generated infrastructure code should always be reviewed before deployment
  • Model does not have access to real infrastructure or credentials
  • Users are responsible for validating generated code against their security policies

Citation

@misc{rallabandi2024inframind,
  title={InfraMind: Fine-tuning Small Language Models for Infrastructure-as-Code Generation with Reinforcement Learning},
  author={Rallabandi, Sai Kiran},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/srallabandi0225/inframind-0.5b-dapo}
}

Links

Acknowledgments

Model Card Contact

Author: Sai Kiran Rallabandi
GitHub: @saikiranrallabandi
