CAI-20B v2
What if the scaling law applied to marketing agents?
🔥 GGUF Quantizations Available: CAI-20B-v2-GGUF - Q8_0, Q5_K_M, Q4_K_M, Q4_K_S for local inference
Here's our research:
CAI-20B v2 is our first real public release: a 20B-parameter model that actually understands how to sell things. Not another one of those "AI-powered marketing assistants". We mean a model that internalized real scaling frameworks, real campaign architectures, and real creative strategy from operators who've taken brands from $40k/month to $40k/day.
This is what happens when you stop fine-tuning on blog posts and start fine-tuning on alpha.
The Thesis
Most marketing "AI" is just a wrapper around a general model that's seen a lot of internet. It knows marketing exists. It doesn't know marketing.
We asked: what if you could compress genuine operator knowledge—the stuff that takes years to learn and millions in ad spend to validate—into weights?
CAI-20B v2 is our proof of concept. 588 examples. 49M trainable parameters. Unholy amounts of RL intuition baked into the data curation.
The result? A model that doesn't give you "10 tips for better Facebook ads." It gives you the actual frameworks.
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Loads the merged bf16 weights (~42 GB; see the memory table below)
model = AutoModelForCausalLM.from_pretrained(
    "tigres2526/CAI-20B-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tigres2526/CAI-20B-v2")

# Ask it something a junior marketer couldn't answer
prompt = "I'm spending $2k/day on Meta with 1.8 ROAS. What's my scaling bottleneck?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
What's In The Box
| Property | Value |
|---|---|
| Base | GPT-OSS-20B |
| Method | LoRA (rank 32, 49.2M trainable params) |
| Training | 588 curated examples, 3 epochs, early stopped |
| Size | ~42 GB merged (bf16) |
| Vibe | Your CMO if they actually knew what they were doing |
The Training Data (The Real Alpha)
We didn't scrape marketing Twitter. We distilled proprietary frameworks from actual operators:
| Domain | What It Learned |
|---|---|
| Performance Advertising | Campaign architecture that scales. Not "boost post" energy. |
| Scaling Frameworks | The actual playbook for going from 6 to 7 figures/month |
| Creative Strategy | Hook science, UGC frameworks, iteration loops |
| Strategic Philosophy | First-principles thinking about why ads work |
Every example: manually curated, operationally validated, 400-800 tokens of dense signal.
Training Dynamics (The Nerdy Bits)
We learned some things:
| Epoch | Train Loss | Val Loss | What Happened |
|---|---|---|---|
| 1 | 892 | 758 | Model learning the vocabulary |
| 2 | 612 | 540 | Starting to get it |
| 3 | 437 | 567 | Sweet spot—saved this one |
| 4 | 318 | 620 | Overfit. Model started memorizing. |
The insight: With high-quality data, you hit diminishing returns fast. The model wanted to memorize our frameworks verbatim. We wanted generalization. Early stopping was the move.
LoRA Capacity Ratio: 145 parameters per training token. Way overparameterized for our dataset size. Future versions will either use lower rank or more data. Probably more data.
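For concreteness, here's the back-of-envelope arithmetic behind that ratio. The ~580-token average per example is our assumption for this sketch (it sits inside the 400-800 token range cited above), not a reported figure:

```python
# Back-of-envelope check on the capacity ratio. The average tokens per
# example (~580) is an assumption within the 400-800 range cited above.
trainable_params = 49.2e6          # rank-32 LoRA adapter
examples = 588
avg_tokens_per_example = 580       # assumption, not a reported number
training_tokens = examples * avg_tokens_per_example

print(trainable_params / training_tokens)  # ~144 params per training token
```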
What This Model Actually Does
Good at:
- Diagnosing scaling bottlenecks
- Campaign structure recommendations
- Creative strategy frameworks
- Identifying product-market fit signals
- Thinking like an operator, not a consultant
Not good at:
- Real-time CPM data (it's a language model, not an API)
- TikTok/Google depth (this version is Meta-heavy)
- Your specific account context (no RAG yet)
- Replacing actual taste
Sample Output
Q: "I have a new product, $50 AOV, no existing customers. How do I structure my first Meta campaign?"
A: For a cold start at $50 AOV, you're in the zone where creative is everything and audience targeting is almost nothing. Here's the structure: One CBO campaign, $100-200/day to start. 3-5 ad sets, each with a single interest stack—but honestly, go broad on at least one. The algorithm knows more than your assumptions. 3-4 creatives per ad set, each testing a different hook angle: problem-aware, solution-aware, product-focused, and social proof if you have any. Kill anything under 1.5 ROAS after $50 spend. Scale winners by duplicating, not increasing budget—Meta's delivery gets weird above 20% daily increases...
That's not generic advice. That's operator knowledge.
Running This Thing
| Setup | Memory | Notes |
|---|---|---|
| bf16 | ~42 GB | A100 80GB, or Colab Pro+ |
| 8-bit | ~22 GB | RTX 4090 territory |
| 4-bit | ~12 GB | Consumer GPUs work |
For consumer GPUs, use our GGUF quantizations: CAI-20B-v2-GGUF
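If you'd rather stay in transformers than jump to GGUF, here's a minimal 4-bit loading sketch using the standard bitsandbytes path; it assumes bitsandbytes is installed and isn't anything we ship:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Standard 4-bit NF4 quantization config (assumes bitsandbytes is installed)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tigres2526/CAI-20B-v2",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tigres2526/CAI-20B-v2")
```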
What's Next
This is v2. We're working on:
- CAI-20B v3: More data, cleaner RL signal, multi-platform coverage
- Smaller models: Distilling to 7B for accessibility
- RAG integration: Pulling real-time account data into context
- Image models: Because creative matters as much as copy
We open source because the rising tide lifts all boats, and also because we're confident our next version will be better than whatever you fine-tune on top of this one.
Tinker Checkpoint (For the Infrastructure Nerds)
We trained this on Tinker—managed infrastructure for LLM fine-tuning that doesn't make you want to mass-resign. The original LoRA checkpoint is public and available for direct use.
Checkpoint Path:
```
tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100
```
| Property | Value |
|---|---|
| Size | 460.5 MB (LoRA only) |
| Base Model | openai/gpt-oss-20b |
| LoRA Rank | 32 |
| Public | Yes |
Option 1: Run Inference on Tinker
Skip the download. Run directly on Tinker's infra:
```python
import asyncio

import tinker
from tinker import types
from tinker_cookbook import renderers, tokenizer_utils

async def main():
    service_client = tinker.ServiceClient()
    tokenizer = tokenizer_utils.get_tokenizer("openai/gpt-oss-20b")
    renderer = renderers.get_renderer("gpt_oss_medium_reasoning", tokenizer)

    client = await service_client.create_sampling_client_async(
        model_path="tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
    )

    messages = [{"role": "user", "content": "What's the Chad Scaling process for Meta ads?"}]
    prompt = renderer.build_generation_prompt(messages)

    result = await client.sample_async(
        prompt=prompt,
        num_samples=1,
        sampling_params=types.SamplingParams(
            max_tokens=800,
            temperature=0.7,
            top_p=0.9,
            stop=renderer.get_stop_sequences(),
        ),
    )
    print(renderer.tokenizer.decode(result.sequences[0].tokens))

asyncio.run(main())
```
Option 2: Download the LoRA Weights
Want the weights locally? Pull them down:
```bash
# CLI (simple)
tinker checkpoint download tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100

# Downloads a tar archive with adapter_model.safetensors + adapter_config.json
```
Or programmatically:
```python
import tinker
import urllib.request

sc = tinker.ServiceClient()
rc = sc.create_rest_client()
response = rc.get_checkpoint_archive_url_from_tinker_path(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()
urllib.request.urlretrieve(response.url, "cai-20b-v2-lora.tar")
# Extract and use with PEFT
```
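A minimal sketch of that last step, assuming the archive extracts to a flat directory containing the adapter_config.json and adapter_model.safetensors mentioned above:

```python
import tarfile

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Unpack the downloaded archive (assumes a flat layout with
# adapter_config.json + adapter_model.safetensors inside)
with tarfile.open("cai-20b-v2-lora.tar") as tar:
    tar.extractall("cai-20b-v2-lora")

# Attach the LoRA adapter to the base model
base = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", device_map="auto")
model = PeftModel.from_pretrained(base, "cai-20b-v2-lora")
```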
Option 3: Continue Training
Load the checkpoint into a new training run:
```python
import tinker

# Same ServiceClient as in Option 1
service_client = tinker.ServiceClient()

training_client = service_client.create_lora_training_client(
    base_model="openai/gpt-oss-20b",
    rank=32,
)
training_client.load_state(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()
# Now continue training with your own data
```
Why would you do this? Maybe you have vertical-specific data (supplements, fashion, B2B SaaS) and want to specialize further. The checkpoint is public—go wild.
Evaluation Results
LLM-as-Judge Benchmark (Grok 4.1 Thinking)
We ran a pairwise evaluation comparing CAI-20B-v2 (fine-tuned) against GPT-OSS-20B (base model), using position debiasing to control for order effects in the judge.

Test Configuration:

- Judge Model: grok-4-1-fast-reasoning (xAI)
- Quantization Tested: Q4_K_M via Modal serverless inference
- Position Debiasing: Both orderings tested, (A, B) and (B, A)
- Prompts: 10 domain-specific marketing questions
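For concreteness, a minimal sketch of what that debiasing step looks like; `judge` here is a hypothetical stand-in for a call to the judge model, not part of our released code:

```python
# Hypothetical sketch of position-debiased pairwise judging.
# `judge(question, answer_a, answer_b)` stands in for a judge-model call
# that returns "A" or "B" for whichever answer it prefers.
def debiased_winner(judge, question, finetuned_answer, base_answer):
    first = judge(question, finetuned_answer, base_answer)   # (A, B) ordering
    second = judge(question, base_answer, finetuned_answer)  # (B, A) ordering
    verdict_1 = "finetuned" if first == "A" else "base"
    verdict_2 = "finetuned" if second == "B" else "base"
    # Only count a win when the judge agrees under both orderings;
    # disagreement means the preference was positional, so call it a tie
    return verdict_1 if verdict_1 == verdict_2 else "tie"
```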
Results
| Metric | Value |
|---|---|
| CAI-20B-v2 Win Rate | 70% |
| GPT-OSS-20B Win Rate | 30% |
| Ties | 0% |
| Confident Judgments | 100% |
Per-Question Breakdown
| Question | Winner | Key Finding |
|---|---|---|
| Scale FB ads $500→$5000/day | Baseline | Finetuned had repetition issues |
| Campaign structure for new brand | Finetuned | Specific budget splits ($1.5k/$2k/$1.5k) |
| Fix creative fatigue | Finetuned | 12 specific tactics vs generic guide |
| Metrics beyond ROAS | Baseline | Finetuned produced video hooks instead |
| Cold traffic ad copy | Finetuned | Clear Hook/Pain/Solution/Proof/CTA framework |
| ABO vs CBO decision | Finetuned | Precise decision trees with examples |
| Finding winning audiences | Finetuned | 10-step process vs off-topic response |
| Systematic creative testing | Finetuned | 8-step methodology with A/B testing specifics |
| Landing page conversion | Finetuned | 13 tactics with concrete examples |
| New product launch playbook | Baseline | Finetuned produced sales copy |
What We Learned
Where fine-tuning helped:
- Framework-based thinking (Hook/Pain/Solution/Proof/CTA)
- Specific budget allocations and metrics
- Platform-specific tactics (Dynamic Creative, frequency caps)
- Decision frameworks for common choices (ABO vs CBO)
Where it hurt:
- Occasional repetition loops
- Some prompts trigger promotional outputs
The 70% win rate (7 of 10 prompts) supports the thesis: domain-specific fine-tuning on high-quality operator data outperforms the base model on marketing tasks.
Citation
```bibtex
@misc{cai-20b-v2,
  title={CAI-20B v2: Scaling Laws for Marketing Intelligence},
  author={Caistro Labs},
  year={2024},
  url={https://huggingface.co/tigres2526/CAI-20B-v2}
}
```
Built by Caistro Labs. We're building autonomous creative strategists—the first of their kind. Marketing gods from gradient descent.
MIT License. Do whatever you want with it.