CAI-20B v2
What if the scaling law applied to marketing agents?
🔥 GGUF Quantizations Available: CAI-20B-v2-GGUF - Q8_0, Q5_K_M, Q4_K_M, Q4_K_S for local inference
Here's our research:
CAI-20B v2 is our first real public release: a 20B-parameter model that actually understands how to sell things. Not another one of those "AI-powered marketing assistants". We mean a model that internalized real scaling frameworks, real campaign architectures, and real creative strategy from operators who've taken brands from $40k/month to $40k/day.
This is what happens when you stop fine-tuning on blog posts and start fine-tuning on alpha.
The Thesis
Most marketing "AI" is just a wrapper around a general model that's seen a lot of internet. It knows marketing exists. It doesn't know marketing.
We asked: what if you could compress genuine operator knowledge—the stuff that takes years to learn and millions in ad spend to validate—into weights?
CAI-20B v2 is our proof of concept. 588 examples. 49M trainable parameters. Unholy amounts of RL intuition baked into the data curation.
The result? A model that doesn't give you "10 tips for better Facebook ads." It gives you the actual frameworks.
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Loads the merged bf16 weights (~42 GB; see the memory table below)
model = AutoModelForCausalLM.from_pretrained(
    "tigres2526/CAI-20B-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tigres2526/CAI-20B-v2")

# Ask it something a junior marketer couldn't answer
prompt = "I'm spending $2k/day on Meta with 1.8 ROAS. What's my scaling bottleneck?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
What's In The Box
| Property | Value |
|---|---|
| Base | GPT-OSS-20B |
| Method | LoRA (rank 32, 49.2M trainable params) |
| Training | 588 curated examples, 3 epochs, early stopped |
| Size | ~42 GB merged (bf16) |
| Vibe | Your CMO if they actually knew what they were doing |
The Training Data (The Real Alpha)
We didn't scrape marketing Twitter. We distilled proprietary frameworks from actual operators:
| Domain | What It Learned |
|---|---|
| Performance Advertising | Campaign architecture that scales. Not "boost post" energy. |
| Scaling Frameworks | The actual playbook for going from 6 to 7 figures/month |
| Creative Strategy | Hook science, UGC frameworks, iteration loops |
| Strategic Philosophy | First-principles thinking about why ads work |
Every example: manually curated, operationally validated, 400-800 tokens of dense signal.
Training Dynamics (The Nerdy Bits)
We learned some things:
| Epoch | Train Loss | Val Loss | What Happened |
|---|---|---|---|
| 1 | 892 | 758 | Model learning the vocabulary |
| 2 | 612 | 540 | Starting to get it |
| 3 | 437 | 567 | Sweet spot—saved this one |
| 4 | 318 | 620 | Overfit. Model started memorizing. |
The insight: With high-quality data, you hit diminishing returns fast. The model wanted to memorize our frameworks verbatim. We wanted generalization. Early stopping was the move.
LoRA Capacity Ratio: 145 parameters per training token. Way overparameterized for our dataset size. Future versions will either use lower rank or more data. Probably more data.
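For concreteness, here's the back-of-envelope arithmetic behind that ratio. The ~580-token average per example is our assumption for this sketch (it sits inside the 400-800 token range cited above), not a reported figure:

```python
# Back-of-envelope check on the capacity ratio. The average tokens per
# example (~580) is an assumption within the 400-800 range cited above.
trainable_params = 49.2e6          # rank-32 LoRA adapter
examples = 588
avg_tokens_per_example = 580       # assumption, not a reported number
training_tokens = examples * avg_tokens_per_example

print(trainable_params / training_tokens)  # ~144 params per training token
```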
What This Model Actually Does
Good at:
- Diagnosing scaling bottlenecks
- Campaign structure recommendations
- Creative strategy frameworks
- Identifying product-market fit signals
- Thinking like an operator, not a consultant
Not good at:
- Real-time CPM data (it's a language model, not an API)
- TikTok/Google depth (this version is Meta-heavy)
- Your specific account context (no RAG yet)
- Replacing actual taste
Sample Output
Q: "I have a new product, $50 AOV, no existing customers. How do I structure my first Meta campaign?"
A: For a cold start at $50 AOV, you're in the zone where creative is everything and audience targeting is almost nothing. Here's the structure: One CBO campaign, $100-200/day to start. 3-5 ad sets, each with a single interest stack—but honestly, go broad on at least one. The algorithm knows more than your assumptions. 3-4 creatives per ad set, each testing a different hook angle: problem-aware, solution-aware, product-focused, and social proof if you have any. Kill anything under 1.5 ROAS after $50 spend. Scale winners by duplicating, not increasing budget—Meta's delivery gets weird above 20% daily increases...
That's not generic advice. That's operator knowledge.
Running This Thing
| Setup | Memory | Notes |
|---|---|---|
| bf16 | ~42 GB | A100 80GB, or Colab Pro+ |
| 8-bit | ~22 GB | RTX 4090 territory |
| 4-bit | ~12 GB | Consumer GPUs work |
For consumer GPUs, use our GGUF quantizations: CAI-20B-v2-GGUF
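If you'd rather stay in transformers than jump to GGUF, here's a minimal 4-bit loading sketch using the standard bitsandbytes path; it assumes bitsandbytes is installed and isn't anything we ship:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Standard 4-bit NF4 quantization config (assumes bitsandbytes is installed)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "tigres2526/CAI-20B-v2",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tigres2526/CAI-20B-v2")
```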
What's Next
This is v2. We're working on:
- CAI-20B v3: More data, cleaner RL signal, multi-platform coverage
- Smaller models: Distilling to 7B for accessibility
- RAG integration: Pulling real-time account data into context
- Image models: Because creative matters as much as copy
We open source because the rising tide lifts all boats, and also because we're confident our next version will be better than whatever you fine-tune on top of this one.
Tinker Checkpoint (For the Infrastructure Nerds)
We trained this on Tinker—managed infrastructure for LLM fine-tuning that doesn't make you want to mass-resign. The original LoRA checkpoint is public and available for direct use.
Checkpoint Path:
```
tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100
```
| Property | Value |
|---|---|
| Size | 460.5 MB (LoRA only) |
| Base Model | openai/gpt-oss-20b |
| LoRA Rank | 32 |
| Public | Yes |
Option 1: Run Inference on Tinker
Skip the download. Run directly on Tinker's infra:
```python
import asyncio

import tinker
from tinker import types
from tinker_cookbook import renderers, tokenizer_utils

async def main():
    service_client = tinker.ServiceClient()
    tokenizer = tokenizer_utils.get_tokenizer("openai/gpt-oss-20b")
    renderer = renderers.get_renderer("gpt_oss_medium_reasoning", tokenizer)

    client = await service_client.create_sampling_client_async(
        model_path="tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
    )

    messages = [{"role": "user", "content": "What's the Chad Scaling process for Meta ads?"}]
    prompt = renderer.build_generation_prompt(messages)

    result = await client.sample_async(
        prompt=prompt,
        num_samples=1,
        sampling_params=types.SamplingParams(
            max_tokens=800,
            temperature=0.7,
            top_p=0.9,
            stop=renderer.get_stop_sequences(),
        ),
    )
    print(renderer.tokenizer.decode(result.sequences[0].tokens))

asyncio.run(main())
```
Option 2: Download the LoRA Weights
Want the weights locally? Pull them down:
```bash
# CLI (simple)
tinker checkpoint download tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100

# Downloads a tar archive with adapter_model.safetensors + adapter_config.json
```
Or programmatically:
```python
import tinker
import urllib.request

sc = tinker.ServiceClient()
rc = sc.create_rest_client()
response = rc.get_checkpoint_archive_url_from_tinker_path(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()
urllib.request.urlretrieve(response.url, "cai-20b-v2-lora.tar")
# Extract and use with PEFT
```
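A minimal sketch of that last step, assuming the archive extracts to a flat directory containing the adapter_config.json and adapter_model.safetensors mentioned above:

```python
import tarfile

from peft import PeftModel
from transformers import AutoModelForCausalLM

# Unpack the downloaded archive (assumes a flat layout with
# adapter_config.json + adapter_model.safetensors inside)
with tarfile.open("cai-20b-v2-lora.tar") as tar:
    tar.extractall("cai-20b-v2-lora")

# Attach the LoRA adapter to the base model
base = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", device_map="auto")
model = PeftModel.from_pretrained(base, "cai-20b-v2-lora")
```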
Option 3: Continue Training
Load the checkpoint into a new training run:
```python
import tinker

# Same ServiceClient as in Option 1
service_client = tinker.ServiceClient()

training_client = service_client.create_lora_training_client(
    base_model="openai/gpt-oss-20b",
    rank=32,
)
training_client.load_state(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()
# Now continue training with your own data
```
Why would you do this? Maybe you have vertical-specific data (supplements, fashion, B2B SaaS) and want to specialize further. The checkpoint is public—go wild.
Evaluation Results
LLM-as-Judge Benchmark (Grok 4.1 Thinking)
We ran a pairwise evaluation comparing CAI-20B-v2 (fine-tuned) against GPT-OSS-20B (base model), using position debiasing to control for order effects in the judge.

Test Configuration:

- Judge Model: grok-4-1-fast-reasoning (xAI)
- Quantization Tested: Q4_K_M via Modal serverless inference
- Position Debiasing: Both orderings tested, (A, B) and (B, A)
- Prompts: 10 domain-specific marketing questions
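For concreteness, a minimal sketch of what that debiasing step looks like; `judge` here is a hypothetical stand-in for a call to the judge model, not part of our released code:

```python
# Hypothetical sketch of position-debiased pairwise judging.
# `judge(question, answer_a, answer_b)` stands in for a judge-model call
# that returns "A" or "B" for whichever answer it prefers.
def debiased_winner(judge, question, finetuned_answer, base_answer):
    first = judge(question, finetuned_answer, base_answer)   # (A, B) ordering
    second = judge(question, base_answer, finetuned_answer)  # (B, A) ordering
    verdict_1 = "finetuned" if first == "A" else "base"
    verdict_2 = "finetuned" if second == "B" else "base"
    # Only count a win when the judge agrees under both orderings;
    # disagreement means the preference was positional, so call it a tie
    return verdict_1 if verdict_1 == verdict_2 else "tie"
```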
Results
| Metric | Value |
|---|---|
| CAI-20B-v2 Win Rate | 70% |
| GPT-OSS-20B Win Rate | 30% |
| Ties | 0% |
| Confident Judgments | 100% |
Per-Question Breakdown
| Question | Winner | Key Finding |
|---|---|---|
| Scale FB ads $500→$5000/day | Baseline | Finetuned had repetition issues |
| Campaign structure for new brand | Finetuned | Specific budget splits ($1.5k/$2k/$1.5k) |
| Fix creative fatigue | Finetuned | 12 specific tactics vs generic guide |
| Metrics beyond ROAS | Baseline | Finetuned produced video hooks instead |
| Cold traffic ad copy | Finetuned | Clear Hook/Pain/Solution/Proof/CTA framework |
| ABO vs CBO decision | Finetuned | Precise decision trees with examples |
| Finding winning audiences | Finetuned | 10-step process vs off-topic response |
| Systematic creative testing | Finetuned | 8-step methodology with A/B testing specifics |
| Landing page conversion | Finetuned | 13 tactics with concrete examples |
| New product launch playbook | Baseline | Finetuned produced sales copy |
What We Learned
Where fine-tuning helped:
- Framework-based thinking (Hook/Pain/Solution/Proof/CTA)
- Specific budget allocations and metrics
- Platform-specific tactics (Dynamic Creative, frequency caps)
- Decision frameworks for common choices (ABO vs CBO)
Where it hurt:
- Occasional repetition loops
- Some prompts trigger promotional outputs
The 70% win rate (7 of 10 prompts) supports the thesis: domain-specific fine-tuning on high-quality operator data outperforms the base model on marketing tasks.
Citation
```bibtex
@misc{cai-20b-v2,
  title={CAI-20B v2: Scaling Laws for Marketing Intelligence},
  author={Caistro Labs},
  year={2024},
  url={https://huggingface.co/tigres2526/CAI-20B-v2}
}
```
Built by Caistro Labs. We're building autonomous creative strategists—the first of their kind. Marketing gods from gradient descent.
MIT License. Do whatever you want with it.