CAI-20B v2

What if the scaling law applied to marketing agents?

🔥 GGUF Quantizations Available: CAI-20B-v2-GGUF - Q8_0, Q5_K_M, Q4_K_M, Q4_K_S for local inference

Here's our research:

CAI-20B v2 is our first real public release—a 20B parameter model that actually understands how to sell things. Not another one of those "AI-powered marketing assistants". We mean a model that internalized real scaling frameworks, real campaign architectures, real creative strategy from operators who've taken brands from $40k/month to $40k/day.

This is what happens when you stop fine-tuning on blog posts and start fine-tuning on alpha.


The Thesis

Most marketing "AI" is just a wrapper around a general model that's seen a lot of internet. It knows marketing exists. It doesn't know marketing.

We asked: what if you could compress genuine operator knowledge—the stuff that takes years to learn and millions in ad spend to validate—into weights?

CAI-20B v2 is our proof of concept. 588 examples. 49M trainable parameters. Unholy amounts of RL intuition baked into the data curation.

The result? A model that doesn't give you "10 tips for better Facebook ads." It gives you the actual frameworks.


Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the merged bf16 weights (~42 GB); device_map="auto" places them on available devices
model = AutoModelForCausalLM.from_pretrained(
    "tigres2526/CAI-20B-v2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("tigres2526/CAI-20B-v2")

# Ask it something a junior marketer couldn't answer
prompt = "I'm spending $2k/day on Meta with 1.8 ROAS. What's my scaling bottleneck?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Sample up to 512 new tokens; the decoded text below includes the prompt itself
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What's In The Box

| Base | GPT-OSS-20B |
| Method | LoRA (rank 32, 49.2M params) |
| Training | 588 curated examples, 3 epochs, early stopped |
| Size | ~42 GB merged (bf16) |
| Vibe | Your CMO if they actually knew what they were doing |

The Training Data (The Real Alpha)

We didn't scrape marketing Twitter. We distilled proprietary frameworks from actual operators:

| Domain | What It Learned |
| --- | --- |
| Performance Advertising | Campaign architecture that scales. Not "boost post" energy. |
| Scaling Frameworks | The actual playbook for going from 6 to 7 figures/month |
| Creative Strategy | Hook science, UGC frameworks, iteration loops |
| Strategic Philosophy | First-principles thinking about why ads work |

Every example: manually curated, operationally validated, 400-800 tokens of dense signal.


Training Dynamics (The Nerdy Bits)

We learned some things:

| Epoch | Train Loss | Val Loss | What Happened |
| --- | --- | --- | --- |
| 1 | 892 | 758 | Model learning the vocabulary |
| 2 | 612 | 540 | Starting to get it |
| 3 | 437 | 567 | Sweet spot, saved this one |
| 4 | 318 | 620 | Overfit. Model started memorizing. |

The insight: With high-quality data, you hit diminishing returns fast. The model wanted to memorize our frameworks verbatim. We wanted generalization. Early stopping was the move.

LoRA Capacity Ratio: 145 parameters per training token. Way overparameterized for our dataset size. Future versions will either use lower rank or more data. Probably more data.
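
The ratio is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch, assuming the ratio is simply trainable LoRA parameters divided by total training tokens (the exact token count isn't stated here, so this backs it out from the figures that are):

```python
# Figures stated in this card.
TRAINABLE_PARAMS = 49.2e6   # LoRA, rank 32
NUM_EXAMPLES = 588
PARAMS_PER_TOKEN = 145      # the reported capacity ratio

# Assumption: ratio = trainable params / total training tokens.
implied_tokens = TRAINABLE_PARAMS / PARAMS_PER_TOKEN
implied_tokens_per_example = implied_tokens / NUM_EXAMPLES

print(f"implied training tokens: {implied_tokens:,.0f}")
print(f"implied tokens per example: {implied_tokens_per_example:,.0f}")

# Examples are described as 400-800 tokens each, so the implied
# average should land inside that range if the figures agree.
assert 400 <= implied_tokens_per_example <= 800
```

Under that assumption the numbers hang together: roughly 339k training tokens, or about 577 tokens per example, which sits inside the 400-800 token range.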


What This Model Actually Does

Good at:

  • Diagnosing scaling bottlenecks
  • Campaign structure recommendations
  • Creative strategy frameworks
  • Identifying product-market fit signals
  • Thinking like an operator, not a consultant

Not good at:

  • Real-time CPM data (it's a language model, not an API)
  • TikTok/Google depth (this version is Meta-heavy)
  • Your specific account context (no RAG yet)
  • Replacing actual taste

Sample Output

Q: "I have a new product, $50 AOV, no existing customers. How do I structure my first Meta campaign?"

A: For a cold start at $50 AOV, you're in the zone where creative is everything and audience targeting is almost nothing. Here's the structure: One CBO campaign, $100-200/day to start. 3-5 ad sets, each with a single interest stack—but honestly, go broad on at least one. The algorithm knows more than your assumptions. 3-4 creatives per ad set, each testing a different hook angle: problem-aware, solution-aware, product-focused, and social proof if you have any. Kill anything under 1.5 ROAS after $50 spend. Scale winners by duplicating, not increasing budget—Meta's delivery gets weird above 20% daily increases...

That's not generic advice. That's operator knowledge.


Running This Thing

| Setup | Memory | Notes |
| --- | --- | --- |
| bf16 | ~42 GB | A100 80GB, or Colab Pro+ |
| 8-bit | ~22 GB | RTX 4090 territory |
| 4-bit | ~12 GB | Consumer GPUs work |
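
These figures follow from parameter count times bytes per parameter. A rough sketch, assuming about 21B total parameters (inferred from the ~42 GB bf16 size at 2 bytes each) and counting weights only:

```python
# Assumption: ~21e9 total parameters, inferred from ~42 GB at 2 bytes/param (bf16).
TOTAL_PARAMS = 21e9

# Nominal storage cost per parameter for each setup.
BYTES_PER_PARAM = {"bf16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(total_params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); ignores KV cache and activations."""
    return total_params * bytes_per_param / 1e9


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name}: ~{weight_memory_gb(TOTAL_PARAMS, nbytes):.1f} GB of weights")
```

The 8-bit and 4-bit rows in the table come in slightly above these raw estimates, presumably because quantized formats also store scales and keep some layers at higher precision. Leave headroom for the KV cache on top of that.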

For consumer GPUs, use our GGUF quantizations: CAI-20B-v2-GGUF


What's Next

This is v2. We're working on:

  • CAI-20B v3: More data, cleaner RL signal, multi-platform coverage
  • Smaller models: Distilling to 7B for accessibility
  • RAG integration: Pulling real-time account data into context
  • Image models: Because creative matters as much as copy

We open source because a rising tide lifts all boats, and also because we're confident our next version will be better than whatever you fine-tune on top of this one.


Tinker Checkpoint (For the Infrastructure Nerds)

We trained this on Tinker—managed infrastructure for LLM fine-tuning that doesn't make you want to mass-resign. The original LoRA checkpoint is public and available for direct use.

Checkpoint Path:

tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100

| Property | Value |
| --- | --- |
| Size | 460.5 MB (LoRA only) |
| Base Model | openai/gpt-oss-20b |
| LoRA Rank | 32 |
| Public | Yes |

Option 1: Run Inference on Tinker

Skip the download. Run directly on Tinker's infra:

import asyncio
import tinker
from tinker import types
from tinker_cookbook import renderers, tokenizer_utils

async def main():
    service_client = tinker.ServiceClient()
    tokenizer = tokenizer_utils.get_tokenizer("openai/gpt-oss-20b")
    renderer = renderers.get_renderer("gpt_oss_medium_reasoning", tokenizer)

    client = await service_client.create_sampling_client_async(
        model_path="tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
    )

    messages = [{"role": "user", "content": "What's the Chad Scaling process for Meta ads?"}]
    prompt = renderer.build_generation_prompt(messages)

    result = await client.sample_async(
        prompt=prompt,
        num_samples=1,
        sampling_params=types.SamplingParams(
            max_tokens=800,
            temperature=0.7,
            top_p=0.9,
            stop=renderer.get_stop_sequences()
        )
    )

    print(renderer.tokenizer.decode(result.sequences[0].tokens))

asyncio.run(main())

Option 2: Download the LoRA Weights

Want the weights locally? Pull them down:

# CLI (simple)
tinker checkpoint download tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100

# Downloads a tar archive with adapter_model.safetensors + adapter_config.json

Or programmatically:

import tinker
import urllib.request

sc = tinker.ServiceClient()
rc = sc.create_rest_client()
response = rc.get_checkpoint_archive_url_from_tinker_path(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()

urllib.request.urlretrieve(response.url, "cai-20b-v2-lora.tar")
# Extract and use with PEFT
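
The "extract and use with PEFT" step could look something like the sketch below. It assumes the archive really is PEFT-compatible (adapter_model.safetensors plus adapter_config.json, as described above) and makes no assumption about folder nesting inside the tar. The helper names `extract_adapter` and `load_with_peft` are hypothetical, not part of Tinker or PEFT.

```python
import tarfile
from pathlib import Path


def extract_adapter(tar_path: str, out_dir: str) -> Path:
    """Extract the archive and return the folder holding adapter_config.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        # Use the safer "data" extraction filter where this Python supports it.
        if hasattr(tarfile, "data_filter"):
            archive.extractall(out, filter="data")
        else:
            archive.extractall(out)
    # Assumption: files may or may not sit in a subfolder, so search
    # for the config rather than hard-coding a layout.
    configs = sorted(out.rglob("adapter_config.json"))
    if not configs:
        raise FileNotFoundError("adapter_config.json not found in archive")
    return configs[0].parent


def load_with_peft(adapter_dir: Path):
    """Attach the LoRA adapter to the base model (downloads the full base weights)."""
    # Imported lazily so extraction works without these heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(
        "openai/gpt-oss-20b", torch_dtype=torch.bfloat16, device_map="auto"
    )
    return PeftModel.from_pretrained(base, str(adapter_dir))
```

Typical use would be `adapter_dir = extract_adapter("cai-20b-v2-lora.tar", "cai-20b-v2-lora")` followed by `model = load_with_peft(adapter_dir)`. If the adapter's key names don't line up with what PEFT expects for this base model, loading will fail and the weights would need remapping first.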

Option 3: Continue Training

Load the checkpoint into a new training run:

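import tinker

service_client = tinker.ServiceClient()

# Create a LoRA training client with the same base model and rank as the checkpoint
training_client = service_client.create_lora_training_client(
    base_model="openai/gpt-oss-20b",
    rank=32
)
# Load the public checkpoint into the new run
training_client.load_state(
    "tinker://8e3781a1-328c-59bf-b2e2-a2a10543ec7d:train:0/sampler_weights/best-step-00100"
).result()

# Now continue training with your own data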

Why would you do this? Maybe you have vertical-specific data (supplements, fashion, B2B SaaS) and want to specialize further. The checkpoint is public—go wild.


Evaluation Results

LLM-as-Judge Benchmark (Grok 4.1 Thinking)

We ran a pairwise evaluation of CAI-20B-v2 (fine-tuned) against GPT-OSS-20B (base model), judging every pair in both orderings to control for the judge's position bias.

Test Configuration:

  • Judge Model: grok-4-1-fast-reasoning (xAI)
  • Quantization Tested: Q4_K_M via Modal serverless inference
  • Position Debiasing: Both orderings tested (A,B) and (B,A)
  • Prompts: 10 domain-specific marketing questions
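
Position debiasing as configured above can be sketched as follows. This is an illustration, not the actual eval harness: the `Judge` callable and both toy judges are hypothetical stand-ins for the LLM judge call, and treating a disagreement between orderings as "no confident judgment" is an assumption about how such cases would be handled.

```python
from typing import Callable, Optional

# A judge takes (question, first_answer, second_answer) and returns
# "first" or "second". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def debiased_verdict(question: str, finetuned: str, baseline: str,
                     judge: Judge) -> Optional[str]:
    """Judge both orderings; return the winner only if both runs agree."""
    # Ordering (A, B): fine-tuned answer shown first.
    run_1 = "finetuned" if judge(question, finetuned, baseline) == "first" else "baseline"
    # Ordering (B, A): baseline answer shown first.
    run_2 = "baseline" if judge(question, baseline, finetuned) == "first" else "finetuned"
    # Disagreement means the verdict followed position, not content.
    return run_1 if run_1 == run_2 else None


# Toy judges for illustration only.
def longer_answer_judge(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def always_first_judge(question: str, first: str, second: str) -> str:
    return "first"  # a purely position-biased judge


q = "ABO vs CBO?"
content_based = debiased_verdict(q, "a long detailed answer", "short", longer_answer_judge)
position_based = debiased_verdict(q, "a long detailed answer", "short", always_first_judge)
print(content_based, position_based)
```

A judge that decides on content gives the same winner in both orderings. One that just prefers whichever answer comes first contradicts itself and gets filtered out.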

Results

| Metric | Value |
| --- | --- |
| CAI-20B-v2 Win Rate | 70% |
| GPT-OSS-20B Win Rate | 30% |
| Ties | 0% |
| Confident Judgments | 100% |

Per-Question Breakdown

| Question | Winner | Key Finding |
| --- | --- | --- |
| Scale FB ads $500→$5000/day | Baseline | Finetuned had repetition issues |
| Campaign structure for new brand | Finetuned | Specific budget splits ($1.5k/$2k/$1.5k) |
| Fix creative fatigue | Finetuned | 12 specific tactics vs generic guide |
| Metrics beyond ROAS | Baseline | Finetuned produced video hooks instead |
| Cold traffic ad copy | Finetuned | Clear Hook/Pain/Solution/Proof/CTA framework |
| ABO vs CBO decision | Finetuned | Precise decision trees with examples |
| Finding winning audiences | Finetuned | 10-step process vs off-topic response |
| Systematic creative testing | Finetuned | 8-step methodology with A/B testing specifics |
| Landing page conversion | Finetuned | 13 tactics with concrete examples |
| New product launch playbook | Baseline | Finetuned produced sales copy |
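
The headline win rate can be re-derived from the breakdown above. The sketch below tallies the winners and, as an extra not part of the original eval, computes a 95% Wilson score interval, assuming each prompt counts as one independent trial.

```python
import math
from collections import Counter

# Winners copied from the per-question breakdown, in order.
winners = ["Baseline", "Finetuned", "Finetuned", "Baseline", "Finetuned",
           "Finetuned", "Finetuned", "Finetuned", "Finetuned", "Baseline"]

counts = Counter(winners)
n = len(winners)
win_rate = counts["Finetuned"] / n


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


low, high = wilson_interval(counts["Finetuned"], n)
print(f"win rate: {win_rate:.0%} ({counts['Finetuned']}/{n})")
print(f"95% Wilson interval: {low:.2f} to {high:.2f}")
```

Seven of ten prompts gives the 70% figure. With only ten prompts the interval runs from roughly 0.40 to 0.89, so a larger prompt set would tighten the estimate.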

What We Learned

Where fine-tuning helped:

  • Framework-based thinking (Hook/Pain/Solution/Proof/CTA)
  • Specific budget allocations and metrics
  • Platform-specific tactics (Dynamic Creative, frequency caps)
  • Decision frameworks for common choices (ABO vs CBO)

Where it hurt:

  • Occasional repetition loops
  • Some prompts trigger promotional outputs

The 70% win rate (7 of 10 prompts) supports the thesis: domain-specific fine-tuning on high-quality operator data beat the base model on most of these marketing tasks.


Citation

@misc{cai-20b-v2,
  title={CAI-20B v2: Scaling Laws for Marketing Intelligence},
  author={Caistro Labs},
  year={2024},
  url={https://huggingface.co/tigres2526/CAI-20B-v2}
}

Built by Caistro Labs. We're building autonomous creative strategists—the first of their kind. Marketing gods from gradient descent.

MIT License. Do whatever you want with it.
