Awesome work, thanks for sharing this!
I ran some experiments comparing FinePDFs and Cosmopedia v2. Cosmopedia v2 consistently gives lower loss and perplexity, but generalization suffers and benchmark scores drop, which makes sense given that the data is more synthetic and easier to model.
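For context, this is roughly the kind of per-dataset loss/perplexity measurement I mean (a minimal sketch, not my exact setup: the checkpoint, repo IDs, configs, column name, and sample count below are placeholders):

```python
# Sketch: compare held-out loss/perplexity of one checkpoint on two corpora.
# Repo IDs, configs, and sample counts are placeholders, not the exact setup.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the small checkpoint being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def heldout_loss(dataset_id, config=None, text_column="text", n_samples=200):
    """Mean per-document token loss (and its exp) over a few streamed samples."""
    ds = load_dataset(dataset_id, config, split="train", streaming=True)
    losses = []
    for i, example in enumerate(ds):
        if i >= n_samples:
            break
        enc = tokenizer(example[text_column], return_tensors="pt",
                        truncation=True, max_length=1024)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
    mean_loss = sum(losses) / len(losses)
    return mean_loss, math.exp(mean_loss)  # rough perplexity proxy

# Assumed repo IDs/configs -- swap in the actual FinePDFs / Cosmopedia v2 sources.
for name, (repo, cfg) in {
    "FinePDFs": ("HuggingFaceFW/finepdfs", "eng_Latn"),
    "Cosmopedia v2": ("HuggingFaceTB/smollm-corpus", "cosmopedia-v2"),
}.items():
    loss, ppl = heldout_loss(repo, cfg)
    print(f"{name}: loss={loss:.3f}, ppl={ppl:.1f}")
```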
My question is more about scaling: do you think the dataset mix you propose works just as well when training models larger than the GPT-2 70M used in your experiments?
In my case, I’m training a significantly larger model (198M parameters, heterogeneous MoE), and I’m trying to understand whether the same mix is still a good default choice, or whether its effectiveness is tied more to the smaller GPT-2 regime.
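For concreteness, here is roughly how I apply the mix on my side (again a sketch: the repo IDs, configs, text column, and sampling ratios are placeholders, not the exact mix from your post):

```python
# Sketch of what I mean by "the mix": fixed sampling ratios over the sources,
# reused unchanged when training the larger MoE.
from datasets import load_dataset, interleave_datasets

# Assumed repo IDs/configs; keep only the text column so the streams align.
finepdfs = load_dataset("HuggingFaceFW/finepdfs", "eng_Latn",
                        split="train", streaming=True).select_columns(["text"])
cosmopedia = load_dataset("HuggingFaceTB/smollm-corpus", "cosmopedia-v2",
                          split="train", streaming=True).select_columns(["text"])

# Sample documents from each source with fixed probabilities.
mixed = interleave_datasets(
    [finepdfs, cosmopedia],
    probabilities=[0.7, 0.3],  # placeholder ratios
    seed=42,
    stopping_strategy="all_exhausted",
)
```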
I’d really appreciate your intuition on whether the mix itself scales, independent of architecture details.