Contamination results updated based on ``https://arxiv.org/abs/2311.06233``
Contamination results for IMDB, AG News, Yelp, RTE, WNLI, SAMSum, and XSum. New contamination levels were also found for the HumanEval, DROP, and GSM8k datasets. GPT-4 and GPT-3.5 are the base models in the experiments.
- contamination_report.csv +18 -0
contamination_report.csv
@@ -1,5 +1,23 @@
 Evaluation Dataset;Subset;Contaminated Source;Model or corpus;Train Split;Development Split;Test Split;Approach;Reference;PR
 
+gsm8k;;GPT-4;model;79.00;;;model-based;https://arxiv.org/abs/2311.06233;8
+ucinlp/drop;;GPT-4;model;;44.00;;model-based;https://arxiv.org/abs/2311.06233;8
+openai_humaneval;;GPT-4;model;;;56.71;model-based;https://arxiv.org/abs/2311.06233;8
+imdb;;GPT-4;model;;;82.00;model-based;https://arxiv.org/abs/2311.06233;8
+imdb;;GPT-3.5;model;;;55.00;model-based;https://arxiv.org/abs/2311.06233;8
+ag_news;;GPT-4;model;;;91.00;model-based;https://arxiv.org/abs/2311.06233;8
+ag_news;;GPT-3.5;model;;;82.00;model-based;https://arxiv.org/abs/2311.06233;8
+yelp_review_full;;GPT-4;model;;;80.00;model-based;https://arxiv.org/abs/2311.06233;8
+yelp_review_full;;GPT-3.5;model;;;13.00;model-based;https://arxiv.org/abs/2311.06233;8
+nyu-mll/glue;rte;GPT-4;model;;60.00;;model-based;https://arxiv.org/abs/2311.06233;8
+nyu-mll/glue;rte;GPT-3.5;model;;71.00;;model-based;https://arxiv.org/abs/2311.06233;8
+nyu-mll/glue;wnli;GPT-4;model;;50.70;;model-based;https://arxiv.org/abs/2311.06233;8
+nyu-mll/glue;wnli;GPT-3.5;model;;12.68;;model-based;https://arxiv.org/abs/2311.06233;8
+samsum;;GPT-4;model;;;77.00;model-based;https://arxiv.org/abs/2311.06233;8
+samsum;;GPT-3.5;model;;;74.00;model-based;https://arxiv.org/abs/2311.06233;8
+EdinburghNLP/xsum;;GPT-4;model;;;95.00;model-based;https://arxiv.org/abs/2311.06233;8
+EdinburghNLP/xsum;;GPT-3.5;model;;;79.00;model-based;https://arxiv.org/abs/2311.06233;8
+
 conll2003;;GPT-3.5;model;100.0;100.0;100.0;model-based;https://hitz-zentroa.github.io/lm-contamination/blog/;7
 nyu-mll/glue;mnli;GPT-3.5;model;100.0;100.0;;model-based;https://hitz-zentroa.github.io/lm-contamination/blog/;7
 rajpurkar/squad_v2;;GPT-3.5;model;100.0;100.0;;model-based;https://hitz-zentroa.github.io/lm-contamination/blog/;7
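For readers who want to consume the report, here is a minimal sketch of parsing the semicolon-delimited format using only Python's standard `csv` module. The sample rows are copied from this diff. The `load_report` helper and the `SPLIT_COLUMNS` constant are hypothetical names chosen for illustration and are not part of the repository. The sketch assumes that an empty split cell means no contamination figure was reported for that split.

```python
import csv
import io

# A few rows copied from this diff; in practice the text would come from
# contamination_report.csv.
SAMPLE = """\
Evaluation Dataset;Subset;Contaminated Source;Model or corpus;Train Split;Development Split;Test Split;Approach;Reference;PR

gsm8k;;GPT-4;model;79.00;;;model-based;https://arxiv.org/abs/2311.06233;8
nyu-mll/glue;rte;GPT-3.5;model;;71.00;;model-based;https://arxiv.org/abs/2311.06233;8
EdinburghNLP/xsum;;GPT-4;model;;;95.00;model-based;https://arxiv.org/abs/2311.06233;8
"""

# Columns that hold a contamination percentage (hypothetical helper constant).
SPLIT_COLUMNS = ("Train Split", "Development Split", "Test Split")


def load_report(text):
    """Parse the report text; empty split cells become None (assumed: not reported)."""
    rows = []
    # DictReader skips fully blank lines, such as the one after the header.
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        for col in SPLIT_COLUMNS:
            row[col] = float(row[col]) if row[col] else None
        rows.append(row)
    return rows


rows = load_report(SAMPLE)
for row in rows:
    # Keep only the splits for which a percentage was reported.
    reported = {c: row[c] for c in SPLIT_COLUMNS if row[c] is not None}
    print(row["Evaluation Dataset"], row["Subset"] or "-", row["Contaminated Source"], reported)
```

Mapping empty cells to `None` rather than `0.0` keeps "not reported" distinct from "measured as zero", which matters when aggregating across splits.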