Commit 1ea9bed by nielsr (HF Staff) · verified · 1 Parent(s): e9e4c0d

Improve model card for InternVL2_5-8B: Add full paper abstract


This PR updates the model card for InternVL2_5-8B to provide a more comprehensive overview of the model.

The "Introduction" section has been replaced with the full abstract from the paper "[Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling](https://huggingface.co/papers/2412.05271)". This ensures that the model card offers a detailed and informative summary of the model's contributions and performance.

The redundant sentence "HuggingFace demo see this https URL" from the end of the abstract was removed, as a direct link to the Hugging Face demo space is already included in the model card's header.

All other existing content, including metadata, links to the paper, GitHub repository, project page, and sample usage code snippets, remains unchanged, as it already fulfills the model card requirements. The `library_name: transformers` metadata is supported by the code examples in the "Quick Start" section.

Files changed (1)
  1. README.md +74 -41
README.md CHANGED
@@ -1,18 +1,18 @@
1
  ---
2
- license: mit
3
- pipeline_tag: image-text-to-text
4
- library_name: transformers
5
  base_model:
6
- - OpenGVLab/InternViT-300M-448px-V2_5
7
- - internlm/internlm2_5-7b-chat
8
- base_model_relation: merge
 
9
  language:
10
- - multilingual
 
 
 
11
  tags:
12
- - internvl
13
- - custom_code
14
- datasets:
15
- - HuggingFaceFV/finevideo
16
  ---
17
 
18
  # InternVL2_5-8B
@@ -25,11 +25,9 @@ datasets:
25
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
26
  </div>
27
 
28
- ## Introduction
29
 
30
- We are excited to introduce **InternVL 2.5**, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality.
31
-
32
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/5HDAGOQOZvS1EtI107Ac-.png)
33
 
34
  ## InternVL 2.5 Family
35
 
@@ -121,14 +119,14 @@ To address this challenge and support future research, we designed an efficient
121
 
122
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
123
 
124
- 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
125
- 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
126
- 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
127
 
128
  For **multimodal data**, two strategies are used:
129
 
130
- 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
131
- 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
132
 
133
  #### Training Data
134
 
@@ -360,40 +358,50 @@ generation_config = dict(max_new_tokens=1024, do_sample=True)
360
  # pure-text conversation (纯文本对话)
361
  question = 'Hello, who are you?'
362
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
363
- print(f'User: {question}\nAssistant: {response}')
 
364
 
365
  question = 'Can you tell me a story?'
366
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
367
- print(f'User: {question}\nAssistant: {response}')
 
368
 
369
  # single-image single-round conversation (单图单轮对话)
370
- question = '<image>\nPlease describe the image shortly.'
 
371
  response = model.chat(tokenizer, pixel_values, question, generation_config)
372
- print(f'User: {question}\nAssistant: {response}')
 
373
 
374
  # single-image multi-round conversation (单图多轮对话)
375
- question = '<image>\nPlease describe the image in detail.'
 
376
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
377
- print(f'User: {question}\nAssistant: {response}')
 
378
 
379
  question = 'Please write a poem according to the image.'
380
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
381
- print(f'User: {question}\nAssistant: {response}')
 
382
 
383
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
384
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
385
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
386
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
387
 
388
- question = '<image>\nDescribe the two images in detail.'
 
389
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
390
  history=None, return_history=True)
391
- print(f'User: {question}\nAssistant: {response}')
 
392
 
393
  question = 'What are the similarities and differences between these two images.'
394
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
395
  history=history, return_history=True)
396
- print(f'User: {question}\nAssistant: {response}')
 
397
 
398
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
399
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -401,17 +409,21 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
401
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
402
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
403
 
404
- question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
 
 
405
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
406
  num_patches_list=num_patches_list,
407
  history=None, return_history=True)
408
- print(f'User: {question}\nAssistant: {response}')
 
409
 
410
  question = 'What are the similarities and differences between these two images.'
411
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
412
  num_patches_list=num_patches_list,
413
  history=history, return_history=True)
414
- print(f'User: {question}\nAssistant: {response}')
 
415
 
416
  # batch inference, single image per sample (单图批处理)
417
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
@@ -419,13 +431,15 @@ pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat1
419
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
420
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
421
 
422
- questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
 
423
  responses = model.batch_chat(tokenizer, pixel_values,
424
  num_patches_list=num_patches_list,
425
  questions=questions,
426
  generation_config=generation_config)
427
  for question, response in zip(questions, responses):
428
- print(f'User: {question}\nAssistant: {response}')
 
429
 
430
  # video multi-round conversation (视频多轮对话)
431
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
@@ -463,17 +477,24 @@ def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=3
463
  video_path = './examples/red-panda.mp4'
464
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
465
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
466
- video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
 
467
  question = video_prefix + 'What is the red panda doing?'
468
- # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
 
 
 
 
469
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
470
  num_patches_list=num_patches_list, history=None, return_history=True)
471
- print(f'User: {question}\nAssistant: {response}')
 
472
 
473
  question = 'Describe this video in detail.'
474
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
475
  num_patches_list=num_patches_list, history=history, return_history=True)
476
- print(f'User: {question}\nAssistant: {response}')
 
477
  ```
478
 
479
  #### Streaming Output
@@ -555,7 +576,9 @@ image_urls=[
555
 
556
  images = [load_image(img_url) for img_url in image_urls]
557
  # Numbering images improves multi-image conversations
558
- response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
 
 
559
  print(response.text)
560
  ```
561
 
@@ -656,7 +679,7 @@ If you find this project useful in your research, please consider citing:
656
  year={2024}
657
  }
658
  @article{gao2024mini,
659
- title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
660
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
661
  journal={arXiv preprint arXiv:2410.16261},
662
  year={2024}
@@ -675,3 +698,13 @@ If you find this project useful in your research, please consider citing:
675
  year={2024}
676
  }
677
  ```
 
 
 
 
 
 
 
 
 
 
 
1
  ---
 
 
 
2
  base_model:
3
+ - OpenGVLab/InternViT-300M-448px-V2_5
4
+ - internlm/internlm2_5-7b-chat
5
+ datasets:
6
+ - HuggingFaceFV/finevideo
7
  language:
8
+ - multilingual
9
+ library_name: transformers
10
+ license: mit
11
+ pipeline_tag: image-text-to-text
12
  tags:
13
+ - internvl
14
+ - custom_code
15
+ base_model_relation: merge
 
16
  ---
17
 
18
  # InternVL2_5-8B
 
25
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
26
  </div>
27
 
28
+ ## Abstract
29
 
30
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image / video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
 
 
31
 
32
  ## InternVL 2.5 Family
33
 
 
119
 
120
  The pipeline includes two modules. For **pure-text data**, three key strategies are used:
121
 
122
+ 1. **LLM-Based Quality Scoring**: Each sample is scored (0–10) using a pre-trained LLM with domain-specific prompts. Samples scoring below a threshold (e.g., 7) are removed to ensure high-quality data.
123
+ 2. **Repetition Detection**: Repetitive samples are flagged using LLM-based prompts and manually reviewed. Samples scoring below a stricter threshold (e.g., 3) are excluded to avoid repetitive patterns.
124
+ 3. **Heuristic Rule-Based Filtering**: Anomalies like abnormal sentence lengths or duplicate lines are detected using rules. Flagged samples undergo manual verification to ensure accuracy before removal.
125
 
126
  For **multimodal data**, two strategies are used:
127
 
128
+ 1. **Repetition Detection**: Repetitive samples in non-academic datasets are flagged and manually reviewed to prevent pattern loops. High-quality datasets are exempt from this process.
129
+ 2. **Heuristic Rule-Based Filtering**: Similar rules are applied to detect visual anomalies, with flagged data verified manually to maintain integrity.
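
The pure-text strategies above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Sample` fields, the helper names, and the `max_repeats` / `max_words` limits are hypothetical, and the LLM scores are assumed to have been computed elsewhere. Only the example thresholds (7 and 3) come from the text.

```python
from collections import Counter
from dataclasses import dataclass

QUALITY_THRESHOLD = 7     # "e.g., 7" in the text
REPETITION_THRESHOLD = 3  # "e.g., 3" in the text


@dataclass
class Sample:
    text: str
    quality_score: float     # hypothetical: 0-10 score from an LLM quality prompt
    repetition_score: float  # hypothetical: 0-10 score from an LLM repetition prompt


def has_duplicate_lines(text: str, max_repeats: int = 3) -> bool:
    """Heuristic rule: flag text in which any non-empty line repeats too often."""
    counts = Counter(line.strip() for line in text.splitlines() if line.strip())
    return any(n > max_repeats for n in counts.values())


def has_abnormal_sentence_length(text: str, max_words: int = 200) -> bool:
    """Heuristic rule: flag text that contains an extremely long sentence."""
    sentences = text.replace("!", ".").replace("?", ".").split(".")
    return any(len(s.split()) > max_words for s in sentences)


def filter_samples(samples):
    """Return (kept, flagged_for_manual_review) for a list of samples."""
    kept, flagged = [], []
    for s in samples:
        if s.quality_score < QUALITY_THRESHOLD:
            continue  # strategy 1: drop low-quality samples
        if s.repetition_score < REPETITION_THRESHOLD:
            continue  # strategy 2: drop samples scored as repetitive
        if has_duplicate_lines(s.text) or has_abnormal_sentence_length(s.text):
            flagged.append(s)  # strategy 3: hold for manual verification
            continue
        kept.append(s)
    return kept, flagged
```

In this sketch, samples caught by the heuristic rules are returned separately rather than deleted, mirroring the manual verification step described above; the repetition step is simplified to a plain threshold and omits the manual review the text mentions.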
130
 
131
  #### Training Data
132
 
 
358
  # pure-text conversation (纯文本对话)
359
  question = 'Hello, who are you?'
360
  response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
361
+ print(f'User: {question}\nAssistant: {response}')
363
 
364
  question = 'Can you tell me a story?'
365
  response, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)
366
+ print(f'User: {question}\nAssistant: {response}')
368
 
369
  # single-image single-round conversation (单图单轮对话)
370
+ question = '<image>\nPlease describe the image shortly.'
372
  response = model.chat(tokenizer, pixel_values, question, generation_config)
373
+ print(f'User: {question}\nAssistant: {response}')
375
 
376
  # single-image multi-round conversation (单图多轮对话)
377
+ question = '<image>\nPlease describe the image in detail.'
379
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
380
+ print(f'User: {question}\nAssistant: {response}')
382
 
383
  question = 'Please write a poem according to the image.'
384
  response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)
385
+ print(f'User: {question}\nAssistant: {response}')
387
 
388
  # multi-image multi-round conversation, combined images (多图多轮对话,拼接图像)
389
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
390
  pixel_values2 = load_image('./examples/image2.jpg', max_num=12).to(torch.bfloat16).cuda()
391
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
392
 
393
+ question = '<image>\nDescribe the two images in detail.'
395
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
396
  history=None, return_history=True)
397
+ print(f'User: {question}\nAssistant: {response}')
399
 
400
  question = 'What are the similarities and differences between these two images.'
401
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
402
  history=history, return_history=True)
403
+ print(f'User: {question}\nAssistant: {response}')
405
 
406
  # multi-image multi-round conversation, separate images (多图多轮对话,独立图像)
407
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
409
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
410
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
411
 
412
+ question = 'Image-1: <image>\nImage-2: <image>\nDescribe the two images in detail.'
415
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
416
  num_patches_list=num_patches_list,
417
  history=None, return_history=True)
418
+ print(f'User: {question}\nAssistant: {response}')
420
 
421
  question = 'What are the similarities and differences between these two images.'
422
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
423
  num_patches_list=num_patches_list,
424
  history=history, return_history=True)
425
+ print(f'User: {question}\nAssistant: {response}')
427
 
428
  # batch inference, single image per sample (单图批处理)
429
  pixel_values1 = load_image('./examples/image1.jpg', max_num=12).to(torch.bfloat16).cuda()
 
431
  num_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]
432
  pixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)
433
 
434
+ questions = ['<image>\nDescribe the image in detail.'] * len(num_patches_list)
436
  responses = model.batch_chat(tokenizer, pixel_values,
437
  num_patches_list=num_patches_list,
438
  questions=questions,
439
  generation_config=generation_config)
440
  for question, response in zip(questions, responses):
441
+     print(f'User: {question}\nAssistant: {response}')
443
 
444
  # video multi-round conversation (视频多轮对话)
445
  def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
 
477
  video_path = './examples/red-panda.mp4'
478
  pixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)
479
  pixel_values = pixel_values.to(torch.bfloat16).cuda()
480
+ video_prefix = ''.join([f'Frame{i+1}: <image>\n' for i in range(len(num_patches_list))])
482
  question = video_prefix + 'What is the red panda doing?'
483
+ # Frame1: <image>\nFrame2: <image>\n...\nFrame8: <image>\n{question}
488
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
489
  num_patches_list=num_patches_list, history=None, return_history=True)
490
+ print(f'User: {question}\nAssistant: {response}')
492
 
493
  question = 'Describe this video in detail.'
494
  response, history = model.chat(tokenizer, pixel_values, question, generation_config,
495
  num_patches_list=num_patches_list, history=history, return_history=True)
496
+ print(f'User: {question}\nAssistant: {response}')
498
  ```
499
 
500
  #### Streaming Output
 
576
 
577
  images = [load_image(img_url) for img_url in image_urls]
578
  # Numbering images improves multi-image conversations
579
+ response = pipe((f'Image-1: {IMAGE_TOKEN}\nImage-2: {IMAGE_TOKEN}\ndescribe these two images', images))
582
  print(response.text)
583
  ```
584
 
 
679
  year={2024}
680
  }
681
  @article{gao2024mini,
682
+ title={Mini-internvl: A flexible-transfer pocket multimodal model with 5\% parameters and 90\% performance},
683
  author={Gao, Zhangwei and Chen, Zhe and Cui, Erfei and Ren, Yiming and Wang, Weiyun and Zhu, Jinguo and Tian, Hao and Ye, Shenglong and He, Junjun and Zhu, Xizhou and others},
684
  journal={arXiv preprint arXiv:2410.16261},
685
  year={2024}
 
698
  year={2024}
699
  }
700
  ```
701
+
702
+ ## Acknowledgement
703
+
704
+ InternVL is built with reference to the code of the following projects: [OpenAI CLIP](https://github.com/openai/CLIP), [Open CLIP](https://github.com/mlfoundations/open_clip), [CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark), [EVA](https://github.com/baaivision/EVA/tree/master), [InternImage](https://github.com/OpenGVLab/InternImage), [ViT-Adapter](https://github.com/czczup/ViT-Adapter), [MMSegmentation](https://github.com/open-mmlab/mmsegmentation), [Transformers](https://github.com/huggingface/transformers), [DINOv2](https://github.com/facebookresearch/dinov2), [BLIP-2](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [Qwen-VL](https://github.com/QwenLM/Qwen-VL/tree/master/eval_mm), and [LLaVA-1.5](https://github.com/haotian-liu/LLaVA). Thanks for their awesome work!
705
+
706
+ ______________________________________________________________________
707
+
708
+ Scan the following QR code to join our WeChat group.
709
+
710
+ <p align="center"><img width="300" alt="image" src="https://github.com/user-attachments/assets/f776df09-ebba-4fd5-80c2-fec4ff1518be"></p>