---
license: apache-2.0
language:
- en
- zh
tags:
- long-context
- infllm
- sparse-attention
---

# InfLLM-V2-Long-Sparse-Base

**Project Links**: [[Paper](https://arxiv.org/abs/2509.24663)] [[InfLLM-V2 Models](https://huggingface.co/openbmb/InfLLM-V2-Long-Sparse-Base)] [[CUDA Kernel Code](https://github.com/OpenBMB/infllmv2_cuda_impl)]

---

## 🚀 Model Description

`InfLLM-V2-Long-Sparse-Base` is the final, long-context-capable model in the InfLLM-V2 series, featuring a native **sparse attention** mechanism.

This model is the result of continued training on the [**InfLLM-V2-Short-Dense-Base**](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base) model using just **5B tokens** of high-quality long-text data from the [**InfLLM-V2-data-5B**](https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B) dataset.

Its core feature is the InfLLM-V2 architecture, which allows it to switch seamlessly between high-performance dense attention for short texts and highly efficient sparse attention for long texts. This enables the model to process extremely long sequences with significant end-to-end acceleration while preserving performance on standard tasks.

## 📌 The Training Journey

This model is the final product of the transparent and reproducible InfLLM-V2 training workflow:

- **Step 1: Start from the base model.**
  - The journey begins with [**InfLLM-V2-Short-Dense-Base**](https://huggingface.co/openbmb/InfLLM-V2-Short-Dense-Base), a model pre-trained on short texts.

- **Step 2: Continue training on long-text data.**
  - We then perform continued training using the [**InfLLM-V2-data-5B**](https://huggingface.co/datasets/openbmb/InfLLM-V2-data-5B) dataset.

- **Step 3: Get the final long-context model.**
  - ➡️ The result is **`InfLLM-V2-Long-Sparse-Base` (this model)**, which unlocks powerful long-context capabilities.

## 💻 Usage

InfLLM-V2-Long-Sparse-Base supports both dense-attention and sparse-attention inference. vLLM and SGLang currently support only the dense mode; for sparse inference, use Hugging Face Transformers or CPM.cu.

- Dense attention inference: vLLM, SGLang, Hugging Face Transformers (a minimal vLLM sketch follows below)
- Sparse attention inference: Hugging Face Transformers, CPM.cu
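
For the dense path, the following is a minimal vLLM sketch. It assumes this model loads through vLLM's standard `trust_remote_code` path and that default engine settings are acceptable; treat it as illustrative rather than an officially documented recipe.

```python
# Minimal dense-attention inference sketch with vLLM (illustrative; see assumptions above).
from vllm import LLM, SamplingParams

llm = LLM(model="openbmb/InfLLM-V2-Long-Sparse-Base", trust_remote_code=True)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=1024)

# Generate a completion for a single prompt (vLLM also accepts a list of prompts).
outputs = llm.generate(["Write an article about Artificial Intelligence."], sampling_params)
print(outputs[0].outputs[0].text)
```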

**To facilitate research on sparse attention, we provide [InfLLM-V2 Training Kernels](https://github.com/OpenBMB/infllmv2_cuda_impl) and [InfLLM-V2 Inference Kernels](https://github.com/OpenBMB/CPM.cu).**

### Inference with Transformers
InfLLM-V2-Long-Sparse-Base requires `transformers>=4.56`.
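
For example, you can install or upgrade it with:

```bash
pip install "transformers>=4.56"
```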

- **Inference with Dense Attention**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)

path = 'openbmb/InfLLM-V2-Long-Sparse-Base'
device = "cuda"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)

# Users can directly use the chat interface
# response, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
# print(response)

# Users can also use the generate interface
messages = [
    {"role": "user", "content": "Write an article about Artificial Intelligence."},
]
prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)

model_outputs = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    top_p=0.95,
    temperature=0.6
)

# Strip the prompt tokens so only the newly generated text is decoded
output_token_ids = [
    model_outputs[i][len(model_inputs['input_ids'][i]):] for i in range(len(model_inputs['input_ids']))
]

responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
print(responses)
```

- **Inference with Sparse Attention**
InfLLM-V2-Long-Sparse-Base supports `InfLLM v2`, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https://github.com/OpenBMB/infllmv2_cuda_impl) library.

You can install it by running the following commands:
```bash
git clone -b feature_infer https://github.com/OpenBMB/infllmv2_cuda_impl.git
cd infllmv2_cuda_impl
git submodule update --init --recursive
pip install -e .  # or python setup.py install
```

To enable InfLLM v2, you need to add the `sparse_config` field to `config.json`:
```json
{
    ...,
    "sparse_config": {
        "kernel_size": 32,
        "kernel_stride": 16,
        "init_blocks": 1,
        "block_size": 64,
        "window_size": 2048,
        "topk": 64,
        "use_nope": false,
        "dense_len": 8192
    }
}
```

These parameters control the behavior of InfLLM v2:
* `kernel_size` (default: 32): The size of semantic kernels.
* `kernel_stride` (default: 16): The stride between adjacent kernels.
* `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.
* `block_size` (default: 64): The block size for key-value blocks.
* `window_size` (default: 2048): The size of the local sliding window.
* `topk` (default: 64): Specifies that each token computes attention with only the top-k most relevant key-value blocks.
* `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.
* `dense_len` (default: 8192): Since sparse attention offers limited benefits for short sequences, the model uses standard (dense) attention for sequences shorter than `dense_len` tokens and switches to sparse attention for longer sequences. Set this to `-1` to always use sparse attention regardless of sequence length.
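
If you prefer not to edit `config.json` on disk, the same `sparse_config` can in principle be supplied at load time. The sketch below is an assumption-laden illustration: it presumes the remote-code modeling files read `sparse_config` from the loaded config object, which this README does not explicitly confirm; the documented path is editing `config.json` as shown above.

```python
# Sketch: enable InfLLM v2 sparse attention by setting `sparse_config` on the config
# object before loading. Assumption: the remote-code model honors this attribute;
# the documented approach is to edit config.json directly.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

path = "openbmb/InfLLM-V2-Long-Sparse-Base"
config = AutoConfig.from_pretrained(path, trust_remote_code=True)
config.sparse_config = {
    "kernel_size": 32,
    "kernel_stride": 16,
    "init_blocks": 1,
    "block_size": 64,
    "window_size": 2048,
    "topk": 64,
    "use_nope": False,
    "dense_len": 8192,  # set to -1 to force sparse attention at every length
}

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path,
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
```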

## Citation

If you use our work in your research, please cite our paper:

```bibtex
@misc{zhao2025infllmv2densesparseswitchableattention,
      title={InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation},
      author={Weilin Zhao and Zihan Zhou and Zhou Su and Chaojun Xiao and Yuxuan Li and Yanghao Li and Yudi Zhang and Weilun Zhao and Zhen Li and Yuxiang Huang and Ao Sun and Xu Han and Zhiyuan Liu},
      year={2025},
      eprint={2509.24663},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.24663},
}
```