base_model:
- Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier

[[Paper](ARXIV_TBD)] [[Website](WEBSITE_TBD)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) model with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

##### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) (using only text contents)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier) (using URL and text contents)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL) (using only text contents)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)
```

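The model predicts one of the 24 format categories. A minimal sketch for turning the logits into a label, assuming the fine-tuned checkpoint populates `id2label` in its config (not shown in this card; inspect `model.config.id2label` to confirm):
```python
import torch

# Convert logits to probabilities over the format categories
# and look up the predicted category name.
probs = torch.softmax(outputs.logits, dim=-1)
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred], probs[0, pred].item())
```
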
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ and loading the model like this:
```python
AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16  # assumed here; the gte notes linked below recommend fp16 with xformers
)
```
See details [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers).
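
For annotating documents in bulk, a sketch of batched GPU inference (reusing `model` and `tokenizer` from the usage example above; the batch contents and `max_length` are illustrative, and gte-base-en-v1.5 accepts sequences up to 8192 tokens):
```python
import torch

model = model.to("cuda").eval()

# Assemble each document in the expected "{url}\n\n{text}" format.
pages = ["http://www.example.com\n\nHow to make a good sandwich? [Click here to read article]"]

with torch.inference_mode():
    inputs = tokenizer(pages, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt").to("cuda")
    logits = model(**inputs).logits
```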
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}
```