base_model:
- Alibaba-NLP/gte-base-en-v1.5
---

# WebOrganizer/FormatClassifier

[[Paper](ARXIV_TBD)] [[Website](WEBSITE_TBD)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]

The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) model with 140M parameters, fine-tuned on the following training data:
1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)

##### All Domain Classifiers
- [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
- [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL) (using only text contents)
- [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier) (using URL and text contents)
- [WebOrganizer/TopicClassifier-NoURL](https://huggingface.co/WebOrganizer/TopicClassifier-NoURL) (using only text contents)

## Usage

This classifier expects input in the following format:
```
{url}

{text}
```

Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False)

web_page = """http://www.example.com

How to make a good sandwich? [Click here to read article]"""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)
```

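The model predicts one of the 24 format categories. A minimal sketch for turning the logits into a label, assuming the fine-tuned checkpoint populates `id2label` in its config (not shown in this card; inspect `model.config.id2label` to confirm):
```python
import torch

# Convert logits to probabilities over the format categories
# and look up the predicted category name.
probs = torch.softmax(outputs.logits, dim=-1)
pred = probs.argmax(dim=-1).item()
print(model.config.id2label[pred], probs[0, pred].item())
```
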
We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This __requires installing `xformers`__ and loading the model like this:
```python
AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.float16  # assumed here; the gte notes linked below recommend fp16 with xformers
)
```
See details [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers).
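
For annotating documents in bulk, a sketch of batched GPU inference (reusing `model` and `tokenizer` from the usage example above; the batch contents and `max_length` are illustrative, and gte-base-en-v1.5 accepts sequences up to 8192 tokens):
```python
import torch

model = model.to("cuda").eval()

# Assemble each document in the expected "{url}\n\n{text}" format.
pages = ["http://www.example.com\n\nHow to make a good sandwich? [Click here to read article]"]

with torch.inference_mode():
    inputs = tokenizer(pages, padding=True, truncation=True,
                       max_length=8192, return_tensors="pt").to("cuda")
    logits = model(**inputs).logits
```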
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}
```