# Model Card: multi-hop-rag-reranker

## Model Description
multi-hop-rag-reranker is a highly specialized cross-encoder model based on the microsoft/deberta-v3-base architecture. It is designed for a single, critical task: predicting the relevance of a candidate text passage to a complex, multi-faceted query.
This model has been meticulously trained and fine-tuned through a sequential, two-stage process on distinct and challenging question-answering datasets. This regimen has endowed it with a unique dual capability: it excels at identifying both direct factual evidence (typical of traditional QA) and implicit logical connections (essential for multi-hop reasoning).
The model is intended to be used as a high-precision re-ranking component within advanced Retrieval-Augmented Generation (RAG) pipelines. It takes a query and a candidate sentence as input and outputs a single logit; applying a sigmoid to that logit yields a score between 0 and 1, representing the probability that the sentence is a relevant piece of evidence for answering the query.
Please note that this reranker is an experimental release and is not ready for use in an actual production environment: its training data is limited, so it may not generalize well to out-of-distribution (OOD) data.
## Training Process: A Two-Stage Curriculum
The exceptional performance of this model is a direct result of a deliberate, two-stage training curriculum. This process was designed to first build a foundational understanding of complex reasoning and then refine that understanding with a focus on factual precision.
### Stage 1: Foundational Training on Multi-Hop Reasoning (hotpot_qa)
The initial training phase was conducted on the HotpotQA subset of the TIGER-Lab/LongRAG dataset. This dataset is specifically designed for multi-hop question answering, where the final answer can only be deduced by chaining together information from multiple, disparate documents.
- Objective: To teach the model to recognize "bridge" sentences and sentences that form part of a logical chain, even if they are not semantically identical to the query.
- Data Preprocessing & Feature Engineering: The input was not a simple `query [SEP] sentence` pair. To maximize the learning signal, a rich, structured input format was programmatically generated:
  `[HOP:{hop}] [ROLE:{role}] [ENT:{entities}] [CAUSAL] [REASONING] query [SEP] sentence`
  - `[HOP:X]`: A dynamically calculated feature indicating the sentence's logical distance from the query's core topic.
  - `[ROLE:X]`: The sentence's rhetorical function (e.g., `Main_Claim`, `Supporting_Evidence`), determined through regex-based classification.
  - `[ENT:X,Y,Z]`: Key named entities extracted from the sentence using a high-performance transformer-based NER model (`dslim/bert-base-NER`).
  - `[CAUSAL]` & `[REASONING]`: Special tokens added if the sentence contained causal language or a high density of entities, signaling its importance in an explanatory context.

  The model's tokenizer was extended to include these special tokens, allowing it to learn their semantic meaning.
- Negative Sampling Strategy: To create a challenging learning environment, a sophisticated negative sampling strategy was employed. For each positive example, a set of "hard negatives" (sentences semantically similar to the query but factually incorrect) and "easy negatives" were selected. This prevents the model from relying on simple keyword matching and forces it to learn deeper semantic relevance.
- Training Configuration: A `WeightedTrainer` was used, applying a `pos_weight` of `1.2` to the `BCEWithLogitsLoss` function. This penalizes the model more for false negatives (missing a crucial piece of evidence), which is a critical behavior for a RAG ranker. The model was trained for 3 epochs with a learning rate of `1.5e-5`.
- Outcome: This stage produced a base model (`neural_ranker_final_hotpotqa`) with an exceptional F1 score of 0.9814 and an AUC of 0.9932 on the HotpotQA test set. This model became an expert at identifying sentences that are part of a complex argument.
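To make the preprocessing concrete, here is a minimal sketch of the regex-based role classification and structured input assembly described above. The actual patterns and helper names used in training are not published; `ROLE_PATTERNS`, `classify_role`, and `build_input` are hypothetical, illustrative stand-ins:

```python
import re

# Hypothetical role patterns -- the card only says roles were assigned via
# regex-based classification; these exact rules are illustrative, not the
# ones used in training.
ROLE_PATTERNS = [
    ("Main_Claim", re.compile(r"\b(is|are|was|were)\b.*\b(primary|main|key|most)\b", re.I)),
    ("Supporting_Evidence", re.compile(r"\b(because|due to|as a result|caused by|therefore)\b", re.I)),
]

CAUSAL_RE = re.compile(r"\b(because|due to|caused by|leads? to|results? in)\b", re.I)

def classify_role(sentence: str) -> str:
    """Return the first matching rhetorical role, defaulting to 'Context'."""
    for role, pattern in ROLE_PATTERNS:
        if pattern.search(sentence):
            return role
    return "Context"

def build_input(query: str, sentence: str, hop: int, entities: list) -> str:
    """Assemble the structured '[HOP:..] [ROLE:..] ...' input format."""
    parts = [f"[HOP:{hop}]", f"[ROLE:{classify_role(sentence)}]", f"[ENT:{','.join(entities)}]"]
    if CAUSAL_RE.search(sentence):
        parts.append("[CAUSAL]")
    return f"{' '.join(parts)} {query} [SEP] {sentence}"

print(build_input(
    "Why do perovskite cells degrade?",
    "They degrade because humidity attacks the crystal structure.",
    hop=0,
    entities=["perovskite"],
))
```

In the real pipeline these bracketed markers are registered as special tokens (and the embedding matrix resized) so the model can learn dedicated representations for them.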
### Stage 2: Continual Fine-Tuning on Factual Lookups (nq)
While the HotpotQA model excelled at reasoning, it needed to be adapted to also handle direct, single-hop factual questions with high precision. This was achieved through a continual fine-tuning process on the Natural Questions (NQ) subset of the TIGER-Lab/LongRAG dataset.
- Objective: To transfer the reasoning capabilities of the Stage 1 model and adapt them to the domain of high-precision, factual retrieval without "catastrophic forgetting."
- Data Preprocessing: The same rich, feature-engineered input format was used. The `is_positive_source` label for NQ was determined by whether a sentence contained the ground-truth answer string, a robust method for identifying fact-bearing sentences.
- Training Configuration: The fully trained `neural_ranker_hotpot_stable` model was loaded as the starting point. A significantly lower learning rate (`2e-6`) was used for 2 epochs. This "low and slow" fine-tuning is critical to adapt the model to the new domain without destroying the complex features learned in Stage 1.
- Outcome: This stage produced the final model. It retained its high performance on reasoning tasks while achieving a new, high level of performance on factual lookups, with a final F1 score of 0.9182 and an AUC of 0.9644 on the NQ test set.
## Intended Use & How to Get the Best Performance
This model is intended for use as a re-ranker in a RAG pipeline. To achieve maximum performance, the input at inference time should match the rich, structured format it was trained on.
Example Usage:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the trained model and tokenizer
model_path = "./models/neural_ranker_hotpot_nq_final3"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set to evaluation mode

# Example query and candidate sentence with generated metadata
query = "What is the primary stability issue with perovskite solar cells?"
candidate_sentence = {
    "text": "While perovskite solar cells have achieved high efficiencies, their adoption is crippled by instability from humidity and oxygen.",
    "hop": 0,
    "role": "Main_Claim",
    "entities": ["perovskite solar cells"],
    "is_causal": True,
}

# Construct the feature-rich input string
feature_str = f"[HOP:{candidate_sentence['hop']}] [ROLE:{candidate_sentence['role']}] [ENT:{','.join(candidate_sentence['entities'])}] [CAUSAL]"
input_text = f"{feature_str} {query} [SEP] {candidate_sentence['text']}"

# Tokenize and predict
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
score = torch.sigmoid(logits).item()

print(f"Relevance Score: {score:.4f}")
# Expected output: a high relevance score (e.g., > 0.95)
```
The detailed training script, along with some experimental RAG setups using the fine-tuned neural ranker, is available here.
## Limitations
- Domain Sensitivity: While trained on two diverse datasets, the model's performance may vary on highly specialized or out-of-domain text (e.g., legal documents, poetry).
- Dependence on Input Features: The model's peak performance is achieved when provided with the full, structured input. While it will still function with a simple `query [SEP] sentence` format, its precision might be lower as it cannot leverage its learned understanding of the special feature tokens.
- Not a Standalone Retriever: This is a re-ranker, not a retriever. It is designed to be used on a small set of promising candidates (e.g., the top 50-100) returned by a more efficient first-stage retriever like BM25 or a dense vector search. Running it over an entire corpus would be computationally prohibitive.
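The retrieve-then-rerank pattern above can be sketched as a small function that scores each first-stage candidate and sorts by relevance. The `overlap_score` function below is a toy stand-in for the actual cross-encoder call (tokenize, forward pass, sigmoid), so the sketch runs without the model:

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Score each first-stage candidate against the query and return the
    top_k passages sorted by descending relevance score."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy scorer standing in for the cross-encoder; plain word overlap.
def overlap_score(query: str, passage: str) -> float:
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

candidates = [
    "Perovskite cells degrade under humidity.",
    "Solar panels are installed on rooftops.",
]
print(rerank("perovskite cells humidity", candidates, overlap_score, top_k=1))
```

In a real pipeline, `candidates` would be the top 50-100 passages from BM25 or a dense retriever, and `score_fn` would batch its inputs through the model for efficiency.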
## Benchmarks
On my own queries and their corresponding documents, my neural reranker performs well compared to a set of other recent, similarly small SOTA rerankers:
- `Jina-multilingual-reranker-v2-base` (0.3B)
- `gte-multilingual-reranker-base` (0.3B)
- `BGE-reranker-v2-m3` (0.6B)
- `Qwen3-Reranker-0.6B` (0.6B)
| Model | Size | Perfect Scores | Failed Queries | Failure Rate | Key Weaknesses | Strengths |
|---|---|---|---|---|---|---|
| Custom Finetuned | 184M | 12/12 | None | 0% | None | Perfect accuracy, consistent performance |
| Qwen3-Reranker | 0.6B | 12/12 | None | 0% | None | Perfect accuracy, larger model |
| gte-multilingual | 0.3B | 10/12 | Q4, Q11 | 16.7% | Abstract synthesis, nuanced distinctions | Fast, multilingual support |
| Jina-multilingual | 0.3B | 10/12 | Q6, Q11 | 16.7% | Abstract synthesis, linking strategies | Good balance, multilingual |
| BGE-reranker-v2-m3 | 0.6B | 8/12 | Q2, Q4, Q6, Q11 | 33.3% | Abstract synthesis, nuanced distinctions, linking strategies | Good on factual queries |
For more details, visit here to see the setups as well as the docs & queries.
On bigger test sets, however, the fine-tuned neural ranker shows its weakness when dealing with more diverse, out-of-distribution data. The `.jsonl` files for evaluation are here and here:
### HotpotQA
A test of multi-hop reasoning on encyclopedic text.
| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| BGE v2 m3 | 99.99 | 74.22 | 87.23 |
| Qwen 0.6B | 99.41 | 71.41 | 88.27 |
| GTE Base | 100.00 | 71.09 | 84.83 |
| Jina v2 | 99.95 | 69.80 | 84.30 |
| Custom Finetuned | 100.00 | 15.06 | 16.37 |
### MuSiQue
A more complex, compositional multi-hop reasoning test.
| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| BGE v2 m3 | 100.00 | 75.19 | 92.23 |
| GTE Base | 100.00 | 72.84 | 91.59 |
| Qwen 0.6B | 99.86 | 69.39 | 89.10 |
| Jina v2 | 99.97 | 69.69 | 89.96 |
| Custom Finetuned | 100.00 | 33.15 | 42.16 |
### 2WikiMultiHopQA
A different style of multi-hop reasoning.
| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| GTE Base | 79.33 | 72.38 | 87.65 |
| Qwen 0.6B | 79.39 | 72.12 | 86.68 |
| Jina v2 | 77.97 | 71.04 | 86.69 |
| BGE v2 m3 | 77.41 | 70.35 | 85.73 |
| Custom Finetuned | 30.63 | 26.50 | 36.08 |
### CUAD
A contract-understanding dataset for reranking evaluation.
| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| Qwen 0.6B | 100.00 | 44.52 | 55.94 |
| GTE Multilingual | 99.83 | 40.53 | 46.98 |
| BGE v2 M3 | 99.99 | 39.86 | 47.62 |
| Jina v2 | 100.00 | 35.22 | 41.78 |
| Custom Finetuned | 99.85 | 27.43 | 36.20 |
The data tells a clear and consistent story across all three benchmarks.
My initial hypothesis was: "What if I teach the model the structure of a multi-hop answer using special tokens?" The training worked, and the model learned to recognize those features very well.
The problem is that it seems to have learned to rely exclusively on those features.
- It thinks: "If I see `[HOP:0]` and `[ROLE:Supporting_Evidence]`, this is a good sentence."
- It does not think: "Does the semantic content of this sentence actually help answer the query?"
This is a case of overfitting to the training methodology. The model found a shortcut (the special tokens) and stopped developing a deeper, more robust semantic understanding of relevance. When it was tested on CUAD, 2Wiki, or even just asked to rank within the HotpotQA set (the MAP/MRR scores), this weakness was exposed. It knows which sentences belong in the "answer club," but it has no idea who the president is.
The off-the-shelf models never had this shortcut. They were forced to develop a pure, powerful semantic understanding of the relationship between a query and a passage, which has proven to be a far more generalizable and robust skill.