Model Card: multi-hop-rag-reranker

Model Description

multi-hop-rag-reranker is a highly specialized cross-encoder model based on the microsoft/deberta-v3-base architecture. It is designed for a single, critical task: predicting the relevance of a candidate text passage to a complex, multi-faceted query.

This model has been meticulously trained and fine-tuned through a sequential, two-stage process on distinct and challenging question-answering datasets. This regimen has endowed it with a unique dual capability: it excels at identifying both direct factual evidence (typical of traditional QA) and implicit logical connections (essential for multi-hop reasoning).

The model is intended to be used as a high-precision re-ranking component within advanced Retrieval-Augmented Generation (RAG) pipelines. It takes a query and a candidate sentence as input and outputs a single logit; applying a sigmoid yields a score between 0 and 1, representing the probability that the sentence is a relevant piece of evidence for answering the query.

Please note that this reranker is an experimental version and not ready for use in an actual production environment: its limited training data means it may not generalize well to out-of-distribution (OOD) data.

Training Process: A Two-Stage Curriculum

The exceptional performance of this model is a direct result of a deliberate, two-stage training curriculum. This process was designed to first build a foundational understanding of complex reasoning and then refine that understanding with a focus on factual precision.

Stage 1: Foundational Training on Multi-Hop Reasoning (hotpot_qa)

The initial training phase was conducted on the HotpotQA subset of the TIGER-Lab/LongRAG dataset. This dataset is specifically designed for multi-hop question answering, where the final answer can only be deduced by chaining together information from multiple, disparate documents.

  • Objective: To teach the model to recognize "bridge" sentences and sentences that form part of a logical chain, even if they are not semantically identical to the query.
  • Data Preprocessing & Feature Engineering: The input was not a simple query [SEP] sentence pair. To maximize the learning signal, a rich, structured input format was programmatically generated:
    [HOP:{hop}] [ROLE:{role}] [ENT:{entities}] [CAUSAL] [REASONING] query [SEP] sentence
    
    • [HOP:X]: A dynamically calculated feature indicating the sentence's logical distance from the query's core topic.
    • [ROLE:X]: The sentence's rhetorical function (e.g., Main_Claim, Supporting_Evidence), determined through regex-based classification.
    • [ENT:X,Y,Z]: Key named entities extracted from the sentence using a high-performance transformer-based NER model (dslim/bert-base-NER).
    • [CAUSAL] & [REASONING]: Special tokens added if the sentence contained causal language or a high density of entities, signaling its importance in an explanatory context. The model's tokenizer was extended to include these special tokens, allowing it to learn their semantic meaning.
  • Negative Sampling Strategy: To create a challenging learning environment, a sophisticated negative sampling strategy was employed. For each positive example, a set of "hard negatives" (sentences semantically similar to the query but factually incorrect) and "easy negatives" were selected. This prevents the model from relying on simple keyword matching and forces it to learn deeper semantic relevance.
  • Training Configuration: A WeightedTrainer was used, applying a pos_weight of 1.2 to the BCEWithLogitsLoss function. This penalizes the model more for false negatives (missing a crucial piece of evidence), which is a critical behavior for a RAG ranker. The model was trained for 3 epochs with a learning rate of 1.5e-5.
  • Outcome: This stage produced a base model (neural_ranker_final_hotpotqa) with an exceptional F1 score of 0.9814 and an AUC of 0.9932 on the HotpotQA test set. This model became an expert at identifying sentences that are part of a complex argument.
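The structured input described above can be sketched as a small builder function. The helper names, causal-marker list, and entity-density threshold below are illustrative assumptions; the actual preprocessing pipeline is not included in this card:

```python
# Sketch of the Stage 1 input construction. Helper names and heuristics
# are illustrative assumptions, not the actual training code.

SPECIAL_TOKENS = ["[CAUSAL]", "[REASONING]"]  # registered with the tokenizer

# Assumed causal-language markers; the real classifier may differ.
CAUSAL_MARKERS = ("because", "due to", "caused by", "as a result", "therefore")

def build_input_text(query, sentence, hop, role, entities, entity_density_threshold=3):
    """Assemble [HOP:x] [ROLE:x] [ENT:...] [CAUSAL] [REASONING] query [SEP] sentence."""
    parts = [f"[HOP:{hop}]", f"[ROLE:{role}]", f"[ENT:{','.join(entities)}]"]
    if any(m in sentence.lower() for m in CAUSAL_MARKERS):
        parts.append("[CAUSAL]")
    if len(entities) >= entity_density_threshold:
        parts.append("[REASONING]")
    return f"{' '.join(parts)} {query} [SEP] {sentence}"

example = build_input_text(
    query="Who founded the company that makes the iPhone?",
    sentence="Apple was founded by Steve Jobs because he wanted to build personal computers.",
    hop=1,
    role="Supporting_Evidence",
    entities=["Apple", "Steve Jobs"],
)
print(example)
# → [HOP:1] [ROLE:Supporting_Evidence] [ENT:Apple,Steve Jobs] [CAUSAL] Who founded ...
```

In the actual training code, the special tokens would be registered with the tokenizer (e.g. `tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})` followed by `model.resize_token_embeddings(len(tokenizer))`), so the model can learn embeddings for them.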

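The effect of the `pos_weight` of 1.2 in Stage 1 can be seen in a small pure-Python reimplementation of the standard BCE-with-logits formula (a sketch for illustration, not the training code itself):

```python
import math

def weighted_bce_with_logits(logit, target, pos_weight=1.2):
    """Single-example BCE-with-logits with a positive-class weight,
    matching the formula used by torch.nn.BCEWithLogitsLoss(pos_weight=...)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(pos_weight * target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# A missed positive (low logit, target = 1) is penalized 1.2x harder than
# with the unweighted loss, discouraging false negatives.
miss_pos = weighted_bce_with_logits(-1.0, 1.0)                   # weighted
miss_pos_unweighted = weighted_bce_with_logits(-1.0, 1.0, 1.0)   # unweighted
print(miss_pos, miss_pos_unweighted)
```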
Stage 2: Continual Fine-Tuning on Factual Lookups (nq)

While the HotpotQA model excelled at reasoning, it needed to be adapted to also handle direct, single-hop factual questions with high precision. This was achieved through a continual fine-tuning process on the Natural Questions (NQ) subset of the TIGER-Lab/LongRAG dataset.

  • Objective: To transfer the reasoning capabilities of the Stage 1 model and adapt them to the domain of high-precision, factual retrieval without "catastrophic forgetting."
  • Data Preprocessing: The same rich, feature-engineered input format was used. The is_positive_source label for NQ was determined by whether a sentence contained the ground-truth answer string, a robust method for identifying fact-bearing sentences.
  • Training Configuration: The fully trained neural_ranker_hotpot_stable model was loaded as the starting point. A significantly lower learning rate (2e-6) was used for 2 epochs. This "low and slow" fine-tuning is critical to adapt the model to the new domain without destroying the complex features learned in Stage 1.
  • Outcome: This stage produced the final model. It retained its high performance on reasoning tasks while achieving a new, high level of performance on factual lookups, with a final F1 score of 0.9182 and an AUC of 0.9644 on the NQ test set.
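The answer-containment labeling used for NQ can be sketched as follows. The card only states that a sentence is positive if it contains the ground-truth answer string; the normalization details below are an assumption:

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()

def is_positive_source(sentence, answer):
    """Label a sentence positive if it contains the ground-truth answer string."""
    return normalize(answer) in normalize(sentence)

print(is_positive_source("The Eiffel Tower is 330 metres tall.", "330 metres"))  # True
print(is_positive_source("The tower opened in 1889.", "330 metres"))             # False
```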

Intended Use & How to Get the Best Performance

This model is intended for use as a re-ranker in a RAG pipeline. To achieve maximum performance, the input at inference time should match the rich, structured format it was trained on.

Example Usage:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the trained model and tokenizer
model_path = "./models/neural_ranker_hotpot_nq_final3"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()  # Set to evaluation mode

# Example query and candidate sentence with generated metadata
query = "What is the primary stability issue with perovskite solar cells?"
candidate_sentence = {
    "text": "While perovskite solar cells have achieved high efficiencies, their adoption is crippled by instability from humidity and oxygen.",
    "hop": 0,
    "role": "Main_Claim",
    "entities": ["perovskite solar cells"],
    "is_causal": True
}

# Construct the feature-rich input string
feature_str = f"[HOP:{candidate_sentence['hop']}] [ROLE:{candidate_sentence['role']}] [ENT:{','.join(candidate_sentence['entities'])}] [CAUSAL]"
input_text = f"{feature_str} {query} [SEP] {candidate_sentence['text']}"

# Tokenize and predict
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
    score = torch.sigmoid(logits).item()

print(f"Relevance Score: {score:.4f}")
# Expected output: a high relevance score (e.g., > 0.95)
```

The detailed training script, along with some experimental RAG setups using the fine-tuned neural ranker, is available here.

Limitations

  • Domain Sensitivity: While trained on two diverse datasets, the model's performance may vary on highly specialized or out-of-domain text (e.g., legal documents, poetry).
  • Dependence on Input Features: The model's peak performance is achieved when provided with the full, structured input. While it will still function with a simple query [SEP] sentence format, its precision might be lower as it cannot leverage its learned understanding of the special feature tokens.
  • Not a Standalone Retriever: This is a re-ranker, not a retriever. It is designed to be used on a small set of promising candidates (e.g., the top 50-100) returned by a more efficient first-stage retriever like BM25 or a dense vector search. Running it over an entire corpus would be computationally prohibitive.
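The intended two-stage integration can be sketched as below. The `score_fn` callable is a stand-in: in practice it would wrap the cross-encoder call from the usage example above, and the toy keyword-overlap scorer here exists only so the pipeline shape is runnable:

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Re-rank a small set of first-stage candidates by relevance score.

    score_fn(query, passage) -> float. In a real pipeline this would wrap
    the cross-encoder; here it is kept abstract.
    """
    scored = [(passage, score_fn(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy first-stage results (e.g. the top hits from BM25 or a dense retriever).
candidates = [
    "Perovskite cells degrade rapidly under humidity and oxygen exposure.",
    "Silicon cells dominate the current solar market.",
    "Perovskite solar cells reach high efficiency in lab settings.",
]

def overlap_score(query, passage):
    # Stand-in scorer: fraction of query tokens present in the passage.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q)

top = rerank("perovskite stability issue", candidates, overlap_score, top_k=2)
for passage, score in top:
    print(f"{score:.2f}  {passage}")
```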

Benchmarks

On my own queries and corresponding documents, the fine-tuned neural reranker performs well against a set of recent, similarly sized SOTA rerankers:

  • Jina-multilingual-reranker-v2-base (0.3B)
  • gte-multilingual-reranker-base (0.3B)
  • BGE-reranker-v2-m3 (0.6B)
  • Qwen3-Reranker-0.6B (0.6B)
| Model | Size | Perfect Scores | Failed Queries | Failure Rate | Key Weaknesses | Strengths |
|---|---|---|---|---|---|---|
| Custom Finetuned | 184M | 12/12 | None | 0% | None | Perfect accuracy, consistent performance |
| Qwen3-Reranker | 0.6B | 12/12 | None | 0% | None | Perfect accuracy, larger model |
| gte-multilingual | 0.3B | 10/12 | Q4, Q11 | 16.7% | Abstract synthesis, nuanced distinctions | Fast, multilingual support |
| Jina-multilingual | 0.3B | 10/12 | Q6, Q11 | 16.7% | Abstract synthesis, linking strategies | Good balance, multilingual |
| BGE-reranker-v2-m3 | 0.6B | 8/12 | Q2, Q4, Q6, Q11 | 33.3% | Abstract synthesis, nuanced distinctions, linking strategies | Good on factual queries |

For more details, visit here to see the setup as well as the docs & queries.

On larger test sets, the fine-tuned neural ranker shows its weakness when dealing with more diverse, out-of-distribution data. The .jsonl files used for evaluation are here and here:

HotpotQA

The test of multi-hop reasoning on encyclopedic text.

| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| BGE v2 m3 | 99.99 | 74.22 | 87.23 |
| Qwen 0.6B | 99.41 | 71.41 | 88.27 |
| GTE Base | 100.00 | 71.09 | 84.83 |
| Jina v2 | 99.95 | 69.80 | 84.30 |
| Custom Finetuned | 100.00 | 15.06 | 16.37 |

MuSiQue

A more complex, compositional multi-hop reasoning test.

| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| BGE v2 m3 | 100.00 | 75.19 | 92.23 |
| GTE Base | 100.00 | 72.84 | 91.59 |
| Qwen 0.6B | 99.86 | 69.39 | 89.10 |
| Jina v2 | 99.97 | 69.69 | 89.96 |
| Custom Finetuned | 100.00 | 33.15 | 42.16 |

2WikiMultiHopQA

A different style of multi-hop reasoning.

| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| GTE Base | 79.33 | 72.38 | 87.65 |
| Qwen 0.6B | 79.39 | 72.12 | 86.68 |
| Jina v2 | 77.97 | 71.04 | 86.69 |
| BGE v2 m3 | 77.41 | 70.35 | 85.73 |
| Custom Finetuned | 30.63 | 26.50 | 36.08 |

CUAD

A contract-understanding dataset for reranking evaluation.

| Model | NDCG@10 (Quality) | MAP (Overall) | MRR@10 (First Hit) |
|---|---|---|---|
| Qwen 0.6B | 100.00 | 44.52 | 55.94 |
| GTE Multilingual | 99.83 | 40.53 | 46.98 |
| BGE v2 M3 | 99.99 | 39.86 | 47.62 |
| Jina v2 | 100.00 | 35.22 | 41.78 |
| Custom Finetuned | 99.85 | 27.43 | 36.20 |

The data tells a clear and consistent story across all three benchmarks.

My initial hypothesis was: "What if I teach the model the structure of a multi-hop answer using special tokens?" The training worked, and the model learned to recognize those features remarkably well.

The problem is that it seemed to learn to rely exclusively on those features.

  • It thinks: "If I see [HOP:0] and [ROLE:Supporting_Evidence], this is a good sentence."
  • It does not think: "Does the semantic content of this sentence actually help answer the query?"

This is a case of overfitting to the training methodology. The model found a shortcut (the special tokens) and stopped developing a deeper, more robust semantic understanding of relevance. When it was tested on CUAD, 2Wiki, or even just asked to rank within the HotpotQA set (the MAP/MRR scores), this weakness was exposed. It knows which sentences belong in the "answer club," but it has no idea who the president is.

The off-the-shelf models never had this shortcut. They were forced to develop a pure, powerful semantic understanding of the relationship between a query and a passage, which has proven to be a far more generalizable and robust skill.
