# Model Card for Qwen2.5-7B-GRRM (Group Relative Reward Model)

## Model Details

### Model Description
Qwen2.5-7B-GRRM is a Generative Reward Model (GenRM) specifically optimized for Machine Translation (MT). It implements the Group Quality Metric (GQM) paradigm: the model evaluates a group of candidate translations jointly (rather than pointwise scoring), then produces a relative ranking and relative reward scores suitable for GRPO-style group-based policy optimization.
The model is initialized from Qwen2.5-7B, cold-started via Supervised Fine-Tuning (SFT) on Chinese–English data, and subsequently optimized for ranking accuracy using Reinforcement Learning with Verifiable Rewards (RLVR).
- Model type: Generative Reward Model
- Language(s): Multilingual (trained on Zh–En; demonstrated generalization to additional language pairs in the paper), including English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Russian, Chinese
- License: Apache License 2.0
- Finetuned from model: Qwen2.5-7B
### Model Sources
- Paper: GRRM: Group Relative Reward Modeling for Machine Translation
- Repository: https://github.com/NJUNLP/GRRM
## Uses

### Direct Use
This model is intended to be used as a reward model for MT outputs, especially within GRPO-style policy optimization.
Typical direct usage:
- Given an MT source and a set of candidate translations generated by a policy, call GRRM once on the whole group to obtain relative rewards, then compute advantages within the group.
- Use as an MT evaluation component where pairwise/groupwise comparison is preferred to absolute scoring.
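As a rough sketch of the first workflow, the within-group scores GRRM emits can be turned into GRPO-style advantages (mean-centered and std-normalized within the group). The `scores` dict below is a hypothetical result of parsing the model's `Scores:` section, not an actual API of this repository.

```python
import statistics

def group_relative_advantages(scores):
    """Turn within-group reward scores into standard GRPO advantages:
    center by the group mean and divide by the group standard deviation."""
    rewards = list(scores.values())
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-tied groups
    return {cand: (r - mean) / std for cand, r in scores.items()}

# e.g. parsed from a GRRM reply ending in "A: 8, B: 3"
advantages = group_relative_advantages({"A": 8, "B": 3})
```

Candidates scored above the group average receive positive advantage, so only within-group differences matter, which matches the relative nature of GRRM's scores.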
### Input / Output Format

#### Input format
To match the training distribution, the input should follow these conventions:
- Up to 4 candidates per group (recommended; see limitations below).
- Candidate identifiers should be uppercase letters: A, B, C, D.
- Each candidate translation must be wrapped in a fenced code block (triple backticks) to support multi-line outputs and reduce formatting ambiguity.
- Provide the source text and then list candidates.
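The conventions above can be sketched as a small prompt builder. The instruction wording mirrors the quickstart prompt shown later in this card; the helper itself is illustrative, not part of the official repository.

```python
def build_group_prompt(source, candidates, src_lang="English", tgt_lang="Chinese"):
    """Format a source text and 2-4 candidates in the GQM input layout:
    labeled A-D, each translation wrapped in a fenced code block."""
    assert 2 <= len(candidates) <= 4, "trained with 2-4 candidates per group"
    parts = [
        f"Given a source text in {src_lang} and multiple translation candidates "
        f"in {tgt_lang}. Perform a step by step analysis and comparison of the "
        "translation quality for the candidates. Finally, rank and score the "
        "candidates with integer scores on a scale from 0 to 10.",
        f"Source text:\n```\n{source}\n```",
    ]
    for label, cand in zip("ABCD", candidates):
        parts.append(f"Translation {label}:\n```\n{cand}\n```")
    return "\n".join(parts)
```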
#### Output format
The model is trained to produce three sections in Markdown:
- Comparative analysis (free-form, but in Markdown; often includes bullet points).
- Final ranking in a clear, ordered form (e.g., A > C > B).
- Scores: integer scores (0–10) for each candidate, strictly consistent with the ranking (ties allowed if explicitly ranked as ties).
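A minimal regex sketch of extracting the ranking and scores from a reply follows; the repository's official parsing script should be preferred in practice, and the section headers assumed here match the output example in this card.

```python
import re

def parse_grrm_output(text):
    """Extract the final ranking (list of labels) and per-candidate
    integer scores from a GRRM reply. Assumes the '### Final Ranking:'
    and '### Scores:' sections shown in the output example."""
    rank_match = re.search(r"Final Ranking:\s*\n([A-D][ >=A-D]*)", text)
    ranking = re.findall(r"[A-D]", rank_match.group(1)) if rank_match else []
    scores = {m.group(1): int(m.group(2))
              for m in re.finditer(r"\b([A-D]):\s*(\d+)", text)}
    return ranking, scores
```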
## Bias, Risks, and Limitations
- Candidate count limitation: The model is trained primarily with 2–4 candidates per group. While you can try more candidates, ranking sensitivity and formatting adherence may degrade.
- Relative (within-group) scoring: Rewards are calibrated for within-group comparison; do not interpret them as globally comparable quality scores across different inputs.
- Cross-lingual robustness is not guaranteed: The model generalizes beyond Zh–En in experiments, but performance may degrade on low-resource languages, specialized domains, or unusual writing styles.
- Potential judge biases: As a learned evaluator, it may be influenced by fluency, style preferences, or prompt framing, and can still make mistakes on nuanced adequacy/terminology issues.
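One generic mitigation for ordering or position bias, not specific to this model or claimed by the paper, is to score the group under several candidate orderings and average the results. In this sketch, `score_fn` is a hypothetical wrapper that runs the reward model on one ordering and returns a score per candidate.

```python
import itertools
import statistics

def average_over_orderings(score_fn, candidates, max_perms=4):
    """Mitigate ordering bias by averaging each candidate's score
    over several permutations of the group."""
    perms = list(itertools.permutations(candidates))[:max_perms]
    totals = {c: [] for c in candidates}
    for perm in perms:
        for cand, score in score_fn(list(perm)).items():
            totals[cand].append(score)
    return {c: statistics.mean(v) for c, v in totals.items()}
```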
## How to Get Started with the Model
For high-throughput inference with vLLM and the official parsing script, please refer to our code repository.

Here is how to run the model with Hugging Face Transformers:

````python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_PATH = "double7/Qwen2.5-7B-GRRM"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, device_map="auto", torch_dtype="auto"
)

prompt = """Given a source text in English and multiple translation candidates in Chinese. Perform a step by step analysis and comparison of the translation quality for the candidates. Finally, rank and score the candidates with integer scores on a scale from 0 to 10.
Source text:
```
I have a frog in my throat.
```
Translation A:
```
我嗓子有点哑。
```
Translation B:
```
我嗓子眼里好像有只青蛙。
```
"""

# Use the tokenizer's chat template to format the message - see
# https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "user", "content": prompt},
]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Greedy decoding; the analysis section can be long, so leave ample headroom.
outputs = model.generate(**inputs, do_sample=False, max_new_tokens=2048)

# Decode only the newly generated tokens.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
````
### Output Example
An analysis and comparison of the translation candidates:
**Source Text:** "I have a frog in my throat."
This is an English idiomatic expression. Literally, it doesn't mean a frog is physically stuck in one's throat. It means the person has a tickle, lump, or irritation in their throat that affects their voice, often making them sound hoarse. It is used to describe a temporary condition.
**Translation A: 我嗓子有点哑。**
* **Faithfulness:** This translation is not literal but captures the *consequence* or *function* of the source idiom. "我嗓子有点哑" means "My voice is a little hoarse." This is a very common and natural way in Chinese to express the same situation as "I have a frog in my throat." It correctly conveys the practical meaning in this context.
* **Fluency:** This is a perfectly fluent and idiomatic Chinese expression that a native speaker would use in daily conversation.
* **Overall:** A good translation that prioritizes communicative effectiveness over literalism. It is highly acceptable.
**Translation B: 我嗓子眼里好像有只青蛙。**
* **Faithfulness:** This translation is very literal. "我嗓子眼里" means "in my throat," and "好像" ("as if" or "seemingly") is an attempt to translate the metaphorical nature of the source. The main problem is "有只青蛙" (there is a frog). This is a direct, word-for-word physical translation that completely misses the idiomatic meaning. It explains the phrase as if one was literally trying to describe a real frog in the throat.
* **Fluency:** This sentence is grammatically correct, but it sounds highly unnatural and awkward in Chinese. While hypothetically it could be understood literally by some as a form of sick and disturbing joke, it is not used to express hoarseness in any common, standard way.
* **Overall:** A poor translation because it fails to convey the correct idiomatic meaning and results in an unnatural, almost comical phrasing.
**Conclusion:**
* **Translation A** is the superior choice. It effectively communicates the intended meaning in a natural, idiomatic way that is common in Chinese.
* **Translation B** is a very literal, inaccurate, and awkward attempt that misses the point of the idiom and sounds incorrect.
### Final Ranking:
A > B
### Scores:
A: 8, B: 3
## Training Details

### Training Data
The model was trained on the Chinese–English subset of TowerBlocks, comprising approximately 18.8k samples (512 samples were held out for validation). For each source sentence, 2–4 translation candidates are sampled; labels are derived under GQM prompting using an external LLM judge (Gemini-2.5-Pro).
- SFT Stage: Annotated with Gemini-2.5-Pro under the GQM paradigm (comparative reasoning).
- RLVR Stage: Utilized the same underlying data, augmented by shuffling candidate permutations and subsampling from original groups to optimize ranking accuracy.
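The permutation-shuffling and subsampling augmentation could look roughly like this; the group representation is a hypothetical simplification, not the repository's actual data format.

```python
import random

def augment_group(candidates, scores, group_sizes=(2, 3, 4)):
    """Create one augmented training group: subsample 2-4 candidates,
    shuffle their order, reassign labels A-D, and re-derive the gold
    ranking for the new ordering."""
    k = random.choice([g for g in group_sizes if g <= len(candidates)])
    picked = random.sample(list(zip(candidates, scores)), k)
    random.shuffle(picked)
    relabeled = {label: cand for label, (cand, _) in zip("ABCD", picked)}
    gold_ranking = [label for label, (_, score) in
                    sorted(zip("ABCD", picked), key=lambda x: -x[1][1])]
    return relabeled, gold_ranking
```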
### Training Procedure
The model is trained in two stages:
- SFT (cold start): Teach the model to follow the groupwise judging instruction and output structured ranking and scores.
- RLVR (ranking optimization): Optimize for verifiable ranking accuracy within each group. The reward is computed from ground-truth orderings using a normalized pairwise agreement metric, plus an auxiliary score-consistency reward to encourage well-distributed scalar outputs.
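The normalized pairwise agreement reward can be illustrated with a Kendall-tau-style pairwise accuracy; this is an illustrative sketch, and the paper's exact formula may differ.

```python
from itertools import combinations

def pairwise_agreement(pred_ranking, gold_ranking):
    """Fraction of candidate pairs whose relative order in the predicted
    ranking matches the gold ranking; 1.0 = perfect, 0.0 = fully reversed."""
    gold_pos = {c: i for i, c in enumerate(gold_ranking)}
    pred_pos = {c: i for i, c in enumerate(pred_ranking)}
    pairs = list(combinations(gold_ranking, 2))
    agree = sum(
        (gold_pos[a] < gold_pos[b]) == (pred_pos[a] < pred_pos[b])
        for a, b in pairs
    )
    return agree / len(pairs)
```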
#### Training Hyperparameters
Hardware: 16 × NVIDIA A100 (80GB)
SFT (cold start):
- Epochs: 3
- Global batch size: 64
- LR scheduler: cosine
- Peak learning rate: 6e-6
- Warmup ratio: 0.1
RLVR (reward model optimization):
- RL algorithm: GSPO (Group Sequence Policy Optimization)
- Epochs: 1
- LR scheduler: cosine (min LR ratio = 0.2)
- Learning rate: 1e-5
- Total batch size: 512
- PPO mini-batch size: 128
- Rollouts per prompt: 8
- Max generation length: 4096 tokens
- KL penalty: disabled (no KL divergence penalty)
- Advantage normalization: standard deviation normalization removed
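The last setting means advantages during RLVR are computed by mean-centering only; a minimal sketch:

```python
def mean_centered_advantages(rewards):
    """GRPO-style advantages with std normalization removed: subtract
    only the group mean, as in the RLVR configuration above."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```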
## Citation
```bibtex
@article{yang2026grrmgrouprelativereward,
  title={GRRM: Group Relative Reward Modeling for Machine Translation},
  author={Sen Yang and Shanbo Cheng and Lu Xu and Jianbing Zhang and Shujian Huang},
  year={2026},
  eprint={2602.14028},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.14028},
}
```