mzhaoshuai
/

Llama-3-8B-Instruct-refalign

Text Generation

text-generation-inference

Model card Files Files and versions

RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

The training data is mzhaoshuai/Llama-3.3-70B-Inst-awq_ultrafeedback_1in3.

When conducting Reinforcement Learning with Similarity-based Rewards, the reward function is BERTScore.

Hyper-Parameters	Value
LR	2.5e-6
Batch Size	512
Epoch	1
Prompt Length	600
Generation Length	1200
Advantage CLIP	0.08
Sampled Generations (K)	2
BertScore Model	bart-large-mnli

Downloads last month: 3

Safetensors

Model size

8B params

Tensor type

BF16

·

Model tree for mzhaoshuai/Llama-3-8B-Instruct-refalign

Base model

meta-llama/Meta-Llama-3-8B-Instruct

Finetuned

(1049)

this model

Dataset used to train mzhaoshuai/Llama-3-8B-Instruct-refalign

Collection including mzhaoshuai/Llama-3-8B-Instruct-refalign

RefAlign: RL with Similarity-based Rewards

Datasets and models in: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data. • 19 items • Updated Oct 30, 2025 • 1

Paper for mzhaoshuai/Llama-3-8B-Instruct-refalign

Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data

Paper • 2504.09895 • Published Apr 14, 2025 • 1