Update README.md
README.md
CHANGED
@@ -100,5 +100,4 @@ Previous PRMs often suffer from two major flaws: they ignore historical evaluati
 GPRM utilizes a two-stage progressive training pipeline:
 
 1. **Stage I (Structured SFT):** Learns 4-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency) using Qwen3-235B-Instruct as teacher for annotation.
-2. **Stage II (GRPO Optimization):** Refines evaluation policy under complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
-3.
+2. **Stage II (GRPO Optimization):** Refines evaluation policy under complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
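
For context on the Stage II entry in this hunk: Group Relative Policy Optimization scores each sampled output against the other outputs drawn for the same prompt, rather than against a learned value function. The sketch below illustrates only that group-relative advantage step. It is not taken from the GPRM code; the function name `group_relative_advantages`, the `eps` parameter, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper).

    Each reward is centered on the group mean and scaled by the group
    standard deviation; eps guards against division by zero when every
    sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled evaluations of one step, rewarded 1.0 when the
# verdict matches the reference label and 0.0 otherwise (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat their group's average get a positive advantage and the rest a negative one, so a group in which every sample earns the same reward contributes no learning signal. That is one plausible reason to train on hard-mined samples, where outcomes within a group are more likely to differ.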