skylenage committed on
Commit 26efbcf · verified · 1 Parent(s): 3abe540

Update README.md

  1. README.md +1 -2
README.md CHANGED
@@ -100,5 +100,4 @@ Previous PRMs often suffer from two major flaws: they ignore historical evaluati
  GPRM utilizes a two-stage progressive training pipeline:
  
  1. **Stage I (Structured SFT):** Learns 4-dimensional diagnostic reasoning via targeted error injection (Calculation, Logic, Goal-drift, Inconsistency) using Qwen3-235B-Instruct as teacher for annotation.
- 2. **Stage II (GRPO Optimization):** Refines evaluation policy under complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
- 3.
+ 2. **Stage II (GRPO Optimization):** Refines evaluation policy under complete global context (History + Current + Future) using Group Relative Policy Optimization on hard-mined samples from PRM800K.
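
Reviewer note: Stage II names Group Relative Policy Optimization without showing what "group relative" means in practice. The sketch below illustrates the commonly described GRPO advantage step, in which each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. It is not taken from the GPRM code; the function name `group_relative_advantages`, the `eps` parameter, and the example rewards are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group (hypothetical helper).

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat their group's average get a positive advantage and the rest get
    a negative one. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four evaluations sampled for the same step
# (1.0 = judged correct, 0.0 = judged incorrect); values are made up.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and the group contributes no gradient signal, which is one plausible reason to train on hard-mined samples where outcomes within a group differ.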