Rubric-based RL vs Verifier-based RL
As scaling shifts from pre-training to reinforcement learning, two RL modes are emerging: rubric-based RL and verifier-based RL. They mainly differ in what “success” means and how the reward is generated:
- Rubric-based RL: reward is a score assigned against a rubric (often multi-criteria), typically produced by a judge (a human or a model acting as judge)
- Verifier-based RL: reward comes from an objective verifier: a pass/fail (or numeric) check of correctness against ground truth or hard constraints
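To make the contrast concrete, here is a minimal sketch of the two reward shapes. Everything in it is an assumption for illustration: the function names `rubric_reward` and `verifier_reward`, the criterion names, the weights, and the convention that the judge has already produced per-criterion scores in [0, 1]. Real systems differ in how judges are prompted and how verifiers are sandboxed.

```python
from typing import Callable, Mapping, Sequence, Tuple


def rubric_reward(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-criterion judge scores (each assumed to lie in [0, 1])."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def verifier_reward(
    candidate: Callable[[int], int], cases: Sequence[Tuple[int, int]]
) -> float:
    """Return 1.0 only if the candidate matches the expected output on every case."""
    try:
        return 1.0 if all(candidate(x) == expected for x, expected in cases) else 0.0
    except Exception:
        # A crashing candidate counts as a failure (hypothetical policy).
        return 0.0


# Hypothetical rubric: accuracy is weighted three times as heavily as clarity.
judge_scores = {"clarity": 1.0, "accuracy": 0.5}
criterion_weights = {"clarity": 1.0, "accuracy": 3.0}
soft_reward = rubric_reward(judge_scores, criterion_weights)

# Hypothetical ground truth: (input, expected output) pairs for a doubling function.
test_cases = [(1, 2), (3, 6)]
hard_reward = verifier_reward(lambda x: x * 2, test_cases)
```

The rubric reward is graded and depends on the judge's scoring, while the verifier reward is a hard check with no partial credit in this sketch.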
In practice, the best results usually come from combining both: use verifier-based RL for correctness-critical cores (e.g., code that must pass tests), then rubric-based RL or preference tuning for presentation, helpfulness, and safety.
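One way to picture this staging is a reward that switches on the training stage. The sketch below is an illustration under stated assumptions, not a description of any specific training recipe: the `Stage` enum and `staged_reward` function are hypothetical names, and gating the rubric stage on the verifier result is an added design choice that the text above does not prescribe.

```python
from enum import Enum


class Stage(Enum):
    VERIFIER = "verifier"  # first stage: optimize for correctness only
    RUBRIC = "rubric"      # later stage: optimize presentation, helpfulness, safety


def staged_reward(stage: Stage, passed: bool, rubric_score: float) -> float:
    """Pick the reward signal for the current stage.

    passed: result of an objective check (e.g., tests passing).
    rubric_score: judge score, assumed to lie in [0, 1].
    """
    if stage is Stage.VERIFIER:
        return 1.0 if passed else 0.0
    # Assumption: keep correctness as a gate during the rubric stage, so a
    # well-presented but incorrect output earns nothing.
    return rubric_score if passed else 0.0
```

For example, `staged_reward(Stage.RUBRIC, passed=False, rubric_score=0.9)` returns `0.0`, because in this sketch the rubric score only counts once the hard check has passed.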