<aside> 💡 Paper
</aside>
<aside> 📖 Dataset&Models
</aside>
<aside> 🔥 Twitter
</aside>
Comparison between previous automatic outcome annotation and our automatic process annotation. (a) Automatic outcome annotation assigns a label to the entire solution S based on the correctness of the final answer; (b) automatic process annotation employs a ‘completer’ to finalize N reasoning processes (N = 3 in this figure) from an intermediate step (s1 in this figure), and then uses hard estimation (HE) and soft estimation (SE) to annotate this step based on all decoded answers.
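A minimal sketch of how the HE/SE labels in (b) can be computed, assuming the final answer of each completion is compared to the gold answer by exact match; the `annotate_step` helper and its input format are illustrative, not code from the released pipeline.

```python
from typing import List


def annotate_step(completion_answers: List[str], gold_answer: str) -> dict:
    """Annotate one intermediate step from N completer rollouts.

    `completion_answers` holds the final answers decoded from the N
    reasoning processes the completer finalized starting at this step;
    `gold_answer` is the reference answer for the question.
    """
    correct = [ans == gold_answer for ans in completion_answers]

    # Hard estimation (HE): label the step 1 if *any* completion
    # starting from it reaches the correct answer, otherwise 0.
    hard_label = 1 if any(correct) else 0

    # Soft estimation (SE): score the step by the fraction of
    # completions that reach the correct answer.
    soft_label = sum(correct) / len(correct)

    return {"hard": hard_label, "soft": soft_label}


# Example: N = 3 completions for step s1, two of which reach the gold answer.
print(annotate_step(["72", "72", "64"], gold_answer="72"))
# -> {'hard': 1, 'soft': 0.6666666666666666}
```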
Performance of different LLMs on GSM8K and MATH with different verification strategies. The reward models are built on LLaMA2-70B for GSM8K and LLemma-34B for MATH, respectively. Verification is based on 256 sampled outputs.
Performance of LLaMA2-70B using different verification strategies across different numbers of solution candidates on GSM8K and MATH.
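A minimal sketch of the verification strategies being compared (self-consistency, best-of-N with a reward model, and reward-weighted voting), assuming each candidate solution has already been scored by the reward model; the `verify` helper and the candidate format are assumptions for illustration, not the released evaluation code.

```python
from collections import defaultdict
from typing import List


def verify(candidates: List[dict], strategy: str = "reward_weighted") -> str:
    """Pick a final answer from sampled solution candidates.

    Each candidate is a dict like {"answer": str, "score": float}, where
    "score" is the (outcome or process) reward-model score of the full
    solution.
    """
    if strategy == "self_consistency":
        # Majority vote over final answers, ignoring reward scores.
        votes = defaultdict(int)
        for c in candidates:
            votes[c["answer"]] += 1
        return max(votes, key=votes.get)

    if strategy == "best_of_n":
        # Return the answer of the single highest-scoring solution.
        return max(candidates, key=lambda c: c["score"])["answer"]

    if strategy == "reward_weighted":
        # Self-consistency weighted by reward-model scores: sum the
        # scores of all solutions sharing the same final answer.
        weights = defaultdict(float)
        for c in candidates:
            weights[c["answer"]] += c["score"]
        return max(weights, key=weights.get)

    raise ValueError(f"unknown strategy: {strategy}")


# Example with 3 candidates instead of 256.
cands = [
    {"answer": "72", "score": 0.9},
    {"answer": "72", "score": 0.8},
    {"answer": "64", "score": 0.95},
]
print(verify(cands, "self_consistency"))  # "72" (2 votes vs. 1)
print(verify(cands, "best_of_n"))         # "64" (single highest score, 0.95)
print(verify(cands, "reward_weighted"))   # "72" (total score 1.7 vs. 0.95)
```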
Performance of different 7B models on GSM8K and MATH with greedy decoding. We use the questions in MetaMATH for RFT and PPO training. Both LLaMA2-7B and Mistral-7B are supervised by Mistral-7B-ORM and Mistral-7B-Math-Shepherd.
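A minimal sketch of the RFT (rejection-sampling fine-tuning) data-construction step referenced above, following the standard recipe rather than the exact released pipeline; `sample_solutions` and `extract_answer` are hypothetical helpers, and the sampling count is an assumption.

```python
from typing import Callable, Iterable, List


def build_rft_dataset(
    questions: Iterable[dict],
    sample_solutions: Callable[[str, int], List[str]],
    extract_answer: Callable[[str], str],
    n_samples: int = 8,
) -> List[dict]:
    """Build a rejection-sampling fine-tuning (RFT) set.

    For each MetaMATH-style question, sample `n_samples` solutions from
    the policy model, keep only those whose final answer matches the
    reference, and de-duplicate identical reasoning paths. The resulting
    (question, solution) pairs are then used for supervised fine-tuning.
    """
    dataset = []
    for q in questions:
        solutions = sample_solutions(q["question"], n_samples)
        kept = {
            sol
            for sol in solutions
            if extract_answer(sol) == q["answer"]  # keep correct solutions only
        }
        dataset.extend({"question": q["question"], "solution": s} for s in kept)
    return dataset
```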
Results of combining reinforcement learning and verification. The reward models are built on Mistral-7B. Verification is based on 256 sampled outputs.
Comparison between NLI- and rule-based automatic process annotation methods and our method.