<aside> 💡 Paper
</aside>
<aside> 📖 Dataset&Models
</aside>
<aside> 🔥 Twitter
</aside>
Comparison between previous automatic outcome annotation and our automatic process annotation. (a) Automatic outcome annotation assigns a label to the entire solution S based on the correctness of the final answer; (b) automatic process annotation employs a ‘completer’ to finalize N reasoning processes (N = 3 in this figure) from an intermediate step (s1 in this figure), and then uses hard estimation (HE) and soft estimation (SE) to annotate this step based on all decoded answers.
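A minimal sketch of how the HE/SE labels in (b) can be computed, assuming the final answer of each completion is compared to the gold answer by exact match; the `annotate_step` helper and its input format are illustrative, not code from the released pipeline.

```python
from typing import List


def annotate_step(completion_answers: List[str], gold_answer: str) -> dict:
    """Annotate one intermediate step from N completer rollouts.

    `completion_answers` holds the final answers decoded from the N
    reasoning processes the completer finalized starting at this step;
    `gold_answer` is the reference answer for the question.
    """
    correct = [ans == gold_answer for ans in completion_answers]

    # Hard estimation (HE): label the step 1 if *any* completion
    # starting from it reaches the correct answer, otherwise 0.
    hard_label = 1 if any(correct) else 0

    # Soft estimation (SE): score the step by the fraction of
    # completions that reach the correct answer.
    soft_label = sum(correct) / len(correct)

    return {"hard": hard_label, "soft": soft_label}


# Example: N = 3 completions for step s1, two of which reach the gold answer.
print(annotate_step(["72", "72", "64"], gold_answer="72"))
# -> {'hard': 1, 'soft': 0.6666666666666666}
```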
Performance of different LLMs on GSM8K and MATH with different verification strategies. The reward models are built on LLaMA2-70B for GSM8K and LLemma-34B for MATH, respectively. Verification is based on 256 sampled outputs.
Performance of LLaMA2-70B using different verification strategies across different numbers of solution candidates on GSM8K and MATH.
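A minimal sketch of the verification strategies being compared (self-consistency, best-of-N with a reward model, and reward-weighted voting), assuming each candidate solution has already been scored by the reward model; the `verify` helper and the candidate format are assumptions for illustration, not the released evaluation code.

```python
from collections import defaultdict
from typing import List


def verify(candidates: List[dict], strategy: str = "reward_weighted") -> str:
    """Pick a final answer from sampled solution candidates.

    Each candidate is a dict like {"answer": str, "score": float}, where
    "score" is the (outcome or process) reward-model score of the full
    solution.
    """
    if strategy == "self_consistency":
        # Majority vote over final answers, ignoring reward scores.
        votes = defaultdict(int)
        for c in candidates:
            votes[c["answer"]] += 1
        return max(votes, key=votes.get)

    if strategy == "best_of_n":
        # Return the answer of the single highest-scoring solution.
        return max(candidates, key=lambda c: c["score"])["answer"]

    if strategy == "reward_weighted":
        # Self-consistency weighted by reward-model scores: sum the
        # scores of all solutions sharing the same final answer.
        weights = defaultdict(float)
        for c in candidates:
            weights[c["answer"]] += c["score"]
        return max(weights, key=weights.get)

    raise ValueError(f"unknown strategy: {strategy}")


# Example with 3 candidates instead of 256.
cands = [
    {"answer": "72", "score": 0.9},
    {"answer": "72", "score": 0.8},
    {"answer": "64", "score": 0.95},
]
print(verify(cands, "self_consistency"))  # "72" (2 votes vs. 1)
print(verify(cands, "best_of_n"))         # "64" (single highest score, 0.95)
print(verify(cands, "reward_weighted"))   # "72" (total score 1.7 vs. 0.95)
```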
Performance of different 7B models on GSM8K and MATH with greedy decoding. We use the questions in MetaMATH for RFT and PPO training. Both LLaMA2-7B and Mistral-7B are supervised by Mistral-7B-ORM and Mistral-7B-Math-Shepherd.
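A minimal sketch of the RFT (rejection-sampling fine-tuning) data-construction step referenced above, following the standard recipe rather than the exact released pipeline; `sample_solutions` and `extract_answer` are hypothetical helpers, and the sampling count is an assumption.

```python
from typing import Callable, Iterable, List


def build_rft_dataset(
    questions: Iterable[dict],
    sample_solutions: Callable[[str, int], List[str]],
    extract_answer: Callable[[str], str],
    n_samples: int = 8,
) -> List[dict]:
    """Build a rejection-sampling fine-tuning (RFT) set.

    For each MetaMATH-style question, sample `n_samples` solutions from
    the policy model, keep only those whose final answer matches the
    reference, and de-duplicate identical reasoning paths. The resulting
    (question, solution) pairs are then used for supervised fine-tuning.
    """
    dataset = []
    for q in questions:
        solutions = sample_solutions(q["question"], n_samples)
        kept = {
            sol
            for sol in solutions
            if extract_answer(sol) == q["answer"]  # keep correct solutions only
        }
        dataset.extend({"question": q["question"], "solution": s} for s in kept)
    return dataset
```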
Results of combining reinforcement learning and verification. The reward models are built on Mistral-7B. Verification is based on 256 sampled outputs.
Comparison between NLI- and rule-based automatic process annotation methods and our method.