arxiv:2509.05542

DreamPRM-1.5: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

Published on Sep 5, 2025

Authors:

Abstract

DreamPRM-1.5, an instance-reweighted framework using bi-level optimization, improves multimodal process reward model training by addressing distribution shifts and noisy data, achieving 84.6% accuracy on the MMMU benchmark.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Training multimodal process reward models (PRMs) is challenged by distribution shifts and noisy data. We introduce DreamPRM-1.5, an instance-reweighted framework that adaptively adjusts the importance of each training example via bi-level optimization. We design two complementary strategies: Instance Table, effective for smaller datasets, and Instance Net, scalable to larger ones. Integrated into test-time scaling, DreamPRM-1.5 achieves 84.6% accuracy on the MMMU benchmark, surpassing GPT-5.
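
The abstract does not spell out the update rules, so the following is only a minimal sketch of what instance reweighting via bi-level optimization could look like in its simplest form. It assumes a per-example table of learnable logits (a toy stand-in for the Instance Table idea), a scalar linear model in place of a PRM, and a one-step lookahead approximation for the upper-level gradient. All names (`train_instance_table`, `per_instance_grads`, the toy data, and the learning rates) are hypothetical and not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def per_instance_grads(theta, data):
    # Gradient of 0.5 * (theta * x - y)^2 with respect to theta, per example.
    return [(theta * x - y) * x for x, y in data]


def train_instance_table(train, meta, steps=200, lr=0.02, meta_lr=1.0):
    theta = 0.0                  # toy model: predict y = theta * x
    table = [0.0] * len(train)   # assumed "instance table": one logit per example
    for _ in range(steps):
        grads = per_instance_grads(theta, train)
        weights = [sigmoid(a) for a in table]

        # Lower level (lookahead): one weighted gradient step on the training set.
        theta_look = theta - lr * sum(w * g for w, g in zip(weights, grads))

        # Upper level: gradient of the mean meta loss at the lookahead parameters.
        meta_grads = per_instance_grads(theta_look, meta)
        meta_grad = sum(meta_grads) / len(meta_grads)

        # Chain rule through the lookahead step:
        # dL_meta/da_i = meta_grad * (-lr * g_i) * w_i * (1 - w_i)
        for i, (w, g) in enumerate(zip(weights, grads)):
            table[i] -= meta_lr * meta_grad * (-lr * g) * w * (1.0 - w)

        # Apply the actual model update with the refreshed weights.
        weights = [sigmoid(a) for a in table]
        theta -= lr * sum(w * g for w, g in zip(weights, grads))
    return theta, [sigmoid(a) for a in table]


# Toy data: the true relation is y = 2x; the last two training labels are corrupted.
clean = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0, -1.0, -2.0)]
noisy = [(x, -2.0 * x) for x in (1.0, 2.0)]
train = clean + noisy
meta = [(1.5, 3.0), (-2.5, -5.0)]  # small trusted set that drives the upper level

theta, weights = train_instance_table(train, meta)
print("theta:", round(theta, 3))
print("weights:", [round(w, 3) for w in weights])
```

In this toy setting the corrupted examples pull the model away from the trusted meta set, so their weights are pushed down while the clean examples are weighted up. A network that maps each example to a weight, in the spirit of Instance Net, would replace the table lookup with a learned function; that variant is not shown here.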
