RPC-Bench

A Fine-grained Benchmark for Research Paper Comprehension

Introduction

Understanding research papers remains challenging for foundation models due to specialized scientific discourse and complex figures and tables, yet existing benchmarks offer limited fine-grained evaluation at scale. To address this gap, we introduce RPC-Bench, a large-scale question-answering benchmark built from review–rebuttal exchanges of high-quality computer science papers, containing 15K human-verified QA pairs. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts. We further design an LLM–human interactive annotation framework to support large-scale labeling and quality control. Following the LLM-as-a-Judge paradigm, we develop a scalable evaluation framework that scores models on correctness, completeness, and conciseness, with high agreement with human judgment. Experiments reveal that even the strongest model (GPT-5) achieves only 68.2% on correctness–completeness (F1-like), dropping to 37.46% after conciseness adjustment, highlighting substantial gaps in precise academic paper understanding.

Leaderboard

Evaluation results on RPC-Bench.

| # | Model | Organization | Input Config | Date | Concise. (%) | Correct. (%) | Complete. (%) | F1-like (%) | Info. (%) |
|---|-------|--------------|--------------|------|--------------|--------------|---------------|-------------|-----------|
| 1 | GPT-5 | OpenAI | TEXT | 2025-8-7 | 54.93 | 69.10 | 67.33 | 68.20 | 37.46 |
| 2 | GPT-5.2 | OpenAI | TEXT | 2025-12-11 | 53.81 | 66.84 | 64.03 | 65.40 | 35.19 |
| 3 | GPT-5 | OpenAI | VISUAL | 2025-8-7 | 61.47 | 58.90 | 55.34 | 57.07 | 35.08 |
| 4 | Gemini-2.5-Pro | Google | TEXT | 2025-3-25 | 54.87 | 62.65 | 59.03 | 60.79 | 33.35 |
| 5 | Gemini-3-Pro | Google | TEXT | 2025-11-18 | 52.81 | 62.69 | 60.28 | 61.46 | 32.46 |
| 6 | DeepSeek-V3.2 | DeepSeek-AI | TEXT | 2025-12-1 | 56.31 | 58.73 | 55.19 | 56.91 | 32.04 |
| 7 | GPT-5.2 | OpenAI | VISUAL | 2025-12-11 | 56.43 | 56.75 | 52.82 | 54.72 | 30.88 |
| 8 | DeepSeek-V3.1 | DeepSeek-AI | TEXT | 2025-8-21 | 54.76 | 57.85 | 54.85 | 56.31 | 30.84 |
| 9 | GLM-4.6V | Z.ai | VISUAL | 2025-12-8 | 64.55 | 47.32 | 43.43 | 45.29 | 29.23 |
| 10 | GLM-4.7 | Z.ai | TEXT | 2025-12-22 | 54.34 | 54.36 | 51.75 | 53.02 | 28.81 |
| 11 | GLM-4.5V | Z.ai | VISUAL | 2025-8-11 | 59.44 | 48.79 | 43.62 | 46.06 | 27.38 |
| 12 | Gemini-3-Pro | Google | VISUAL | 2025-11-18 | 50.22 | 56.06 | 52.69 | 54.32 | 27.28 |
| 13 | GLM-4.5 | Z.ai | TEXT | 2025-7-28 | 43.41 | 58.95 | 59.54 | 59.24 | 25.72 |
| 14 | Gemini-2.5-Pro | Google | VISUAL | 2025-3-25 | 51.71 | 48.39 | 45.59 | 46.95 | 24.28 |
| 15 | Claude-Sonnet-4 | Anthropic | TEXT | 2025-5-23 | 41.37 | 58.53 | 58.44 | 58.48 | 24.19 |
| 16 | Qwen3 | Alibaba | TEXT | 2025-7-21 | 41.44 | 55.88 | 56.64 | 56.26 | 23.31 |
| 17 | Claude-Sonnet-4.5 | Anthropic | TEXT | 2025-9-30 | 31.02 | 64.31 | 64.97 | 64.64 | 20.05 |
| 18 | Claude-Sonnet-4.5 | Anthropic | VISUAL | 2025-9-30 | 31.95 | 55.35 | 54.45 | 54.89 | 17.54 |
| 19 | Claude-Sonnet-4 | Anthropic | VISUAL | 2025-5-23 | 31.63 | 54.16 | 53.32 | 53.74 | 16.99 |
| 20 | HippoRAG2 | The Ohio State University | TEXT | 2025-6-19 | 45.77 | 33.13 | 27.88 | 30.28 | 13.86 |
| 21 | MemoRAG | Peking University & Hong Kong Polytechnic University | TEXT | 2025-4-9 | 51.31 | 24.19 | 19.10 | 21.35 | 10.96 |
| 22 | VdocRAG | NTT Corporation & Tohoku University | VISUAL | 2025-4-14 | 61.54 | 21.17 | 13.88 | 16.77 | 10.32 |
| 23 | VisRAG | Tsinghua University & ModelBest Inc. | VISUAL | 2025-3-2 | 39.90 | 26.24 | 23.63 | 24.87 | 9.92 |
| 24 | Raptor | Stanford University | TEXT | 2024-1-31 | 36.47 | 25.28 | 20.82 | 22.84 | 8.33 |
| 25 | Monkey | Huazhong University of Science and Technology | VISUAL | 2024-8-26 | 54.61 | 17.08 | 11.27 | 13.58 | 7.41 |
| 26 | Docopilot | Shanghai AI Laboratory | VISUAL | 2025-7-19 | 39.31 | 18.31 | 17.12 | 17.69 | 6.96 |
| 27 | Qwen3 | Alibaba | VISUAL | 2025-7-21 | 22.64 | 20.17 | 20.14 | 20.16 | 4.56 |
| 28 | DocOwl2 | Alibaba | VISUAL | 2024-9-9 | 50.19 | 11.75 | 6.66 | 8.50 | 4.27 |

Green dates indicate newly added or updated models.

RPC-Bench

Comparison with relevant research paper benchmarks

Compared to existing benchmarks, RPC-Bench is designed to evaluate in-depth paper comprehension under realistic settings. Conc.=Conciseness; Corr.=Correctness; F1-like is defined as the harmonic mean of correctness and completeness; inp.=input. "Eval. Metrics" are LLM-based metrics.

Construction Pipeline

The overall framework for benchmark construction. We crawl papers and review–rebuttal pairs from OpenReview and apply impact-aware sampling to balance quality and mitigate bias. Review–rebuttal threads are segmented into comment–response units with GPT-4o and rewritten into QA pairs using GLM-4-Plus and DeepSeek-V3. Low-quality QA items are discarded before iterative human annotation and review.
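For concreteness, a minimal sketch of this pipeline is shown below. All names and signatures are illustrative placeholders (the sampling, segmentation, rewriting, and filtering steps are passed in as callables); this is not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class QAPair:
    paper_id: str
    question: str
    answer: str

def build_qa_pool(
    papers: Iterable[dict],               # crawled OpenReview papers with review-rebuttal threads
    sample: Callable[[list], list],       # impact-aware sampling over crawled papers
    segment: Callable[[str], list[str]],  # GPT-4o step: thread -> comment-response units
    rewrite: Callable[[str], dict],       # GLM-4-Plus / DeepSeek-V3 step: unit -> {"question", "answer"}
    keep: Callable[[dict], bool],         # quality filter; low-quality QA items are discarded
) -> list[QAPair]:
    """Produce candidate QA pairs that then go to iterative human annotation and review."""
    pool: list[QAPair] = []
    for paper in sample(list(papers)):
        for unit in segment(paper["thread"]):
            qa = rewrite(unit)
            if keep(qa):
                pool.append(QAPair(paper["id"], qa["question"], qa["answer"]))
    return pool
```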

Taxonomy Design

We design a taxonomy aligned with the natural research flow of academic papers. It begins with what-questions, which focus on clarifying fundamental concepts and contextual background. It then advances to how-questions, which probe the mechanics of methods and experimental setups. Finally, it deepens into why-questions, which examine the underlying motivations of methods and the reasoning behind results. Labels of the form [What-4.27%] indicate each question type and its percentage of QA pairs.
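The three top-level question types can be summarized as a small lookup; the descriptions below simply paraphrase the taxonomy above, and the finer-grained sub-categories of the full taxonomy are not reproduced here.

```python
# Top-level question types of the RPC-Bench taxonomy (sub-categories omitted).
QUESTION_TYPES = {
    "what": "clarify fundamental concepts and contextual background",
    "how":  "probe the mechanics of methods and experimental setups",
    "why":  "examine the motivations of methods and the reasoning behind results",
}
```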

Basic Statistics of RPC-Bench

A/M Q: average/max question length. A/M A: average/max answer length. Lengths are measured in words.

Domain distribution of RPC-Bench. ML: Machine Learning; CV: Computer Vision; NLP: Natural Language Processing; RL: Reinforcement Learning.

LLM-as-Judge Evaluation Framework

We empirically validate an evaluation framework that is highly consistent with human assessment: judges are supplied with sufficient task context (title and abstract), evaluate each dimension independently, and the two models exhibiting the strongest agreement with human assessments are jointly employed to reduce single-judge bias.
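A minimal sketch of this dual-judge protocol is shown below. The judge callables, their prompts, and the averaging rule are assumptions for illustration; the benchmark's actual judge prompts and aggregation are not reproduced here.

```python
from typing import Callable

DIMENSIONS = ("conciseness", "correctness", "completeness")  # each rated on a 0-5 scale

# A judge maps (context, question, answer, dimension) to a 0-5 score.
Judge = Callable[[str, str, str, str], float]

def score_answer(context: str, question: str, answer: str, judges: list[Judge]) -> dict[str, float]:
    """Score one answer with the two judges most consistent with humans, one dimension at a time."""
    scores: dict[str, list[float]] = {dim: [] for dim in DIMENSIONS}
    for judge in judges:
        for dim in DIMENSIONS:  # each dimension is evaluated independently
            # context carries the task background, e.g., the paper's title and abstract
            scores[dim].append(judge(context, question, answer, dim))
    # Aggregate across judges (e.g., by averaging) to reduce single-judge bias.
    return {dim: sum(vals) / len(vals) for dim, vals in scores.items()}
```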

Evaluation Protocol

For binary classification with clear ground-truth labels, we use accuracy as the primary metric. For open-ended QA, traditional automatic metrics (e.g., BLEU, BERTScore) often fail to capture answer quality, since many semantically equivalent responses exist. Following recent work on LLM-as-a-Judge, we adopt an LLM-based scoring scheme that evaluates each answer along three dimensions: conciseness (brevity without irrelevant content), correctness (accuracy and fidelity, akin to precision), and completeness (coverage of essential content, akin to recall). Each dimension is rated on a 0–5 scale. We also compute two derived metrics: an F1-like score (the harmonic mean of correctness and completeness) and informativeness, which aggregates all three dimensions to penalize verbose and repetitive outputs.

\[
\mathrm{F}_{\beta} = \frac{(1+\beta^{2})\cdot \mathrm{Correctness}\cdot \mathrm{Completeness}}{\beta^{2}\cdot \mathrm{Correctness} + \mathrm{Completeness}},
\qquad
\mathrm{Informativeness} = \mathrm{Conciseness}\cdot \mathrm{F}_{\beta},
\]

where β controls the weight between correctness and completeness (β = 1 by default, reducing F_β to their harmonic mean). This captures the F1-like balance of correctness and completeness, with conciseness penalizing verbosity.
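A small numeric sanity check of these definitions, assuming all scores are expressed as percentages as in the leaderboard; the GPT-5 (TEXT) row reproduces the reported F1-like and informativeness values.

```python
def f_beta(correctness: float, completeness: float, beta: float = 1.0) -> float:
    """Beta-weighted harmonic mean of correctness and completeness (inputs in percent)."""
    return (1 + beta**2) * correctness * completeness / (beta**2 * correctness + completeness)

def informativeness(conciseness: float, correctness: float, completeness: float,
                    beta: float = 1.0) -> float:
    """Conciseness-weighted F-score, with all scores in percent."""
    return conciseness / 100.0 * f_beta(correctness, completeness, beta)

# GPT-5 (TEXT) leaderboard row: Concise. 54.93, Correct. 69.10, Complete. 67.33
print(f"{f_beta(69.10, 67.33):.2f}")                  # 68.20 -> reported F1-like
print(f"{informativeness(54.93, 69.10, 67.33):.2f}")  # 37.46 -> reported Info.
```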

Experimental Results

Main Results

Evaluation results of free-form QA on the test set. R-L=ROUGE-L; BERTS.=BERTScore; Concise.=Conciseness; Correct.=Correctness; Complete.=Completeness; Info.=Informativeness. The best results are highlighted in bold, and the second-best results are underlined.

Performance across Task Categories

Comparison of LLMs and VLMs on open-ended question answering (Conciseness and F1-like score).

The performance of all models on claim verification tasks (ACC).

Common Failure Modes

Example 1 (Degenerative Output Patterns): The model’s decoding collapses into uninformative content, highlighting the importance of long‑form generation tasks to stress‑test stability.

Example 2 (Necessity of Multimodal Grounding): A text‑only model extracts conclusions from text, while a multimodal model grounds claims in visual evidence (e.g., watermarks in Figure 10), revealing cross‑modal reasoning capabilities beyond text‑only evaluation.

Example 3 (Hallucination): The model sometimes wrongly denies information that is actually present in the source document, highlighting the need for tasks that test precise data extraction and catch such factual-verification errors.

Example 4 (Precise Output Failures): Despite the prompt explicitly constraining the output format to strict booleans (True/False), both models violate this requirement: one returns a self‑contradictory invalid answer, while the other appends extraneous characters.