WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora Paper • 2602.02053 • Published 1 day ago • 40
Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles Paper • 2602.01590 • Published 2 days ago • 32
FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents Paper • 2602.01566 • Published 2 days ago • 42
DeepResearch Bench II: Diagnosing Deep Research Agents via Rubrics from Expert Report Paper • 2601.08536 • Published 22 days ago • 3
MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools Paper • 2509.09734 • Published Sep 10, 2025 • 16