SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks • Paper • 2602.12670 • Published 15 days ago • 52
mlfoundations-dev/tulu-3-sft-personas-algebra-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 9.95k • 20
mlfoundations-dev/tulu-3-sft-personas-math-grade-filtered-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 9.29k • 9
mlfoundations-dev/wizardlm_orca-evol-instruct-110k-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 10k • 16 • 1
mlfoundations-dev/magicoder-evol-instruct-110k-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 9.98k • 12 • 1
mlfoundations-dev/stackexchange-codereview-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 9.99k • 6
mlfoundations-dev/glaive-code-assistant-sandboxes-traces-terminus-2 Viewer • Updated Oct 4, 2025 • 8.51k • 19