New Benchmark Dataset

#2
by burtenshaw - opened

Are you maintaining an evaluation benchmark and would like it to be included in the eval results shortlist, so that reported results appear as a leaderboard?


⭐️ Comment with a link to your dataset repo and to sources using the benchmark.

We're not sure what the specific requirements are for benchmarks to be included, but we'd like to have this functionality for these language-specific benchmarks we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.

Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et

Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et

I'm not completely sure yet how to port the configs from the LM Evaluation Harness to eval.yaml, though.

Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.

Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/

We've added the eval.yaml file and would like to be included in the shortlist.

OpenEvals org

Hey @yimingliang! Everything looks great; we'll add you to the shortlist and all should be set. Very impressive work on the evals! Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?

OpenEvals org

Hey @adorkin! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but it will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :)
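For what it's worth, the row transformation described above could be sketched roughly like this. The column names (`sentence`, `option1`, `option2`, `answer`) follow the original English WinoGrande schema and are an assumption here; check them against the actual tartuNLP/winogrande_et dataset before using:

```python
# Sketch of the suggested schema change: map a WinoGrande-style row
# (answer "1"/"2", two separate option columns) to a multichoice format
# (answer "A"/"B", one list-valued choices column).
# Column names are assumed, not taken from the actual dataset.

def to_multichoice(row: dict) -> dict:
    """Convert one WinoGrande-style row to the suggested eval format."""
    return {
        "sentence": row["sentence"],
        # merge the two option columns into a single list of choices
        "choices": [row["option1"], row["option2"]],
        # relabel "1" -> "A", "2" -> "B"
        "answer": "AB"[int(row["answer"]) - 1],
    }

row = {
    "sentence": "The trophy didn't fit in the case because _ was too big.",
    "option1": "the trophy",
    "option2": "the case",
    "answer": "1",
}
print(to_multichoice(row))
# {'sentence': ..., 'choices': ['the trophy', 'the case'], 'answer': 'A'}
```

With the Hugging Face `datasets` library, the same function could be applied to the whole split via `dataset.map(to_multichoice, remove_columns=["option1", "option2"])`.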


@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? It works well as a multiple-choice problem, but the formulation is a bit non-standard, because you're filling a gap rather than answering a question.

OpenEvals org

@adorkin Yes, you can set the prompt in the yaml file, like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml — using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.

@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml

📋 New Benchmark: FINAL Bench — Functional Metacognitive Reasoning

Dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
(Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang — currently under review)

Blog: https://huggingface.co/blog/FINAL-Bench/metacognitive

Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard

What it measures

FINAL Bench is the first benchmark for evaluating functional metacognition in LLMs — the ability to detect and correct one's own reasoning errors. Unlike MMLU/GPQA, which measure final-answer accuracy, FINAL Bench asks: "What did you do when you got it wrong?"

Key specs

  • 100 tasks | 15 domains | 8 TICOS metacognitive types | 3 difficulty grades
  • 5-axis rubric: MA (Metacognitive Accuracy), ER (Error Recovery), FA (Factual Accuracy), CO (Coherence), SP (Specificity)
  • Hidden cognitive traps (confirmation bias, anchoring, base-rate neglect) embedded in every task
  • 9 SOTA models evaluated: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, etc.
  • DOI: 10.57967/hf/7873

eval.yaml

eval.yaml has been added to the dataset repo.

We would love to be included in the benchmark shortlist! 🚀
