New Benchmark Dataset

#2
by burtenshaw - opened

Are you maintaining an evaluation benchmark and would like it to be included in the eval results shortlist, so that reported results appear as a leaderboard?


⭐️ Comment with a link to your dataset repo and to sources using the benchmark.

We're not sure what the specific requirements are for benchmarks to be included, but we'd like to have this functionality for these language-specific benchmarks we've built. They're quite recent, so we don't have many sources yet beyond our own benchmarking efforts and EuroEval.

Manually translated and culturally adapted IFEval for Estonian.
https://huggingface.co/datasets/tartuNLP/ifeval_et

Manually translated and culturally adapted WinoGrande for Estonian.
https://huggingface.co/datasets/tartuNLP/winogrande_et

I'm not completely sure yet how to port the configs from the LM Evaluation Harness to eval.yaml, though.

Hi, we maintain Encyclo-K, a benchmark for evaluating LLMs with dynamically composed knowledge statements.

Dataset: https://huggingface.co/datasets/m-a-p/Encyclo-K
Paper: https://arxiv.org/abs/2512.24867
Leaderboard: https://encyclo-k.github.io/

We've added the eval.yaml file and would like to be included in the shortlist.

OpenEvals org

Hey @yimingliang! Everything looks great; we'll add you to the shortlist and all should be set. Very impressive work on the evals! Do you think it would be possible to open PRs on the models you evaluated with the results from your leaderboard?

OpenEvals org

Hey @adorkin! Thanks for reaching out. IFEval would require custom code to run; this feature is not available yet, but it will be in the future. For WinoGrande, you could absolutely make an eval.yaml file and turn it into a benchmark. You would need a small modification, though: the answer field should be either A or B instead of 1 or 2, and instead of having two columns for the choices, it would be easier to use one column with a list of choices. Then your benchmark would simply be a multichoice benchmark :)
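For what it's worth, the row transformation described above could be sketched roughly like this. The column names (`sentence`, `option1`, `option2`, `answer`) follow the original English WinoGrande schema and are an assumption here; check them against the actual tartuNLP/winogrande_et dataset before using:

```python
# Sketch of the suggested schema change: map a WinoGrande-style row
# (answer "1"/"2", two separate option columns) to a multichoice format
# (answer "A"/"B", one list-valued choices column).
# Column names are assumed, not taken from the actual dataset.

def to_multichoice(row: dict) -> dict:
    """Convert one WinoGrande-style row to the suggested eval format."""
    return {
        "sentence": row["sentence"],
        # merge the two option columns into a single list of choices
        "choices": [row["option1"], row["option2"]],
        # relabel "1" -> "A", "2" -> "B"
        "answer": "AB"[int(row["answer"]) - 1],
    }

row = {
    "sentence": "The trophy didn't fit in the case because _ was too big.",
    "option1": "the trophy",
    "option2": "the case",
    "answer": "1",
}
print(to_multichoice(row))
# {'sentence': ..., 'choices': ['the trophy', 'the case'], 'answer': 'A'}
```

With the Hugging Face `datasets` library, the same function could be applied to the whole split via `dataset.map(to_multichoice, remove_columns=["option1", "option2"])`.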


@SaylorTwift I see, thanks! Is the yaml expected to contain the prompt itself? It works well as a multiple-choice problem, but the formulation is a bit non-standard, because you're filling a gap rather than answering a question.

OpenEvals org

@adorkin Yes, you can set the prompt in the yaml file, like so: https://huggingface.co/datasets/cais/hle/blob/main/eval.yaml — using the multiple_choice solver instead of the system prompt. Here are the docs from Inspect.

@SaylorTwift I've added the eval.yaml and a custom dataset config to work with it. The dataset viewer seems to be stuck now which may or may not be related.
https://huggingface.co/datasets/tartuNLP/winogrande_et/blob/main/eval.yaml

📋 New Benchmark: FINAL Bench — Functional Metacognitive Reasoning

Dataset: https://huggingface.co/datasets/FINAL-Bench/Metacognitive

Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in Large Language Models
(Taebong Kim, Minsik Kim, Sunyoung Choi, Jaewon Jang — currently under review)

Blog: https://huggingface.co/blog/FINAL-Bench/metacognitive

Leaderboard: https://huggingface.co/spaces/FINAL-Bench/Leaderboard

What it measures

FINAL Bench is the first benchmark for evaluating functional metacognition in LLMs — the ability to detect and correct one's own reasoning errors. Unlike MMLU/GPQA, which measure final-answer accuracy, FINAL Bench asks: "What did you do when you got it wrong?"

Key specs

  • 100 tasks | 15 domains | 8 TICOS metacognitive types | 3 difficulty grades
  • 5-axis rubric: MA (Metacognitive Accuracy), ER (Error Recovery), FA (Factual Accuracy), CO (Coherence), SP (Specificity)
  • Hidden cognitive traps (confirmation bias, anchoring, base-rate neglect) embedded in every task
  • 9 SOTA models evaluated: GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, etc.
  • DOI: 10.57967/hf/7873

eval.yaml

eval.yaml has been added to the dataset repo.

We would love to be included in the benchmark shortlist! 🚀
