---
title: ragbench-rag-eval
emoji: "📊"
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# RAGBench RAG Evaluation Project

This project evaluates a RAG system on the RAGBench dataset across 5 domains:
Biomedical, General Knowledge, Legal, Customer Support, and Finance.


## 1. Setup (local, no Docker)

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

Copy `.env.example` to `.env` and fill in:

- `HF_TOKEN` (if using Hugging Face models)
- `GROQ_API_KEY` (if using Groq)
- `RAGBENCH_LLM_PROVIDER`: either `groq` or `hf`
- `RAGBENCH_GEN_MODEL`
- `RAGBENCH_JUDGE_MODEL`
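
Before running anything, it can help to confirm that the settings are actually visible to the process. The following is a minimal sketch, not part of the project: it assumes the values from `.env` have already been exported into the environment, and it assumes that `GROQ_API_KEY` is needed only when the provider is `groq` and `HF_TOKEN` only when it is `hf`. The function name `missing_settings` is hypothetical.

```python
import os

# Variable names come from the list above; the validation rules are assumptions.
REQUIRED_ALWAYS = ("RAGBENCH_LLM_PROVIDER", "RAGBENCH_GEN_MODEL", "RAGBENCH_JUDGE_MODEL")
# Assumed mapping from provider value to the credential it needs.
PROVIDER_KEYS = {"groq": "GROQ_API_KEY", "hf": "HF_TOKEN"}


def missing_settings(env):
    """Return the names of settings that are absent or empty in `env`."""
    missing = [name for name in REQUIRED_ALWAYS if not env.get(name)]
    provider = (env.get("RAGBENCH_LLM_PROVIDER") or "").strip().lower()
    key = PROVIDER_KEYS.get(provider)
    if key and not env.get(key):
        missing.append(key)
    return missing


if __name__ == "__main__":
    problems = missing_settings(os.environ)
    if problems:
        print("Missing settings:", ", ".join(problems))
    else:
        print("All required settings are present.")
```

How the project itself loads `.env` is not shown here, so treat this only as a quick pre-flight check.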

Then open `prompts/ragbench_judge_prompt.txt` and paste in the official JSON
annotation prompt from the RAGBench paper (Appendix 9.4), using the
placeholders `{documents}`, `{question}`, and `{answer}` for the inputs.
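
Because the annotation prompt asks for JSON, it is likely to contain literal braces, which would make `str.format` raise an error. The sketch below shows one way the three placeholders could be filled while leaving every other brace untouched. It is an illustration under that assumption, not the project's actual code, and `fill_judge_prompt` is a hypothetical name.

```python
import re

# Matches only the three placeholders named above; other braces are ignored.
_PLACEHOLDER = re.compile(r"\{(documents|question|answer)\}")


def fill_judge_prompt(template, documents, question, answer):
    """Substitute the three placeholders in a single pass over the template."""
    values = {"documents": documents, "question": question, "answer": answer}
    missing = [name for name in values if "{" + name + "}" not in template]
    if missing:
        raise ValueError("template is missing placeholders: " + ", ".join(missing))
    # A function replacement avoids backslash handling in the inserted text,
    # and the single pass means inserted text is never re-scanned.
    return _PLACEHOLDER.sub(lambda match: values[match.group(1)], template)
```

A single-pass substitution also means that a document which happens to contain the text `{question}` is inserted verbatim rather than being replaced a second time.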

### Run an experiment from the CLI

```bash
python -m scripts.run_experiment --domain biomedical --k 3 --max_examples 10
```

## 2. Run FastAPI locally (no Docker)

```bash
uvicorn app.main:app --host 0.0.0.0 --port 7860
```

Then open:

- `http://localhost:7860/health`
- `http://localhost:7860/docs` (Swagger UI)

To run an evaluation, send a POST request to `/run_domain` with a JSON body:

```json
{
  "domain": "biomedical",
  "k": 3,
  "max_examples": 10,
  "split": "test"
}
```
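
The same request can be sent from a script instead of Swagger UI. This is a minimal sketch using only the Python standard library. The endpoint path and body fields are taken from the example above; the helper names, the timeout value, and the assumption that the endpoint returns a JSON body are illustrative.

```python
import json
import urllib.request


def build_run_domain_request(base_url, domain, k=3, max_examples=10, split="test"):
    """Build a POST request for /run_domain with the JSON body shown above."""
    payload = {"domain": domain, "k": k, "max_examples": max_examples, "split": split}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/run_domain",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_domain(base_url, **kwargs):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_run_domain_request(base_url, **kwargs)
    # Evaluation may take a while; the 600 s timeout is an arbitrary choice.
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running locally, `run_domain("http://localhost:7860", domain="biomedical")` would submit the same payload as the JSON example.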

## 3. Run with Docker (local laptop)

Build and run:

```bash
docker compose build
docker compose up
```

The API will be available at `http://localhost:8000`.

## 4. Deploy to Hugging Face Space (Docker)

1. Create a new Space with SDK = Docker.
2. Push this repo to the Space Git URL.
3. In the Space settings, add the following variables/secrets:

   - `HF_TOKEN`
   - `GROQ_API_KEY`
   - `RAGBENCH_LLM_PROVIDER`
   - `RAGBENCH_GEN_MODEL`
   - `RAGBENCH_JUDGE_MODEL`

4. Once the Space builds successfully, open `/docs` on the Space URL to run
   `/run_domain` for each domain via Swagger UI.