docs: Add comprehensive JOB_SUBMISSION.md guide with accurate pricing
- Add complete job submission documentation for HF Jobs and Modal
- Include accurate per-second pricing for both platforms
- HuggingFace Jobs: $0.40-2.50/hr (based on HF Spaces GPU pricing)
- Modal: $0.59-6.25/hr (verified rates from Modal pricing)
- Correct billing model: both platforms use per-second billing (no minimums)
- Add hardware selection guide with auto-selection logic
- Include cost estimation, monitoring, and troubleshooting sections
- Provide step-by-step submission workflow with examples
- Add cost comparison tables and optimization tips
- Update README.md with corrected technology stack details
References:
- https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs
- https://huggingface.co/docs/hub/en/spaces-gpus
- https://modal.com/pricing
- JOB_SUBMISSION.md +971 -0
- README.md +2 -2

# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:
- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ **Per-second billing** with no minimum duration

**Choose based on your needs**:
- **HuggingFace Jobs**: Best if you already have an HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:
- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):
- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*
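
The per-second arithmetic above can be sketched in Python. The rates are the HF Jobs figures quoted in the list; the function names are illustrative, not part of TraceMind:

```python
# Per-second billing: cost = hourly_rate / 3600 * runtime_seconds.
# Rates below are the HF Jobs figures quoted above (USD/hour).
HOURLY_RATES = {
    "t4-small": 0.40,
    "a10g-large": 1.50,
    "a100-large": 2.50,
}

def per_second_rate(flavor: str) -> float:
    """Convert an hourly rate to the per-second rate used for billing."""
    return HOURLY_RATES[flavor] / 3600

def job_cost(flavor: str, runtime_seconds: int) -> float:
    """Cost of a job billed per second, with no minimum duration."""
    return per_second_rate(flavor) * runtime_seconds

# A 25-minute evaluation on a10g-large costs about $0.62
print(round(job_cost("a10g-large", 25 * 60), 2))
```

Because there is no minimum, a job that fails after 90 seconds on `t4-small` costs about $0.01, not a full hour.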

**Pros**:
- Simple authentication (HuggingFace token)
- Integrated with the HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:
- Requires an HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)

**When to use**:
- ✅ You already have an HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:
- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)

**Pros**:
- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:
- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with the HF ecosystem

**When to use**:
- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to the latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have an HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing Leaderboard (Free)

**Required**:
- HuggingFace account (free)
- HuggingFace token with **Read** permissions

**How to get**:
1. Go to https://huggingface.co/settings/tokens
2. Create a new token with **Read** permission
3. Copy the token (starts with `hf_...`)
4. Add it to the TraceMind Settings tab

### For Submitting Jobs to HuggingFace Jobs

**Required**:
1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add a credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add a credit card for compute charges
3. Create a token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy the token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:
1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Create a Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create an API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy the **Token ID** (starts with `ak-...`)
   - Copy the **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys

---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:
- Model size (extracted from the model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)

**Auto-selection logic**:

**For API Models** (provider = `litellm` or `inference`):
- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:
- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
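
The auto-selection logic for local models can be sketched with the ~4.5GB-per-1B-params rule of thumb and the Modal tiers from the table above. This is a simplified illustration, not TraceMind's actual selector (which also rounds small models down to `gpu_t4`):

```python
def estimate_vram_gb(params_billion: float) -> float:
    """Rough VRAM need for agentic workloads: FP16 weights (~2 GB per 1B
    params) plus KV cache and inference overhead, ~4.5 GB per 1B total."""
    return 4.5 * params_billion

# Modal GPU tiers ordered by cost, with VRAM in GB (from the table above)
MODAL_GPUS = [("gpu_t4", 16), ("gpu_l40s", 48), ("gpu_a100_80gb", 80), ("gpu_h200", 141)]

def auto_select(params_billion: float, provider: str = "transformers") -> str:
    if provider in ("litellm", "inference"):
        return "cpu"  # API models never need a GPU
    need = estimate_vram_gb(params_billion)
    for name, vram in MODAL_GPUS:
        if vram >= need:
            return name  # cheapest tier whose VRAM fits the estimate
    return MODAL_GPUS[-1][0]  # largest available; may require quantization

print(auto_select(8))             # 8B → ~36 GB → gpu_l40s
print(auto_select(3, "litellm"))  # API model → cpu
```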

### Manual Selection

If you know your model's requirements, you can manually select hardware:

**CPU Jobs** (API models like GPT-4, Claude):
- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):
- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):
- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):
- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):
- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:
- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models

---

## Submitting a Job

### Step 1: Navigate to New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:
- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:
- Use `auto` (recommended) or select specific hardware
- See the [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:
- Enter the model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:
- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For the HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):
- Leave empty unless using the HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):
- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:
- `tool` - Function-calling agents only
- `code` - Code-execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:
- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires a Serper API key
- `brave` - Requires a Brave Search API key

**Enable Optional Tools**:
- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:
- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:
- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:
- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:
- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and API rate-limit pressure

### Step 6: Configure Output & Monitoring

**Output Format**:
- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires an output directory)

**Output Directory**:
- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:
- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization

**Enable GPU Metrics**:
- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs

**Private Datasets**:
- ☐ Make result datasets private on HuggingFace
- Default: Public datasets

**Debug Mode**:
- ☐ Enable verbose logging for troubleshooting
- Default: Off

**Quiet Mode**:
- ☐ Reduce output verbosity
- Default: Off

**Run ID** (optional):
- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:
- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- The job is terminated if it exceeds the timeout
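
Timeout strings like these can be parsed into seconds with a few lines of Python. A minimal sketch for illustration; TraceMind's actual parser may differ:

```python
import re

def parse_timeout(value: str) -> int:
    """Parse a timeout like '30m', '1h', or '45s' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not match:
        raise ValueError(f"invalid timeout: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"s": 1, "m": 60, "h": 3600}[unit]

print(parse_timeout("1h"))   # 3600
print(parse_timeout("30m"))  # 1800
```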

### Step 7: Estimate Cost (Optional but Recommended)

1. Click the **💰 Estimate Cost** button
2. Wait for the AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:
- **Historical Data**: Based on previous runs of the same model in the leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit Job

1. Review all configurations
2. Click the **🚀 Submit Evaluation** button
3. Wait for the confirmation message
4. Copy the job ID for tracking

**The confirmation message includes**:
- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):
- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays the number of historical runs used

**MCP AI Analysis** (when no historical data):
- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes a detailed breakdown and recommendations

### Cost Factors

**For HuggingFace Jobs**:
1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)

**For Modal**:
1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)

### Cost Optimization Tips

**Use Auto Hardware Selection**:
- Automatically picks the cheapest hardware for your model
- Avoids over-provisioning (e.g., an H200 for a 3B model)

**Choose the Right Infrastructure**:
- **If you have HF Pro**: Use HF Jobs (you're already paying the subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For the latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)

**Optimize Model Selection**:
- Smaller models (3B-7B) are roughly 10x cheaper than large models (70B)
- API models (GPT-4-mini) are often cheaper than local 70B models

**Reduce Test Count**:
- Use the difficulty filter (`easy` only) for quick validation
- Test with a small dataset first, then scale up

**Parallel Workers**:
- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.88** | $1.88** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)
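
The starred figures follow directly from per-second billing and the hourly rates listed earlier. A quick check in Python (illustrative helper, not part of TraceMind):

```python
def run_cost(hourly_rate: float, minutes: float) -> float:
    """Per-second billing: pay only for actual runtime, no minimum."""
    return hourly_rate / 3600 * minutes * 60

# Rows from the comparison table above
print(round(run_cost(1.50, 25), 3))  # HF Jobs a10g-large, 25 min → ~$0.63
print(round(run_cost(1.95, 20), 2))  # Modal gpu_l40s, 20 min → $0.65
print(round(run_cost(2.50, 45), 2))  # A100-80GB, 45 min → $1.88
```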

---

## Monitoring Jobs

### HuggingFace Jobs

**Via the HuggingFace Dashboard**:
1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click a job to view its logs

**Job Statuses**:
- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped

### Modal

**Via the Modal Dashboard**:
1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check the Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:
```
1. Go to the Job Monitoring tab
2. Click on your job
3. Click the "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:
```
1. Go to https://modal.com/apps
2. Find your app
3. Click the "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):
- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):
- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical
| 599 |
+
|
| 600 |
+
**Progress Indicators**:
|
| 601 |
+
1. ⏳ Job queued (0-2 minutes)
|
| 602 |
+
2. 🔄 Downloading model (5-15 minutes for first run)
|
| 603 |
+
3. 🧪 Running evaluation (10-30 minutes)
|
| 604 |
+
4. 📤 Uploading results to HuggingFace (1-2 minutes)
|
| 605 |
+
5. ✅ Complete
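The timeline above can be turned into a rough back-of-envelope estimate. This is a sketch under stated assumptions: `bandwidth_gb_per_min` is an assumed download speed (~17 MB/s), not a measured Hub rate.

```python
def estimate_job_minutes(download_gb: float,
                         eval_minutes: float,
                         queue_minutes: float = 2.0,
                         upload_minutes: float = 2.0,
                         bandwidth_gb_per_min: float = 1.0) -> float:
    """Rough end-to-end duration estimate for a transformers-provider job.

    bandwidth_gb_per_min is an assumption; real download speed varies
    by region and Hub load.
    """
    download_minutes = download_gb / bandwidth_gb_per_min
    return queue_minutes + download_minutes + eval_minutes + upload_minutes

# An 8B model (~16 GB download) with a 20-minute evaluation:
print(estimate_job_minutes(download_gb=16, eval_minutes=20))  # 40.0
```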

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if output_format = "hub"):

SMOLTRACE creates 4 datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in TraceMind Leaderboard tab
   - Public, shared across all users

2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to traces dataset

3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if enable_otel = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in TraceMind Trace Visualization

4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if enable_gpu_metrics = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test

**Local JSON Files** (if output_format = "json"):
- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
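The naming pattern above can be built up programmatically. A sketch only: how slashes in the model ID are sanitized (flattened to dashes here) is an assumption, not confirmed SMOLTRACE behavior.

```python
def result_dataset_names(username: str, model: str, timestamp: str) -> dict:
    """Build per-run dataset repo IDs following the naming pattern above.

    Flattening "/" in the model ID to "-" is an assumption for
    illustration; check your HF profile for the actual names.
    """
    slug = model.replace("/", "-")
    return {
        "results": f"{username}/agent-results-{slug}-{timestamp}",
        "traces": f"{username}/agent-traces-{slug}-{timestamp}",
        "metrics": f"{username}/agent-metrics-{slug}-{timestamp}",
    }

names = result_dataset_names("alice", "meta-llama/Llama-3.1-8B-Instruct", "20250101-1200")
print(names["results"])
```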

### Viewing Results in TraceMind

**Step 1: Refresh Leaderboard**
1. Go to **📊 Leaderboard** tab
2. Click **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**
1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if enable_otel = True)
1. From run details, click on a test case
2. Click **View Trace** button
3. See OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if GPU job)

**Step 4: Ask Questions About Results**
1. Go to **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |
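The headline metrics in the table can be derived from per-test records. A minimal sketch; the field names (`success`, `duration_s`, `cost_usd`, `tokens`) are assumptions mirroring the Results Dataset columns, not guaranteed schema:

```python
def summarize(results: list) -> dict:
    """Aggregate per-test records into the headline metrics above."""
    n = len(results)
    passed = sum(1 for r in results if r["success"])
    return {
        "success_rate": 100.0 * passed / n,
        "avg_duration_s": sum(r["duration_s"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "total_tokens": sum(r["tokens"] for r in results),
    }

demo = [
    {"success": True, "duration_s": 3.2, "cost_usd": 0.004, "tokens": 812},
    {"success": True, "duration_s": 4.1, "cost_usd": 0.005, "tokens": 977},
    {"success": False, "duration_s": 9.8, "cost_usd": 0.009, "tokens": 2140},
]
print(round(summarize(demo)["success_rate"], 1))  # 66.7
```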

**Common Patterns**:

**High accuracy, low cost**:
- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:
- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:
- ⚠️ May need prompt optimization or better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:
- ❌ Poor choice, investigate or switch models
- May indicate configuration issues
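The four patterns above amount to a two-axis bucketing. A sketch with assumed cutoffs (70% accuracy, $0.01 per test), these are illustrative thresholds, not TraceMind defaults:

```python
def classify_run(success_rate: float, cost_per_test: float,
                 accuracy_cutoff: float = 70.0,
                 cost_cutoff: float = 0.01) -> str:
    """Bucket a run into the four accuracy/cost patterns above.

    The cutoffs are illustrative assumptions; tune them to your budget.
    """
    accuracy = "high" if success_rate >= accuracy_cutoff else "low"
    cost = "high" if cost_per_test >= cost_cutoff else "low"
    return f"{accuracy} accuracy, {cost} cost"

print(classify_run(92.0, 0.002))  # high accuracy, low cost
```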

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**
- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to Settings tab
  2. Add HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**
- **Cause**: HF Jobs requires Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**
- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token
  3. Copy Token ID and Token Secret
  4. Add to Settings tab
  5. Try again

**Error: "Modal package not installed"**
- **Cause**: Modal SDK missing (should not happen in hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`
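Most of the submission failures above come down to missing credentials, which can be checked before submitting. A preflight sketch; the env var names match the Environment Variables section of this guide, but the per-backend grouping is an assumption:

```python
import os

# Which env vars each backend needs, per this guide's credentials section.
REQUIRED = {
    "hf-jobs": ["HF_TOKEN"],
    "modal": ["HF_TOKEN", "MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"],
}

def missing_credentials(infrastructure: str) -> list:
    """Return the names of env vars that are unset for the chosen backend."""
    return [name for name in REQUIRED[infrastructure] if not os.environ.get(name)]

for backend in REQUIRED:
    gaps = missing_credentials(backend)
    print(backend, "-> missing:", gaps or "none")
```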

### Job Execution Failures

**Job stuck in "Pending" status**
- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try different infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**
- **Cause**: Model too large for selected hardware
- **Fix**:
  - Use larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1
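Out-of-memory failures can usually be predicted before submission. A heuristic sketch, assuming fp16/bf16 weights (2 bytes per parameter) plus roughly 20% overhead for activations and KV cache; real usage varies with context length and batch size:

```python
def fits_in_vram(params_billion: float, vram_gb: int,
                 bytes_per_param: int = 2, overhead: float = 1.2) -> bool:
    """Rough check that a model fits on a GPU before submitting.

    Heuristic assumption: fp16/bf16 weights plus ~20% overhead.
    Not a guarantee; quantization or long contexts change the math.
    """
    needed_gb = params_billion * bytes_per_param * overhead
    return needed_gb <= vram_gb

print(fits_in_vram(8, 24))   # True  (8B needs ~19.2 GB; A10G has 24 GB)
print(fits_in_vram(70, 80))  # False (70B needs ~168 GB in fp16)
```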

**Job fails with "Model not found"**
- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check model ID format: `organization/model-name`
  - For private models, add HF token with access
  - Verify model exists on HuggingFace Hub

**Job fails with "API key not set"**
- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to Settings tab
  2. Add API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit job again

**Job fails with "Rate limit exceeded"**
- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**
- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token (old one may be expired)
  3. Update tokens in Settings tab

### Results Not Appearing

**Results not in leaderboard after job completes**
- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check job logs for errors
  - Verify `output_format` was set to "hub"
  - Verify HF token has "Write" permission
  - Manually refresh leaderboard (click "Load Leaderboard")

**Traces not appearing**
- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run evaluation with `enable_otel = True`
  - Check traces dataset exists on your HF profile

**GPU metrics not showing**
- **Cause**: GPU metrics not enabled or CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify job used GPU hardware (not CPU)
  - Check metrics dataset exists

---

## Advanced Configuration

### Custom Test Datasets

**Create your own test dataset**:

1. Use **🔬 Synthetic Data Generator** tab:
   - Configure domain and tools
   - Generate custom tasks
   - Push to HuggingFace Hub

2. Use generated dataset in evaluation:
   - Set `dataset_name` to your dataset: `{username}/dataset-name`
   - Configure agent with matching tools

**Dataset Format Requirements**:
```python
{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
```
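Records can be checked against this format before pushing a dataset. A validation sketch; the allowed difficulty levels (`easy`/`medium`/`hard`) are an assumption beyond the single example shown above:

```python
REQUIRED_FIELDS = {"task_id", "prompt", "expected_tool", "difficulty", "category"}

def validate_task(record: dict) -> list:
    """Return a list of problems with a task record (empty = valid).

    Field names follow the format above; the difficulty levels checked
    here are assumed, not a documented SMOLTRACE constraint.
    """
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("difficulty") not in (None, "easy", "medium", "hard"):
        problems.append(f"unexpected difficulty: {record['difficulty']!r}")
    return problems

task = {
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage",
}
print(validate_task(task))  # []
```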

### Environment Variables

**LLM Provider API Keys** (in Settings):
- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API

**Infrastructure Credentials**:
- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret

### Parallel Execution

**Use `parallel_workers` to speed up evaluation**:

- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)

**Trade-offs**:
- ✅ **Faster**: Near-linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage

**Recommendations**:
- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
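The fan-out behind `parallel_workers` can be sketched with a thread pool. `run_test` below is a hypothetical stand-in for one evaluation call (an API request or local generation), not a real SMOLTRACE function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_test(task_id: str) -> dict:
    # Placeholder for one evaluation call; real work would hit an
    # LLM API or run local inference here.
    return {"task_id": task_id, "success": True}

def run_all(task_ids: list, parallel_workers: int = 1) -> list:
    """Run all test cases, sequentially or via a thread pool."""
    if parallel_workers <= 1:
        # Sequential: safest for API models (avoids rate limits).
        return [run_test(t) for t in task_ids]
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(run_test, task_ids))

results = run_all([f"task_{i:03d}" for i in range(8)], parallel_workers=4)
print(len(results))  # 8
```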

### Private Datasets

**Make results private**:

1. Set `private = True` in job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view in leaderboard (if using private leaderboard dataset)

**Use cases**:
- Proprietary models
- Confidential evaluation data
- Internal benchmarking

---

## Quick Reference

### Job Submission Checklist

Before submitting a job, verify:

- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is "hub" for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size

### Common Model Configurations

**OpenAI GPT-4**:
```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Anthropic Claude-3.5-Sonnet**:
```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Meta Llama-3.1-8B**:
```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```

**Meta Llama-3.1-70B**:
```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```

**Qwen-2.5-Coder-32B**:
```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```
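The estimated costs above follow directly from per-second billing (both platforms bill per second with no minimum). A sketch; the $1.95/hr L40S rate used in the example is an assumption for illustration, check https://modal.com/pricing for current rates:

```python
def estimate_cost(hourly_rate_usd: float, runtime_seconds: int) -> float:
    """Per-second billing: cost = hourly rate / 3600 * seconds, no minimum."""
    return round(hourly_rate_usd / 3600 * runtime_seconds, 4)

# A 30-minute Llama-3.1-8B run on an L40S at an assumed $1.95/hr:
print(estimate_cost(1.95, 30 * 60))  # 0.975
```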

---

## Next Steps

After submitting your first job:

1. **Monitor progress** in Job Monitoring tab
2. **View results** in Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community

For more help:
- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)

---

**README.md** (technology stack section, as updated in this commit):

- **Agent Framework**: smolagents 1.22.0+
- **MCP Integration**: MCP Python SDK + smolagents MCPClient
- **Data Source**: HuggingFace Datasets API
- **Authentication**: HuggingFace OAuth (planned)
- **AI Models**:
  - Agent: Google Gemini 2.5 Flash
  - MCP Server: Google Gemini 2.5 Flash
- **Cloud Platforms**: HuggingFace Jobs + Modal