docs: Add comprehensive JOB_SUBMISSION.md guide with accurate pricing
- Add complete job submission documentation for HF Jobs and Modal
- Include accurate per-second pricing for both platforms
- HuggingFace Jobs: $0.40-2.50/hr (based on HF Spaces GPU pricing)
- Modal: $0.59-6.25/hr (verified rates from Modal pricing)
- Correct billing model: both platforms use per-second billing (no minimums)
- Add hardware selection guide with auto-selection logic
- Include cost estimation, monitoring, and troubleshooting sections
- Provide step-by-step submission workflow with examples
- Add cost comparison tables and optimization tips
- Update README.md with corrected technology stack details
References:
- https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs
- https://huggingface.co/docs/hub/en/spaces-gpus
- https://modal.com/pricing
- JOB_SUBMISSION.md +971 -0
- README.md +2 -2

# Job Submission Guide

This guide explains how to submit agent evaluation jobs to run on cloud infrastructure using TraceMind-AI.

## Table of Contents

- [Overview](#overview)
- [Infrastructure Options](#infrastructure-options)
  - [HuggingFace Jobs](#huggingface-jobs)
  - [Modal](#modal)
- [Prerequisites](#prerequisites)
- [Hardware Selection Guide](#hardware-selection-guide)
- [Submitting a Job](#submitting-a-job)
- [Cost Estimation](#cost-estimation)
- [Monitoring Jobs](#monitoring-jobs)
- [Understanding Job Results](#understanding-job-results)
- [Troubleshooting](#troubleshooting)
- [Advanced Configuration](#advanced-configuration)

---

## Overview

TraceMind-AI allows you to submit SMOLTRACE evaluation jobs to two cloud platforms:

1. **HuggingFace Jobs** - Managed compute with GPU/CPU options
2. **Modal** - Serverless compute with pay-per-second billing

Both platforms:
- ✅ Run the same SMOLTRACE evaluation engine
- ✅ Push results automatically to HuggingFace datasets
- ✅ Appear in the TraceMind leaderboard when complete
- ✅ Collect OpenTelemetry traces and GPU metrics
- ✅ **Per-second billing** with no minimum duration

**Choose based on your needs**:
- **HuggingFace Jobs**: Best if you already have an HF Pro subscription ($9/month)
- **Modal**: Best if you need H200/H100 GPUs or want to avoid subscriptions

**Pricing Sources**:
- [HuggingFace Jobs Documentation](https://huggingface.co/docs/huggingface_hub/main/en/guides/jobs)
- [HuggingFace Spaces GPU Pricing](https://huggingface.co/docs/hub/en/spaces-gpus)
- [Modal GPU Pricing](https://modal.com/pricing)

---

## Infrastructure Options

### HuggingFace Jobs

**What it is**: Managed compute platform from HuggingFace with dedicated GPU/CPU instances.

**Pricing Model**: Subscription-based ($9/month HF Pro) + **per-second** GPU charges

**Hardware Options** (pricing from [HF Spaces GPU pricing](https://huggingface.co/docs/hub/en/spaces-gpus)):
- `cpu-basic` - 2 vCPU, 16GB RAM (Free with Pro)
- `cpu-upgrade` - 8 vCPU, 32GB RAM (Free with Pro)
- `t4-small` - NVIDIA T4 16GB, 4 vCPU, 15GB RAM ($0.40/hr = $0.000111/sec)
- `t4-medium` - NVIDIA T4 16GB, 8 vCPU, 30GB RAM ($0.60/hr = $0.000167/sec)
- `l4x1` - NVIDIA L4 24GB, 8 vCPU, 30GB RAM ($0.80/hr = $0.000222/sec)
- `l4x4` - 4x NVIDIA L4 96GB total, 48 vCPU, 186GB RAM ($3.80/hr = $0.001056/sec)
- `a10g-small` - NVIDIA A10G 24GB ($1.00/hr = $0.000278/sec)
- `a10g-large` - NVIDIA A10G 24GB (more compute) ($1.50/hr = $0.000417/sec)
- `a10g-largex2` - 2x NVIDIA A10G 48GB total ($3.00/hr = $0.000833/sec)
- `a10g-largex4` - 4x NVIDIA A10G 96GB total ($5.00/hr = $0.001389/sec)
- `a100-large` - NVIDIA A100 80GB, 12 vCPU, 142GB RAM ($2.50/hr = $0.000694/sec)
- `v5e-1x1` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x2` - Google Cloud TPU v5e (pricing TBD)
- `v5e-2x4` - Google Cloud TPU v5e (pricing TBD)

*Note: Jobs billing is **per-second** with no minimum. You only pay for actual compute time used.*
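
The per-second arithmetic above can be sketched in Python. The rates are the HF Jobs figures quoted in the list; the function names are illustrative, not part of TraceMind:

```python
# Per-second billing: cost = hourly_rate / 3600 * runtime_seconds.
# Rates below are the HF Jobs figures quoted above (USD/hour).
HOURLY_RATES = {
    "t4-small": 0.40,
    "a10g-large": 1.50,
    "a100-large": 2.50,
}

def per_second_rate(flavor: str) -> float:
    """Convert an hourly rate to the per-second rate used for billing."""
    return HOURLY_RATES[flavor] / 3600

def job_cost(flavor: str, runtime_seconds: int) -> float:
    """Cost of a job billed per second, with no minimum duration."""
    return per_second_rate(flavor) * runtime_seconds

# A 25-minute evaluation on a10g-large costs about $0.62
print(round(job_cost("a10g-large", 25 * 60), 2))
```

Because there is no minimum, a job that fails after 90 seconds on `t4-small` costs about $0.01, not a full hour.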

**Pros**:
- Simple authentication (HuggingFace token)
- Integrated with the HF ecosystem
- Job dashboard at https://huggingface.co/jobs
- Reliable infrastructure

**Cons**:
- Requires an HF Pro subscription ($9/month)
- Slightly more expensive than Modal for most GPUs
- Limited hardware options compared to Modal (no H100/H200)

**When to use**:
- ✅ You already have an HF Pro subscription
- ✅ You want simplicity and reliability
- ✅ You prefer HuggingFace ecosystem integration
- ✅ You prefer managed infrastructure

### Modal

**What it is**: Serverless compute platform with pay-per-second billing for CPU and GPU workloads.

**Pricing Model**: Pay-per-second usage (no subscription required)

**Hardware Options**:
- `cpu` - Physical core (2 vCPU equivalent) ($0.0000131/core/sec, min 0.125 cores)
- `gpu_t4` - NVIDIA T4 16GB ($0.000164/sec ≈ $0.59/hr)
- `gpu_l4` - NVIDIA L4 24GB ($0.000222/sec ≈ $0.80/hr)
- `gpu_a10` - NVIDIA A10G 24GB ($0.000306/sec ≈ $1.10/hr)
- `gpu_l40s` - NVIDIA L40S 48GB ($0.000542/sec ≈ $1.95/hr)
- `gpu_a100` - NVIDIA A100 40GB ($0.000583/sec ≈ $2.10/hr)
- `gpu_a100_80gb` - NVIDIA A100 80GB ($0.000694/sec ≈ $2.50/hr)
- `gpu_h100` - NVIDIA H100 80GB ($0.001097/sec ≈ $3.95/hr)
- `gpu_h200` - NVIDIA H200 141GB ($0.001261/sec ≈ $4.54/hr)
- `gpu_b200` - NVIDIA B200 192GB ($0.001736/sec ≈ $6.25/hr)

**Pros**:
- Pay-per-second (no hourly minimums)
- Wide range of GPUs (including H200, H100)
- No subscription required
- Real-time logs and monitoring
- Fast cold starts

**Cons**:
- Requires Modal account setup
- Need to configure API tokens (MODAL_TOKEN_ID + MODAL_TOKEN_SECRET)
- Network egress charges apply
- Less integrated with the HF ecosystem

**When to use**:
- ✅ You want to minimize costs (generally cheaper than HF Jobs)
- ✅ You need access to the latest GPUs (H200, H100, B200)
- ✅ You prefer serverless architecture
- ✅ You don't have an HF Pro subscription
- ✅ You want more GPU options and flexibility

---

## Prerequisites

### For Viewing Leaderboard (Free)

**Required**:
- HuggingFace account (free)
- HuggingFace token with **Read** permissions

**How to get**:
1. Go to https://huggingface.co/settings/tokens
2. Create a new token with **Read** permission
3. Copy the token (starts with `hf_...`)
4. Add it to the TraceMind Settings tab

### For Submitting Jobs to HuggingFace Jobs

**Required**:
1. **HuggingFace Pro** subscription ($9/month)
   - Sign up at https://huggingface.co/pricing
   - **Must add a credit card** for GPU compute charges
2. HuggingFace token with **Read + Write + Run Jobs** permissions
3. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Subscribe to HF Pro: https://huggingface.co/pricing
2. Add a credit card for compute charges
3. Create a token with all permissions:
   - Go to https://huggingface.co/settings/tokens
   - Click "New token"
   - Select: **Read**, **Write**, **Run Jobs**
   - Copy the token
4. Add API keys in TraceMind Settings:
   - HuggingFace Token
   - OpenAI API Key (if testing OpenAI models)
   - Anthropic API Key (if testing Claude models)
   - etc.

### For Submitting Jobs to Modal

**Required**:
1. Modal account (free to create, pay-per-use)
2. Modal API token (Token ID + Token Secret)
3. HuggingFace token with **Read + Write** permissions
4. LLM provider API keys (OpenAI, Anthropic, etc.) for API models

**How to set up**:
1. Create a Modal account:
   - Go to https://modal.com
   - Sign up (GitHub or email)
2. Create an API token:
   - Go to https://modal.com/settings/tokens
   - Click "Create token"
   - Copy the **Token ID** (starts with `ak-...`)
   - Copy the **Token Secret** (starts with `as-...`)
3. Add credentials in TraceMind Settings:
   - Modal Token ID
   - Modal Token Secret
   - HuggingFace Token (Read + Write)
   - LLM provider API keys

---

## Hardware Selection Guide

### Auto-Selection (Recommended)

Set hardware to **`auto`** to let TraceMind automatically select the optimal hardware based on:
- Model size (extracted from the model name)
- Provider type (API vs local)
- Infrastructure (HF Jobs vs Modal)

**Auto-selection logic**:

**For API Models** (provider = `litellm` or `inference`):
- Always uses **CPU** (no GPU needed)
- HF Jobs: `cpu-basic`
- Modal: `cpu`

**For Local Models** (provider = `transformers`):

*Memory estimation for agentic workloads*:
- Model weights (FP16): ~2GB per 1B params
- KV cache for long contexts: ~1.5-2x model size
- Inference overhead: ~20-30% additional
- **Total: ~4-5GB per 1B params for safe execution**

**HuggingFace Jobs**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `t4-small` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `t4-small` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `a10g-large` | 24GB | Llama-3.1-8B, Mistral-7B |
| 13B+ | `a100-large` | 80GB | Llama-3.1-70B, Qwen-14B |

**Modal**:

| Model Size | Hardware | VRAM | Example Models |
|------------|----------|------|----------------|
| < 1B | `gpu_t4` | 16GB | Qwen-0.5B, Phi-3-mini |
| 1B - 5B | `gpu_t4` | 16GB | Llama-3.2-3B, Gemma-2B |
| 6B - 12B | `gpu_l40s` | 48GB | Llama-3.1-8B, Mistral-7B |
| 13B - 24B | `gpu_a100_80gb` | 80GB | Llama-2-13B, Qwen-14B |
| 25B - 48B | `gpu_a100_80gb` | 80GB | Gemma-27B, Yi-34B |
| 49B+ | `gpu_h200` | 141GB | Llama-3.1-70B, Qwen-72B |
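
The auto-selection logic for local models can be sketched with the ~4.5GB-per-1B-params rule of thumb and the Modal tiers from the table above. This is a simplified illustration, not TraceMind's actual selector (which also rounds small models down to `gpu_t4`):

```python
def estimate_vram_gb(params_billion: float) -> float:
    """Rough VRAM need for agentic workloads: FP16 weights (~2 GB per 1B
    params) plus KV cache and inference overhead, ~4.5 GB per 1B total."""
    return 4.5 * params_billion

# Modal GPU tiers ordered by cost, with VRAM in GB (from the table above)
MODAL_GPUS = [("gpu_t4", 16), ("gpu_l40s", 48), ("gpu_a100_80gb", 80), ("gpu_h200", 141)]

def auto_select(params_billion: float, provider: str = "transformers") -> str:
    if provider in ("litellm", "inference"):
        return "cpu"  # API models never need a GPU
    need = estimate_vram_gb(params_billion)
    for name, vram in MODAL_GPUS:
        if vram >= need:
            return name  # cheapest tier whose VRAM fits the estimate
    return MODAL_GPUS[-1][0]  # largest available; may require quantization

print(auto_select(8))             # 8B → ~36 GB → gpu_l40s
print(auto_select(3, "litellm"))  # API model → cpu
```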

### Manual Selection

If you know your model's requirements, you can manually select hardware:

**CPU Jobs** (API models like GPT-4, Claude):
- HF Jobs: `cpu-basic` or `cpu-upgrade`
- Modal: `cpu`

**Small Models** (1B-5B params):
- HF Jobs: `t4-small` (16GB VRAM)
- Modal: `gpu_t4` (16GB VRAM)
- Examples: Llama-3.2-3B, Gemma-2B, Qwen-2.5-3B

**Medium Models** (6B-12B params):
- HF Jobs: `a10g-small` or `a10g-large` (24GB VRAM)
- Modal: `gpu_l40s` (48GB VRAM)
- Examples: Llama-3.1-8B, Mistral-7B, Qwen-2.5-7B

**Large Models** (13B-24B params):
- HF Jobs: `a100-large` (80GB VRAM)
- Modal: `gpu_a100_80gb` (80GB VRAM)
- Examples: Llama-2-13B, Qwen-14B, Mistral-22B

**Very Large Models** (25B+ params):
- HF Jobs: `a100-large` (80GB VRAM) - may need quantization
- Modal: `gpu_h200` (141GB VRAM) - recommended
- Examples: Llama-3.1-70B, Qwen-72B, Gemma-27B

**Cost vs Performance Trade-offs**:
- T4: Cheapest GPU, good for small models
- L4: Newer architecture, better performance than T4
- A10G: Good balance of cost/performance for medium models
- L40S: Best for 7B-12B models (Modal only)
- A100: Industry standard for large models
- H200: Latest GPU, massive VRAM (141GB), best for 70B+ models

---

## Submitting a Job

### Step 1: Navigate to New Evaluation Screen

1. Open TraceMind-AI
2. Click **▶️ New Evaluation** in the sidebar
3. You'll see a comprehensive configuration form

### Step 2: Configure Infrastructure

**Infrastructure Provider**:
- Choose `HuggingFace Jobs` or `Modal`

**Hardware**:
- Use `auto` (recommended) or select specific hardware
- See the [Hardware Selection Guide](#hardware-selection-guide)

### Step 3: Configure Model

**Model**:
- Enter the model ID (e.g., `openai/gpt-4`, `meta-llama/Llama-3.1-8B-Instruct`)
- Use HuggingFace format: `organization/model-name`

**Provider**:
- `litellm` - For API models (OpenAI, Anthropic, etc.)
- `inference` - For the HuggingFace Inference API
- `transformers` - For local models loaded with transformers

**HF Inference Provider** (optional):
- Leave empty unless using the HF Inference API
- Example: `openai-community/gpt2` for HF-hosted models

**HuggingFace Token** (optional):
- Leave empty if already configured in Settings
- Only needed for private models

### Step 4: Configure Agent

**Agent Type**:
- `tool` - Function-calling agents only
- `code` - Code-execution agents only
- `both` - Hybrid agents (recommended)

**Search Provider**:
- `duckduckgo` - Free, no API key required (recommended)
- `serper` - Requires a Serper API key
- `brave` - Requires a Brave Search API key

**Enable Optional Tools**:
- Select additional tools for the agent:
  - `google_search` - Google Search (requires API key)
  - `duckduckgo_search` - DuckDuckGo Search
  - `visit_webpage` - Web page scraping
  - `python_interpreter` - Python code execution
  - `wikipedia_search` - Wikipedia queries
  - `user_input` - User interaction (not recommended for batch eval)

### Step 5: Configure Test Dataset

**Dataset Name**:
- Default: `kshitijthakkar/smoltrace-tasks`
- Or use your own HuggingFace dataset
- Format: `username/dataset-name`

**Dataset Split**:
- Default: `train`
- Other options: `test`, `validation`

**Difficulty Filter**:
- `all` - All difficulty levels (recommended)
- `easy` - Easy tasks only
- `medium` - Medium tasks only
- `hard` - Hard tasks only

**Parallel Workers**:
- Default: `1` (sequential execution)
- Higher values (2-10) for faster execution
- ⚠️ Increases memory usage and API rate-limit pressure

### Step 6: Configure Output & Monitoring

**Output Format**:
- `hub` - Push to HuggingFace datasets (recommended)
- `json` - Save locally (requires an output directory)

**Output Directory**:
- Only for `json` format
- Example: `./evaluation_results`

**Enable OpenTelemetry Tracing**:
- ✅ Recommended - Collects detailed execution traces
- Traces appear in TraceMind trace visualization

**Enable GPU Metrics**:
- ✅ Recommended for GPU jobs
- Collects GPU utilization, memory, temperature, CO2 emissions
- No effect on CPU jobs

**Private Datasets**:
- ☐ Make result datasets private on HuggingFace
- Default: Public datasets

**Debug Mode**:
- ☐ Enable verbose logging for troubleshooting
- Default: Off

**Quiet Mode**:
- ☐ Reduce output verbosity
- Default: Off

**Run ID** (optional):
- Auto-generated UUID if left empty
- Custom ID for tracking specific runs

**Job Timeout**:
- Default: `1h` (1 hour)
- Other examples: `30m`, `2h`, `3h`
- The job is terminated if it exceeds the timeout
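
Timeout strings like these can be parsed into seconds with a few lines of Python. A minimal sketch for illustration; TraceMind's actual parser may differ:

```python
import re

def parse_timeout(value: str) -> int:
    """Parse a timeout like '30m', '1h', or '45s' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", value.strip())
    if not match:
        raise ValueError(f"invalid timeout: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"s": 1, "m": 60, "h": 3600}[unit]

print(parse_timeout("1h"))   # 3600
print(parse_timeout("30m"))  # 1800
```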

### Step 7: Estimate Cost (Optional but Recommended)

1. Click the **💰 Estimate Cost** button
2. Wait for the AI-powered cost analysis
3. Review:
   - Estimated total cost
   - Estimated duration
   - Hardware selection (if auto)
   - Historical data (if available)

**Cost Estimation Sources**:
- **Historical Data**: Based on previous runs of the same model in the leaderboard
- **MCP AI Analysis**: AI-powered estimation using Gemini 2.5 Flash (if no historical data)

### Step 8: Submit Job

1. Review all configurations
2. Click the **🚀 Submit Evaluation** button
3. Wait for the confirmation message
4. Copy the job ID for tracking

**The confirmation message includes**:
- ✅ Job submission status
- Job ID and platform-specific ID
- Hardware selected
- Estimated duration
- Monitoring instructions

### Example: Submit HuggingFace Jobs Evaluation

```
Infrastructure: HuggingFace Jobs
Hardware: auto → a10g-large
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $1.25
→ Duration: 25 minutes
→ Hardware: a10g-large (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ HF Job ID: username/job_abc123
→ Monitor at: https://huggingface.co/jobs
```

### Example: Submit Modal Evaluation

```
Infrastructure: Modal
Hardware: auto → L40S
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Agent Type: both
Dataset: kshitijthakkar/smoltrace-tasks
Output Format: hub

Click "Estimate Cost":
→ Estimated Cost: $0.95
→ Duration: 20 minutes
→ Hardware: gpu_l40s (auto-selected)

Click "Submit Evaluation":
→ ✅ Job submitted successfully!
→ Modal Call ID: modal-job_xyz789
→ Monitor at: https://modal.com/apps
```

---

## Cost Estimation

### Understanding Cost Estimates

TraceMind provides AI-powered cost estimation before you submit jobs:

**Historical Data** (most accurate):
- Based on actual runs of the same model
- Shows average cost and duration from past evaluations
- Displays the number of historical runs used

**MCP AI Analysis** (when no historical data):
- Powered by Google Gemini 2.5 Flash
- Analyzes model size, hardware, provider
- Estimates cost based on typical usage patterns
- Includes a detailed breakdown and recommendations

### Cost Factors

**For HuggingFace Jobs**:
1. **Hardware per-second rate** (see [Infrastructure Options](#huggingface-jobs))
2. **Evaluation duration** (actual runtime only, billed per-second)
3. **LLM API costs** (if using API models like GPT-4)
4. **HF Pro subscription** ($9/month required)

**For Modal**:
1. **Hardware per-second rate** (no minimums)
2. **Evaluation duration** (actual runtime only)
3. **Network egress** (data transfer out)
4. **LLM API costs** (if using API models)

### Cost Optimization Tips

**Use Auto Hardware Selection**:
- Automatically picks the cheapest hardware for your model
- Avoids over-provisioning (e.g., an H200 for a 3B model)

**Choose the Right Infrastructure**:
- **If you have HF Pro**: Use HF Jobs (you're already paying the subscription)
- **If you don't have HF Pro**: Use Modal (no subscription required)
- **For the latest GPUs (H200/H100)**: Use Modal (HF Jobs doesn't offer these)

**Optimize Model Selection**:
- Smaller models (3B-7B) are roughly 10x cheaper than large models (70B)
- API models (GPT-4-mini) are often cheaper than local 70B models

**Reduce Test Count**:
- Use the difficulty filter (`easy` only) for quick validation
- Test with a small dataset first, then scale up

**Parallel Workers**:
- Keep at 1 for sequential execution (cheapest)
- Increase only if time is critical (increases API costs)

**Example Cost Comparison**:

| Model | Hardware | Infrastructure | Duration | HF Jobs Cost | Modal Cost |
|-------|----------|----------------|----------|--------------|------------|
| GPT-4 (API) | CPU | Either | 5 min | Free* | ~$0.00* |
| Llama-3.1-8B | A10G-large | HF Jobs | 25 min | $0.63** | N/A |
| Llama-3.1-8B | L40S | Modal | 20 min | N/A | $0.65** |
| Llama-3.1-70B | A100-80GB | Both | 45 min | $1.88** | $1.88** |
| Llama-3.1-70B | H200 | Modal only | 35 min | N/A | $2.65** |

\* Plus LLM API costs (OpenAI/Anthropic/etc. - not included)
\** Per-second billing, actual runtime only (no minimums)
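
The starred figures follow directly from per-second billing and the hourly rates listed earlier. A quick check in Python (illustrative helper, not part of TraceMind):

```python
def run_cost(hourly_rate: float, minutes: float) -> float:
    """Per-second billing: pay only for actual runtime, no minimum."""
    return hourly_rate / 3600 * minutes * 60

# Rows from the comparison table above
print(round(run_cost(1.50, 25), 3))  # HF Jobs a10g-large, 25 min → ~$0.63
print(round(run_cost(1.95, 20), 2))  # Modal gpu_l40s, 20 min → $0.65
print(round(run_cost(2.50, 45), 2))  # A100-80GB, 45 min → $1.88
```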

---

## Monitoring Jobs

### HuggingFace Jobs

**Via the HuggingFace Dashboard**:
1. Go to https://huggingface.co/jobs
2. Find your job in the list
3. Click to view details and logs

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Real-time status updates
4. Click a job to view its logs

**Job Statuses**:
- `pending` - Waiting for resources
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Error occurred (check logs)
- `cancelled` - Manually stopped

### Modal

**Via the Modal Dashboard**:
1. Go to https://modal.com/apps
2. Find your app: `smoltrace-eval-{job_id}`
3. Click to view real-time logs and metrics

**Via the TraceMind Job Monitoring Tab**:
1. Click **📈 Job Monitoring** in the sidebar
2. See all your submitted jobs
3. Modal jobs show as `submitted` (check the Modal dashboard for details)

### Viewing Job Logs

**HuggingFace Jobs**:
```
1. Go to the Job Monitoring tab
2. Click on your job
3. Click the "View Logs" button
4. See real-time output from SMOLTRACE
```

**Modal**:
```
1. Go to https://modal.com/apps
2. Find your app
3. Click the "Logs" tab
4. See streaming output in real-time
```

### Expected Job Duration

**API Models** (litellm provider):
- CPU job: 2-5 minutes for 100 tests
- No model download required
- Depends on API rate limits

**Local Models** (transformers provider):
- Model download: 5-15 minutes (one-time per job)
  - 3B model: ~6GB download
  - 8B model: ~16GB download
  - 70B model: ~140GB download
- Evaluation: 10-30 minutes for 100 tests
- Total: 15-45 minutes typical
| 599 |
+
|
| 600 |
+
**Progress Indicators**:
|
| 601 |
+
1. ⏳ Job queued (0-2 minutes)
|
| 602 |
+
2. 🔄 Downloading model (5-15 minutes for first run)
|
| 603 |
+
3. 🧪 Running evaluation (10-30 minutes)
|
| 604 |
+
4. 📤 Uploading results to HuggingFace (1-2 minutes)
|
| 605 |
+
5. ✅ Complete
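The timeline above can be turned into a rough back-of-envelope estimate. This is a sketch under stated assumptions: `bandwidth_gb_per_min` is an assumed download speed (~17 MB/s), not a measured Hub rate.

```python
def estimate_job_minutes(download_gb: float,
                         eval_minutes: float,
                         queue_minutes: float = 2.0,
                         upload_minutes: float = 2.0,
                         bandwidth_gb_per_min: float = 1.0) -> float:
    """Rough end-to-end duration estimate for a transformers-provider job.

    bandwidth_gb_per_min is an assumption; real download speed varies
    by region and Hub load.
    """
    download_minutes = download_gb / bandwidth_gb_per_min
    return queue_minutes + download_minutes + eval_minutes + upload_minutes

# An 8B model (~16 GB download) with a 20-minute evaluation:
print(estimate_job_minutes(download_gb=16, eval_minutes=20))  # 40.0
```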

---

## Understanding Job Results

### Where Results Are Stored

**HuggingFace Datasets** (if output_format = "hub"):

SMOLTRACE creates 4 datasets for each evaluation:

1. **Leaderboard Dataset**: `huggingface/smolagents-leaderboard`
   - Aggregate statistics for the run
   - Appears in TraceMind Leaderboard tab
   - Public, shared across all users

2. **Results Dataset**: `{your_username}/agent-results-{model}-{timestamp}`
   - Individual test case results
   - Success/failure, execution time, tokens, cost
   - Links to traces dataset

3. **Traces Dataset**: `{your_username}/agent-traces-{model}-{timestamp}`
   - OpenTelemetry traces (if enable_otel = True)
   - Detailed execution steps, LLM calls, tool usage
   - Viewable in TraceMind Trace Visualization

4. **Metrics Dataset**: `{your_username}/agent-metrics-{model}-{timestamp}`
   - GPU metrics (if enable_gpu_metrics = True)
   - GPU utilization, memory, temperature, CO2 emissions
   - Time-series data for each test

**Local JSON Files** (if output_format = "json"):
- Saved to `output_dir` on the job machine
- Not automatically uploaded to HuggingFace
- Useful for local testing
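The naming pattern above can be built up programmatically. A sketch only: how slashes in the model ID are sanitized (flattened to dashes here) is an assumption, not confirmed SMOLTRACE behavior.

```python
def result_dataset_names(username: str, model: str, timestamp: str) -> dict:
    """Build per-run dataset repo IDs following the naming pattern above.

    Flattening "/" in the model ID to "-" is an assumption for
    illustration; check your HF profile for the actual names.
    """
    slug = model.replace("/", "-")
    return {
        "results": f"{username}/agent-results-{slug}-{timestamp}",
        "traces": f"{username}/agent-traces-{slug}-{timestamp}",
        "metrics": f"{username}/agent-metrics-{slug}-{timestamp}",
    }

names = result_dataset_names("alice", "meta-llama/Llama-3.1-8B-Instruct", "20250101-1200")
print(names["results"])
```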

### Viewing Results in TraceMind

**Step 1: Refresh Leaderboard**
1. Go to **📊 Leaderboard** tab
2. Click **Load Leaderboard** button
3. Your new run appears in the table

**Step 2: View Run Details**
1. Click on your run in the leaderboard
2. See detailed test results:
   - Individual test cases
   - Success/failure breakdown
   - Execution times
   - Token usage
   - Costs

**Step 3: Visualize Traces** (if enable_otel = True)
1. From run details, click on a test case
2. Click **View Trace** button
3. See OpenTelemetry waterfall diagram
4. Analyze:
   - LLM calls and durations
   - Tool executions
   - Reasoning steps
   - GPU metrics overlay (if GPU job)

**Step 4: Ask Questions About Results**
1. Go to **🤖 Agent Chat** tab
2. Ask questions like:
   - "Analyze my latest evaluation run"
   - "Why did test case 5 fail?"
   - "Compare my run with the top model"
   - "What was the cost breakdown?"

### Interpreting Results

**Key Metrics**:

| Metric | Description | Good Value |
|--------|-------------|------------|
| **Success Rate** | % of tests passed | >90% excellent, >70% good |
| **Avg Duration** | Time per test case | <5s good, <10s acceptable |
| **Total Cost** | Cost for all tests | Varies by model |
| **Tokens Used** | Total tokens consumed | Lower is better |
| **CO2 Emissions** | Carbon footprint | Lower is better |
| **GPU Utilization** | GPU usage % | >60% efficient |
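The headline metrics in the table can be derived from per-test records. A minimal sketch; the field names (`success`, `duration_s`, `cost_usd`, `tokens`) are assumptions mirroring the Results Dataset columns, not guaranteed schema:

```python
def summarize(results: list) -> dict:
    """Aggregate per-test records into the headline metrics above."""
    n = len(results)
    passed = sum(1 for r in results if r["success"])
    return {
        "success_rate": 100.0 * passed / n,
        "avg_duration_s": sum(r["duration_s"] for r in results) / n,
        "total_cost_usd": sum(r["cost_usd"] for r in results),
        "total_tokens": sum(r["tokens"] for r in results),
    }

demo = [
    {"success": True, "duration_s": 3.2, "cost_usd": 0.004, "tokens": 812},
    {"success": True, "duration_s": 4.1, "cost_usd": 0.005, "tokens": 977},
    {"success": False, "duration_s": 9.8, "cost_usd": 0.009, "tokens": 2140},
]
print(round(summarize(demo)["success_rate"], 1))  # 66.7
```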

**Common Patterns**:

**High accuracy, low cost**:
- ✅ Excellent model for production
- Examples: GPT-4-mini, Claude-3-Haiku, Gemini-1.5-Flash

**High accuracy, high cost**:
- ✅ Best for quality-critical tasks
- Examples: GPT-4, Claude-3.5-Sonnet, Gemini-1.5-Pro

**Low accuracy, low cost**:
- ⚠️ May need prompt optimization or better model
- Examples: Small local models (<3B params)

**Low accuracy, high cost**:
- ❌ Poor choice, investigate or switch models
- May indicate configuration issues
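The four patterns above amount to a two-axis bucketing. A sketch with assumed cutoffs (70% accuracy, $0.01 per test), these are illustrative thresholds, not TraceMind defaults:

```python
def classify_run(success_rate: float, cost_per_test: float,
                 accuracy_cutoff: float = 70.0,
                 cost_cutoff: float = 0.01) -> str:
    """Bucket a run into the four accuracy/cost patterns above.

    The cutoffs are illustrative assumptions; tune them to your budget.
    """
    accuracy = "high" if success_rate >= accuracy_cutoff else "low"
    cost = "high" if cost_per_test >= cost_cutoff else "low"
    return f"{accuracy} accuracy, {cost} cost"

print(classify_run(92.0, 0.002))  # high accuracy, low cost
```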

---

## Troubleshooting

### Job Submission Failures

**Error: "HuggingFace token not configured"**
- **Cause**: Missing or invalid HF token
- **Fix**:
  1. Go to Settings tab
  2. Add HF token with "Read + Write + Run Jobs" permissions
  3. Click "Save API Keys"

**Error: "HuggingFace Pro subscription required"**
- **Cause**: HF Jobs requires Pro subscription
- **Fix**:
  1. Subscribe at https://huggingface.co/pricing ($9/month)
  2. Add credit card for GPU charges
  3. Try again

**Error: "Modal credentials not configured"**
- **Cause**: Missing Modal API tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token
  3. Copy Token ID and Token Secret
  4. Add to Settings tab
  5. Try again

**Error: "Modal package not installed"**
- **Cause**: Modal SDK missing (should not happen in hosted Space)
- **Fix**: Contact support or run locally with `pip install modal`
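Most of the submission failures above come down to missing credentials, which can be checked before submitting. A preflight sketch; the env var names match the Environment Variables section of this guide, but the per-backend grouping is an assumption:

```python
import os

# Which env vars each backend needs, per this guide's credentials section.
REQUIRED = {
    "hf-jobs": ["HF_TOKEN"],
    "modal": ["HF_TOKEN", "MODAL_TOKEN_ID", "MODAL_TOKEN_SECRET"],
}

def missing_credentials(infrastructure: str) -> list:
    """Return the names of env vars that are unset for the chosen backend."""
    return [name for name in REQUIRED[infrastructure] if not os.environ.get(name)]

for backend in REQUIRED:
    gaps = missing_credentials(backend)
    print(backend, "-> missing:", gaps or "none")
```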

### Job Execution Failures

**Job stuck in "Pending" status**
- **Cause**: High demand for GPU resources
- **Fix**:
  - Wait 5-10 minutes
  - Try different hardware (e.g., T4 instead of A100)
  - Try different infrastructure (Modal vs HF Jobs)

**Job fails with "Out of Memory"**
- **Cause**: Model too large for selected hardware
- **Fix**:
  - Use larger GPU (A100-80GB or H200)
  - Or use `auto` hardware selection
  - Or reduce `parallel_workers` to 1
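Out-of-memory failures can usually be predicted before submission. A heuristic sketch, assuming fp16/bf16 weights (2 bytes per parameter) plus roughly 20% overhead for activations and KV cache; real usage varies with context length and batch size:

```python
def fits_in_vram(params_billion: float, vram_gb: int,
                 bytes_per_param: int = 2, overhead: float = 1.2) -> bool:
    """Rough check that a model fits on a GPU before submitting.

    Heuristic assumption: fp16/bf16 weights plus ~20% overhead.
    Not a guarantee; quantization or long contexts change the math.
    """
    needed_gb = params_billion * bytes_per_param * overhead
    return needed_gb <= vram_gb

print(fits_in_vram(8, 24))   # True  (8B needs ~19.2 GB; A10G has 24 GB)
print(fits_in_vram(70, 80))  # False (70B needs ~168 GB in fp16)
```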

**Job fails with "Model not found"**
- **Cause**: Invalid model ID or private model
- **Fix**:
  - Check model ID format: `organization/model-name`
  - For private models, add HF token with access
  - Verify model exists on HuggingFace Hub

**Job fails with "API key not set"**
- **Cause**: Missing LLM provider API key
- **Fix**:
  1. Go to Settings tab
  2. Add API key for your provider (OpenAI, Anthropic, etc.)
  3. Submit job again

**Job fails with "Rate limit exceeded"**
- **Cause**: Too many API requests
- **Fix**:
  - Reduce `parallel_workers` to 1
  - Use different model with higher rate limits
  - Wait and retry later

**Modal job fails with "Authentication failed"**
- **Cause**: Invalid Modal tokens
- **Fix**:
  1. Go to https://modal.com/settings/tokens
  2. Create new token (old one may be expired)
  3. Update tokens in Settings tab

### Results Not Appearing

**Results not in leaderboard after job completes**
- **Cause**: Dataset upload failed or not configured
- **Fix**:
  - Check job logs for errors
  - Verify `output_format` was set to "hub"
  - Verify HF token has "Write" permission
  - Manually refresh leaderboard (click "Load Leaderboard")

**Traces not appearing**
- **Cause**: OpenTelemetry not enabled
- **Fix**:
  - Re-run evaluation with `enable_otel = True`
  - Check traces dataset exists on your HF profile

**GPU metrics not showing**
- **Cause**: GPU metrics not enabled or CPU job
- **Fix**:
  - Re-run with `enable_gpu_metrics = True`
  - Verify job used GPU hardware (not CPU)
  - Check metrics dataset exists

---

## Advanced Configuration

### Custom Test Datasets

**Create your own test dataset**:

1. Use **🔬 Synthetic Data Generator** tab:
   - Configure domain and tools
   - Generate custom tasks
   - Push to HuggingFace Hub

2. Use generated dataset in evaluation:
   - Set `dataset_name` to your dataset: `{username}/dataset-name`
   - Configure agent with matching tools

**Dataset Format Requirements**:
```python
{
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage"
}
```
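Records can be checked against this format before pushing a dataset. A validation sketch; the allowed difficulty levels (`easy`/`medium`/`hard`) are an assumption beyond the single example shown above:

```python
REQUIRED_FIELDS = {"task_id", "prompt", "expected_tool", "difficulty", "category"}

def validate_task(record: dict) -> list:
    """Return a list of problems with a task record (empty = valid).

    Field names follow the format above; the difficulty levels checked
    here are assumed, not a documented SMOLTRACE constraint.
    """
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if record.get("difficulty") not in (None, "easy", "medium", "hard"):
        problems.append(f"unexpected difficulty: {record['difficulty']!r}")
    return problems

task = {
    "task_id": "task_001",
    "prompt": "What's the weather in Tokyo?",
    "expected_tool": "get_weather",
    "difficulty": "easy",
    "category": "tool_usage",
}
print(validate_task(task))  # []
```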

### Environment Variables

**LLM Provider API Keys** (in Settings):
- `OPENAI_API_KEY` - OpenAI API
- `ANTHROPIC_API_KEY` - Anthropic API
- `GOOGLE_API_KEY` or `GEMINI_API_KEY` - Google Gemini API
- `COHERE_API_KEY` - Cohere API
- `MISTRAL_API_KEY` - Mistral API
- `TOGETHER_API_KEY` - Together AI API
- `GROQ_API_KEY` - Groq API
- `REPLICATE_API_TOKEN` - Replicate API
- `ANYSCALE_API_KEY` - Anyscale API

**Infrastructure Credentials**:
- `HF_TOKEN` - HuggingFace token
- `MODAL_TOKEN_ID` - Modal token ID
- `MODAL_TOKEN_SECRET` - Modal token secret

### Parallel Execution

**Use `parallel_workers` to speed up evaluation**:

- `1` - Sequential execution (default, safest)
- `2-4` - Moderate parallelism (2-4x faster)
- `5-10` - High parallelism (5-10x faster, risky)

**Trade-offs**:
- ✅ **Faster**: Near-linear speedup with workers
- ⚠️ **Higher cost**: More API calls per minute
- ⚠️ **Rate limits**: May hit provider rate limits
- ⚠️ **Memory**: Increases GPU memory usage

**Recommendations**:
- API models: Keep at 1 (avoid rate limits)
- Local models: Can use 2-4 if GPU has enough VRAM
- Production runs: Use 1 for reliability
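The fan-out behind `parallel_workers` can be sketched with a thread pool. `run_test` below is a hypothetical stand-in for one evaluation call (an API request or local generation), not a real SMOLTRACE function:

```python
from concurrent.futures import ThreadPoolExecutor

def run_test(task_id: str) -> dict:
    # Placeholder for one evaluation call; real work would hit an
    # LLM API or run local inference here.
    return {"task_id": task_id, "success": True}

def run_all(task_ids: list, parallel_workers: int = 1) -> list:
    """Run all test cases, sequentially or via a thread pool."""
    if parallel_workers <= 1:
        # Sequential: safest for API models (avoids rate limits).
        return [run_test(t) for t in task_ids]
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(run_test, task_ids))

results = run_all([f"task_{i:03d}" for i in range(8)], parallel_workers=4)
print(len(results))  # 8
```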

### Private Datasets

**Make results private**:

1. Set `private = True` in job configuration
2. Results will be private on your HuggingFace profile
3. Only you can view in leaderboard (if using private leaderboard dataset)

**Use cases**:
- Proprietary models
- Confidential evaluation data
- Internal benchmarking

---

## Quick Reference

### Job Submission Checklist

Before submitting a job, verify:

- [ ] Infrastructure selected (HF Jobs or Modal)
- [ ] Hardware configured (auto or manual)
- [ ] Model ID is correct
- [ ] Provider matches model type
- [ ] API keys configured in Settings
- [ ] Dataset name is valid
- [ ] Output format is "hub" for TraceMind integration
- [ ] OpenTelemetry tracing enabled (if you want traces)
- [ ] GPU metrics enabled (if using GPU)
- [ ] Cost estimate reviewed
- [ ] Timeout is sufficient for your model size

### Common Model Configurations

**OpenAI GPT-4**:
```
Model: openai/gpt-4
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Anthropic Claude-3.5-Sonnet**:
```
Model: anthropic/claude-3.5-sonnet
Provider: litellm
Hardware: auto → cpu-basic
Infrastructure: Either (HF Jobs or Modal)
Estimated Cost: API costs only
```

**Meta Llama-3.1-8B**:
```
Model: meta-llama/Llama-3.1-8B-Instruct
Provider: transformers
Hardware: auto → a10g-large (HF) or gpu_l40s (Modal)
Infrastructure: Modal (cheaper for short jobs)
Estimated Cost: $0.75-1.50
```

**Meta Llama-3.1-70B**:
```
Model: meta-llama/Llama-3.1-70B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_h200 (Modal)
Infrastructure: Modal (if available), else HF Jobs
Estimated Cost: $3.00-8.00
```

**Qwen-2.5-Coder-32B**:
```
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Provider: transformers
Hardware: auto → a100-large (HF) or gpu_a100_80gb (Modal)
Infrastructure: Either
Estimated Cost: $2.00-4.00
```
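The estimated costs above follow directly from per-second billing (both platforms bill per second with no minimum). A sketch; the $1.95/hr L40S rate used in the example is an assumption for illustration, check https://modal.com/pricing for current rates:

```python
def estimate_cost(hourly_rate_usd: float, runtime_seconds: int) -> float:
    """Per-second billing: cost = hourly rate / 3600 * seconds, no minimum."""
    return round(hourly_rate_usd / 3600 * runtime_seconds, 4)

# A 30-minute Llama-3.1-8B run on an L40S at an assumed $1.95/hr:
print(estimate_cost(1.95, 30 * 60))  # 0.975
```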

---

## Next Steps

After submitting your first job:

1. **Monitor progress** in Job Monitoring tab
2. **View results** in Leaderboard when complete
3. **Analyze traces** in Trace Visualization
4. **Ask questions** in Agent Chat about your results
5. **Compare** with other models using Compare feature
6. **Optimize** model selection based on cost/accuracy trade-offs
7. **Generate** custom test datasets for your domain
8. **Share** your results with the community

For more help:
- [USER_GUIDE.md](USER_GUIDE.md) - Complete screen-by-screen walkthrough
- [MCP_INTEGRATION.md](MCP_INTEGRATION.md) - MCP client architecture details
- [ARCHITECTURE.md](ARCHITECTURE.md) - Technical architecture overview
- GitHub Issues: [TraceMind-AI/issues](https://github.com/Mandark-droid/TraceMind-AI/issues)

---

**README.md** (technology stack section, as updated in this commit):

- **Agent Framework**: smolagents 1.22.0+
- **MCP Integration**: MCP Python SDK + smolagents MCPClient
- **Data Source**: HuggingFace Datasets API
- **Authentication**: HuggingFace OAuth (planned)
- **AI Models**:
  - Agent: Google Gemini 2.5 Flash
  - MCP Server: Google Gemini 2.5 Flash
- **Cloud Platforms**: HuggingFace Jobs + Modal