Commit 1e21c93
Parent(s): cdba763
fix: Use ast.literal_eval() for MCP tool string returns and correct parameter names
MCP tools return string representations of Python dicts (with single quotes), not actual dict objects or JSON strings. This requires using ast.literal_eval() to parse them safely.
Changes:
- Updated all examples to use ast.literal_eval() pattern
- Fixed parameter names: leaderboard_repo → repo
- Updated rule #4 to explain MCP tool return types
- Added defensive isinstance() checks to handle both strings and dicts
This fixes the "TypeError: string indices must be integers" error.
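A minimal sketch of the parsing pattern the diff below standardizes on; the raw string and its keys here are illustrative, not actual MCP tool output:

```python
import ast

def parse_mcp_result(raw):
    # MCP tools can return a string like "{'summary': {...}}" (single quotes,
    # so json.loads() would reject it); ast.literal_eval() parses it safely.
    # If a tool already returned a dict, pass it through unchanged.
    return ast.literal_eval(raw) if isinstance(raw, str) else raw

raw = "{'summary': {'total_runs': 51, 'avg_success_rate': 87.5}}"  # illustrative value
result = parse_mcp_result(raw)
print(result['summary']['total_runs'])  # -> 51
```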
prompts/code_agent.yaml  +28 -13
@@ -25,13 +25,15 @@ system_prompt: |-
 ---
 Task: "What are the top 3 performing models on the leaderboard and how much do they cost?"

-Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!).
+Thought: This is a "top N" query, so I should use the optimized `run_get_top_performers` tool instead of run_get_dataset to avoid loading all 51 runs (saves 90% tokens!). MCP tools return string representations of dicts, so I need to use ast.literal_eval() to parse them.
 ```python
-
-
+import ast
+top_models_raw = run_get_top_performers(
+    repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=3
 )
+top_models_data = ast.literal_eval(top_models_raw) if isinstance(top_models_raw, str) else top_models_raw
 print(f"Top 3 models by {top_models_data['metric_ranked_by']}:")
 for model in top_models_data['top_performers']:
     print(f" - {model['model']}: {model['success_rate']}% success, ${model['total_cost_usd']}/run")
@@ -69,20 +71,23 @@ system_prompt: |-
 ---
 Task: "Analyze the current leaderboard and show me the top performing models with their costs"

-Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. MCP tools return
+Thought: This is an overview question about the leaderboard. I should use run_get_leaderboard_summary for high-level statistics (99% token reduction!), then run_get_top_performers for the top models with costs. This is much more efficient than loading all 51 runs with run_get_dataset. MCP tools return string representations of dicts.
 ```python
+import ast
 # Get overview statistics
-
-
+summary_raw = run_get_leaderboard_summary(
+    repo="kshitijthakkar/smoltrace-leaderboard"
 )
+summary_data = ast.literal_eval(summary_raw) if isinstance(summary_raw, str) else summary_raw
 summary = summary_data['summary']

 # Get top 5 performers
-
-
+top_raw = run_get_top_performers(
+    repo="kshitijthakkar/smoltrace-leaderboard",
     metric="success_rate",
     top_n=5
 )
+top_models_data = ast.literal_eval(top_raw) if isinstance(top_raw, str) else top_raw
 top_models = top_models_data['top_performers']

 print(f"Leaderboard Overview:")
@@ -124,15 +129,17 @@ system_prompt: |-
 ---
 Task: "Create a synthetic dataset of 20 finance-related tasks for testing agents with stock price and ROI calculation tools"

-Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty.
+Thought: I will use the run_generate_synthetic_dataset tool to create domain-specific test tasks. I'll specify the finance domain, provide the tool names, and request 20 tasks with balanced difficulty. MCP tools return string representations of dicts.
 ```python
-
+import ast
+synthetic_raw = run_generate_synthetic_dataset(
     domain="finance",
     tool_names="get_stock_price,calculate_roi,fetch_company_info",
     num_tasks=20,
     difficulty_distribution="balanced",
     agent_type="both"
 )
+synthetic_result = ast.literal_eval(synthetic_raw) if isinstance(synthetic_raw, str) else synthetic_raw
 print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks")
 print(f"Batches used: {synthetic_result['dataset_info']['num_batches']}")
 print(f"Difficulty distribution: {synthetic_result['dataset_info']['difficulty_distribution']}")
@@ -164,17 +171,19 @@ system_prompt: |-
 ---
 Task: "Generate 50 customer support tasks and upload them to HuggingFace as 'my-org/smoltrace-customer-support-tasks'"

-Thought: I'll first generate the synthetic dataset with 50 tasks, then use run_push_dataset_to_hub to upload it to HuggingFace. This will require multiple batches since 50 tasks exceeds the 20-task single-batch limit. MCP tools return
+Thought: I'll first generate the synthetic dataset with 50 tasks, then use run_push_dataset_to_hub to upload it to HuggingFace. This will require multiple batches since 50 tasks exceeds the 20-task single-batch limit. MCP tools return string representations, so I need to parse them first.
 ```python
 import json
+import ast
 # Step 1: Generate synthetic dataset
-
+synthetic_raw = run_generate_synthetic_dataset(
     domain="customer_support",
     tool_names="search_knowledge_base,create_ticket,send_email,check_order_status",
     num_tasks=50,
     difficulty_distribution="progressive",
     agent_type="both"
 )
+synthetic_result = ast.literal_eval(synthetic_raw) if isinstance(synthetic_raw, str) else synthetic_raw
 print(f"Generated {synthetic_result['dataset_info']['num_tasks_generated']} tasks in {synthetic_result['dataset_info']['num_batches']} batches")

 # Step 2: Extract tasks array and convert to JSON string for push_dataset_to_hub
@@ -232,7 +241,13 @@ system_prompt: |-
    - For overview questions (e.g., "how many runs", "average success rate"): Use `run_get_leaderboard_summary()` (99% token savings!)
    - For leaderboard analysis with AI insights: Use `run_analyze_leaderboard()`
    - ONLY use `run_get_dataset()` for non-leaderboard datasets (traces, results, metrics)
-   - **IMPORTANT
+   - **IMPORTANT - MCP Tool Returns**: MCP tools return STRING representations of Python dicts (with single quotes). ALWAYS use this pattern:
+     ```python
+     import ast
+     result_raw = run_tool(...)
+     result = ast.literal_eval(result_raw) if isinstance(result_raw, str) else result_raw
+     ```
+     Then access dict keys normally: `result['key']`. Use json.dumps() when converting dict to JSON string (e.g., for push_dataset_to_hub).
 5. Call a tool only when needed, and never re-do a tool call that you previously did with the exact same parameters.
 6. Don't name any new variable with the same name as a tool: for instance don't name a variable 'final_answer'.
 7. Never create any notional variables in our code, as having these in your logs will derail you from the true variables.
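The updated rule #4 also covers the reverse direction: after parsing with ast.literal_eval(), json.dumps() turns the part you need back into a JSON string for the upload step (the "Step 2" comment in the diff converts the tasks array this way for push_dataset_to_hub). A rough sketch, with the 'tasks' key and the sample values assumed for illustration rather than taken from real tool output:

```python
import ast
import json

# Illustrative stand-in for what run_generate_synthetic_dataset might return as a string
synthetic_raw = "{'dataset_info': {'num_tasks_generated': 2}, 'tasks': [{'id': 1}, {'id': 2}]}"
synthetic_result = ast.literal_eval(synthetic_raw) if isinstance(synthetic_raw, str) else synthetic_raw

# json.dumps() produces double-quoted JSON text from the parsed Python objects
tasks_json = json.dumps(synthetic_result['tasks'])
print(tasks_json)  # -> [{"id": 1}, {"id": 2}]
```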
|