Commit · dacd086
1 Parent(s): 4878d51
docs: Archive brainstorming documents and related files
- Remove outdated brainstorming documentation from the `docs/brainstorming` directory, including README, PubMed, ClinicalTrials, Europe PMC, and embeddings meta files.
- Consolidate all archived documents into the `archive/` folder for better organization and future reference.
- Ensure that the README reflects the current state of the project and directs users to the appropriate resources.
- docs/brainstorming/README.md +0 -22
- docs/brainstorming/archive/00_ROADMAP_SUMMARY.md +0 -194
- docs/brainstorming/archive/01_PUBMED_IMPROVEMENTS.md +0 -125
- docs/brainstorming/archive/02_CLINICALTRIALS_IMPROVEMENTS.md +0 -193
- docs/brainstorming/archive/03_EUROPEPMC_IMPROVEMENTS.md +0 -211
- docs/brainstorming/archive/BRAINSTORM_EMBEDDINGS_META.md +0 -74
- docs/brainstorming/archive/UI_MODE_SELECTION_UX.md +0 -133
docs/brainstorming/README.md
DELETED
|
@@ -1,22 +0,0 @@
|
|
| 1 |
-
# Brainstorming
|
| 2 |
-
|
| 3 |
-
> **Status**: All brainstorming docs archived (December 2025)
|
| 4 |
-
|
| 5 |
-
This folder contained early hackathon ideation. All documents have been moved to `archive/`.
|
| 6 |
-
|
| 7 |
-
## Archived Documents
|
| 8 |
-
|
| 9 |
-
| Document | Status | Notes |
|
| 10 |
-
|----------|--------|-------|
|
| 11 |
-
| `00_ROADMAP_SUMMARY.md` | Superseded | See `docs/future-roadmap/` |
|
| 12 |
-
| `01_PUBMED_IMPROVEMENTS.md` | Future work | Rate limiting, full-text retrieval |
|
| 13 |
-
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | Partially done | Outcomes in SPEC-14, rest is future |
|
| 14 |
-
| `03_EUROPEPMC_IMPROVEMENTS.md` | Future work | Full-text, citations |
|
| 15 |
-
| `BRAINSTORM_EMBEDDINGS_META.md` | Closed | Decision: don't implement internal embeddings |
|
| 16 |
-
| `UI_MODE_SELECTION_UX.md` | Obsolete | Simple mode removed, Anthropic removed |
|
| 17 |
-
|
| 18 |
-
## Where to Look Now
|
| 19 |
-
|
| 20 |
-
- **Future improvements**: `docs/future-roadmap/`
|
| 21 |
-
- **Current architecture**: `docs/ARCHITECTURE.md`
|
| 22 |
-
- **Active bugs**: `docs/bugs/ACTIVE_BUGS.md`
docs/brainstorming/archive/00_ROADMAP_SUMMARY.md
DELETED
|
@@ -1,194 +0,0 @@
|
|
| 1 |
-
# DeepBoner Data Sources: Roadmap Summary
|
| 2 |
-
|
| 3 |
-
**Created**: 2024-11-27
|
| 4 |
-
**Purpose**: Future maintainability and hackathon continuation
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## Current State
|
| 9 |
-
|
| 10 |
-
### Working Tools
|
| 11 |
-
|
| 12 |
-
| Tool | Status | Data Quality |
|
| 13 |
-
|------|--------|--------------|
|
| 14 |
-
| PubMed | ✅ Works | Good (abstracts only) |
|
| 15 |
-
| ClinicalTrials.gov | ✅ Works | Good (filtered for interventional) |
|
| 16 |
-
| Europe PMC | ✅ Works | Good (includes preprints) |
|
| 17 |
-
|
| 18 |
-
### Removed Tools
|
| 19 |
-
|
| 20 |
-
| Tool | Status | Reason |
|
| 21 |
-
|------|--------|--------|
|
| 22 |
-
| bioRxiv | ❌ Removed | No search API - only date/DOI lookup |
|
| 23 |
-
|
| 24 |
-
---
|
| 25 |
-
|
| 26 |
-
## Priority Improvements
|
| 27 |
-
|
| 28 |
-
### P0: Critical (Do First)
|
| 29 |
-
|
| 30 |
-
1. **Add Rate Limiting to PubMed**
|
| 31 |
-
- NCBI will block us without it
|
| 32 |
-
- Use `limits` library (see reference repo)
|
| 33 |
-
- 3/sec without key, 10/sec with key
|
| 34 |
-
|
| 35 |
-
### P1: High Value, Medium Effort
|
| 36 |
-
|
| 37 |
-
2. **Add OpenAlex as 4th Source**
|
| 38 |
-
- Citation network (huge for drug repurposing)
|
| 39 |
-
- Concept tagging (semantic discovery)
|
| 40 |
-
- Already implemented in reference repo
|
| 41 |
-
- Free, no API key
|
| 42 |
-
|
| 43 |
-
3. **PubMed Full-Text via BioC**
|
| 44 |
-
- Get full paper text for PMC papers
|
| 45 |
-
- Already in reference repo
|
| 46 |
-
|
| 47 |
-
### P2: Nice to Have
|
| 48 |
-
|
| 49 |
-
4. **ClinicalTrials.gov Results**
|
| 50 |
-
- Get efficacy data from completed trials
|
| 51 |
-
- Requires more complex API calls
|
| 52 |
-
|
| 53 |
-
5. **Europe PMC Annotations**
|
| 54 |
-
- Text-mined entities (genes, drugs, diseases)
|
| 55 |
-
- Automatic entity extraction
|
| 56 |
-
|
| 57 |
-
---
|
| 58 |
-
|
| 59 |
-
## Effort Estimates
|
| 60 |
-
|
| 61 |
-
| Improvement | Effort | Impact | Priority |
|
| 62 |
-
|-------------|--------|--------|----------|
|
| 63 |
-
| PubMed rate limiting | 1 hour | Stability | P0 |
|
| 64 |
-
| OpenAlex basic search | 2 hours | High | P1 |
|
| 65 |
-
| OpenAlex citations | 2 hours | Very High | P1 |
|
| 66 |
-
| PubMed full-text | 3 hours | Medium | P1 |
|
| 67 |
-
| CT.gov results | 4 hours | Medium | P2 |
|
| 68 |
-
| Europe PMC annotations | 3 hours | Medium | P2 |
|
| 69 |
-
|
| 70 |
-
---
|
| 71 |
-
|
| 72 |
-
## Architecture Decision
|
| 73 |
-
|
| 74 |
-
### Option A: Keep Current + Add OpenAlex
|
| 75 |
-
|
| 76 |
-
```
|
| 77 |
-
User Query
|
| 78 |
-
↓
|
| 79 |
-
┌───────────────────┼───────────────────┐
|
| 80 |
-
↓ ↓ ↓
|
| 81 |
-
PubMed ClinicalTrials Europe PMC
|
| 82 |
-
(abstracts) (trials only) (preprints)
|
| 83 |
-
↓ ↓ ↓
|
| 84 |
-
└───────────────────┼───────────────────┘
|
| 85 |
-
↓
|
| 86 |
-
OpenAlex ← NEW
|
| 87 |
-
(citations, concepts)
|
| 88 |
-
↓
|
| 89 |
-
Orchestrator
|
| 90 |
-
↓
|
| 91 |
-
Report
|
| 92 |
-
```
|
| 93 |
-
|
| 94 |
-
**Pros**: Low risk, additive
|
| 95 |
-
**Cons**: More complexity, some overlap
|
| 96 |
-
|
| 97 |
-
### Option B: OpenAlex as Primary
|
| 98 |
-
|
| 99 |
-
```
|
| 100 |
-
User Query
|
| 101 |
-
↓
|
| 102 |
-
┌───────────────────┼───────────────────┐
|
| 103 |
-
↓ ↓ ↓
|
| 104 |
-
OpenAlex ClinicalTrials Europe PMC
|
| 105 |
-
(primary (trials only) (full-text
|
| 106 |
-
search) fallback)
|
| 107 |
-
↓ ↓ ↓
|
| 108 |
-
└───────────────────┼───────────────────┘
|
| 109 |
-
↓
|
| 110 |
-
Orchestrator
|
| 111 |
-
↓
|
| 112 |
-
Report
|
| 113 |
-
```
|
| 114 |
-
|
| 115 |
-
**Pros**: Simpler, citation network built-in
|
| 116 |
-
**Cons**: Lose some PubMed-specific features
|
| 117 |
-
|
| 118 |
-
### Recommendation: Option A
|
| 119 |
-
|
| 120 |
-
Keep current architecture working, add OpenAlex incrementally.
|
| 121 |
-
|
| 122 |
-
---
|
| 123 |
-
|
| 124 |
-
## Quick Wins (Can Do Today)
|
| 125 |
-
|
| 126 |
-
1. **Add `limits` to `pyproject.toml`**
|
| 127 |
-
```toml
|
| 128 |
-
dependencies = [
|
| 129 |
-
"limits>=3.0",
|
| 130 |
-
]
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
2. **Copy OpenAlex tool from reference repo**
|
| 134 |
-
- File: `reference_repos/DeepBoner/DeepResearch/src/tools/openalex_tools.py`
|
| 135 |
-
- Adapt to our `SearchTool` base class
|
| 136 |
-
|
| 137 |
-
3. **Enable NCBI API Key**
|
| 138 |
-
- Add to `.env`: `NCBI_API_KEY=your_key`
|
| 139 |
-
- Raises the rate limit from 3/sec to 10/sec
|
| 140 |
-
|
| 141 |
-
---
|
| 142 |
-
|
| 143 |
-
## External Resources Worth Exploring
|
| 144 |
-
|
| 145 |
-
### Python Libraries
|
| 146 |
-
|
| 147 |
-
| Library | For | Notes |
|
| 148 |
-
|---------|-----|-------|
|
| 149 |
-
| `limits` | Rate limiting | Used by reference repo |
|
| 150 |
-
| `pyalex` | OpenAlex wrapper | [GitHub](https://github.com/J535D165/pyalex) |
|
| 151 |
-
| `metapub` | PubMed | Full-featured |
|
| 152 |
-
| `sentence-transformers` | Semantic search | For embeddings |
|
| 153 |
-
|
| 154 |
-
### APIs Not Yet Used
|
| 155 |
-
|
| 156 |
-
| API | Provides | Effort |
|
| 157 |
-
|-----|----------|--------|
|
| 158 |
-
| RxNorm | Drug name normalization | Low |
|
| 159 |
-
| DrugBank | Drug targets/mechanisms | Medium (license) |
|
| 160 |
-
| UniProt | Protein data | Medium |
|
| 161 |
-
| ChEMBL | Bioactivity data | Medium |
|
| 162 |
-
|
| 163 |
-
### RAG Tools (Future)
|
| 164 |
-
|
| 165 |
-
| Tool | Purpose |
|
| 166 |
-
|------|---------|
|
| 167 |
-
| [PaperQA](https://github.com/Future-House/paper-qa) | RAG for scientific papers |
|
| 168 |
-
| [txtai](https://github.com/neuml/txtai) | Embeddings + search |
|
| 169 |
-
| [PubMedBERT](https://huggingface.co/NeuML/pubmedbert-base-embeddings) | Biomedical embeddings |
|
| 170 |
-
|
| 171 |
-
---
|
| 172 |
-
|
| 173 |
-
## Files in This Directory
|
| 174 |
-
|
| 175 |
-
| File | Contents |
|
| 176 |
-
|------|----------|
|
| 177 |
-
| `00_ROADMAP_SUMMARY.md` | This file |
|
| 178 |
-
| `01_PUBMED_IMPROVEMENTS.md` | PubMed enhancement details |
|
| 179 |
-
| `02_CLINICALTRIALS_IMPROVEMENTS.md` | ClinicalTrials.gov details |
|
| 180 |
-
| `03_EUROPEPMC_IMPROVEMENTS.md` | Europe PMC details |
|
| 181 |
-
| `04_OPENALEX_INTEGRATION.md` | OpenAlex integration plan |
|
| 182 |
-
|
| 183 |
-
---
|
| 184 |
-
|
| 185 |
-
## For Future Maintainers
|
| 186 |
-
|
| 187 |
-
If you're picking this up after the hackathon:
|
| 188 |
-
|
| 189 |
-
1. **Start with OpenAlex** - biggest bang for buck
|
| 190 |
-
2. **Add rate limiting** - prevents API blocks
|
| 191 |
-
3. **Don't bother with bioRxiv** - use Europe PMC instead
|
| 192 |
-
4. **Reference repo is gold** - `reference_repos/DeepBoner/` has working implementations
|
| 193 |
-
|
| 194 |
-
Good luck!
docs/brainstorming/archive/01_PUBMED_IMPROVEMENTS.md
DELETED
|
@@ -1,125 +0,0 @@
|
|
| 1 |
-
# PubMed Tool: Current State & Future Improvements
|
| 2 |
-
|
| 3 |
-
**Status**: Currently Implemented
|
| 4 |
-
**Priority**: High (Core Data Source)
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## Current Implementation
|
| 9 |
-
|
| 10 |
-
### What We Have (`src/tools/pubmed.py`)
|
| 11 |
-
|
| 12 |
-
- Basic E-utilities search via `esearch.fcgi` and `efetch.fcgi`
|
| 13 |
-
- Query preprocessing (strips question words, expands synonyms)
|
| 14 |
-
- Returns: title, abstract, authors, journal, PMID
|
| 15 |
-
- Rate limiting: None implemented (relying on NCBI defaults)
|
| 16 |
-
|
| 17 |
-
### Current Limitations
|
| 18 |
-
|
| 19 |
-
1. **No Full-Text Access**: Only retrieves abstracts, not full paper text
|
| 20 |
-
2. **No Rate Limiting**: Risk of being blocked by NCBI
|
| 21 |
-
3. **No BioC Format**: Missing structured full-text extraction
|
| 22 |
-
4. **No Figure Retrieval**: No supplementary materials access
|
| 23 |
-
5. **No PMC Integration**: Missing open-access full-text via PMC
|
| 24 |
-
|
| 25 |
-
---
|
| 26 |
-
|
| 27 |
-
## Reference Implementation (DeepBoner Reference Repo)
|
| 28 |
-
|
| 29 |
-
The reference repo at `reference_repos/DeepBoner/DeepResearch/src/tools/bioinformatics_tools.py` has a more sophisticated implementation:
|
| 30 |
-
|
| 31 |
-
### Features We're Missing
|
| 32 |
-
|
| 33 |
-
```python
|
| 34 |
-
# Rate limiting (lines 47-50)
|
| 35 |
-
from limits import parse
|
| 36 |
-
from limits.storage import MemoryStorage
|
| 37 |
-
from limits.strategies import MovingWindowRateLimiter
|
| 38 |
-
|
| 39 |
-
storage = MemoryStorage()
|
| 40 |
-
limiter = MovingWindowRateLimiter(storage)
|
| 41 |
-
rate_limit = parse("3/second") # NCBI allows 3/sec without API key, 10/sec with
|
| 42 |
-
|
| 43 |
-
# Full-text via BioC format (lines 108-120)
|
| 44 |
-
def _get_fulltext(pmid: int) -> dict[str, Any] | None:
|
| 45 |
-
pmid_url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
|
| 46 |
-
# Returns structured JSON with full text for open-access papers
|
| 47 |
-
|
| 48 |
-
# Figure retrieval via Europe PMC (lines 123-149)
|
| 49 |
-
def _get_figures(pmcid: str) -> dict[str, str]:
|
| 50 |
-
suppl_url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles"
|
| 51 |
-
# Returns base64-encoded images from supplementary materials
|
| 52 |
-
```
|
| 53 |
-
|
| 54 |
-
---
|
| 55 |
-
|
| 56 |
-
## Recommended Improvements
|
| 57 |
-
|
| 58 |
-
### Phase 1: Rate Limiting (Critical)
|
| 59 |
-
|
| 60 |
-
```python
|
| 61 |
-
# Add to src/tools/pubmed.py
|
| 62 |
-
from limits import parse
|
| 63 |
-
from limits.storage import MemoryStorage
|
| 64 |
-
from limits.strategies import MovingWindowRateLimiter
|
| 65 |
-
|
| 66 |
-
storage = MemoryStorage()
|
| 67 |
-
limiter = MovingWindowRateLimiter(storage)
|
| 68 |
-
|
| 69 |
-
# With NCBI_API_KEY: 10/sec, without: 3/sec
|
| 70 |
-
def get_rate_limit():
|
| 71 |
-
if settings.ncbi_api_key:
|
| 72 |
-
return parse("10/second")
|
| 73 |
-
return parse("3/second")
|
| 74 |
-
```
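
The snippet above only picks a rate; it does not show how calls are held back. As a rough illustration of the moving-window idea, here is a standard-library stand-in. It is not the `limits` API: the class `SlidingWindowLimiter`, its `try_acquire` method, and the helper `pubmed_limiter` are hypothetical names invented for this sketch. The 3/sec and 10/sec figures are the ones quoted in this document.

```python
from __future__ import annotations

import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls within any `period`-second window."""

    def __init__(self, max_calls: int, period: float = 1.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._calls: deque[float] = deque()

    def try_acquire(self) -> bool:
        """Record a call and return True if the window has room, else False."""
        now = self.clock()
        # Forget timestamps that have fallen out of the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) < self.max_calls:
            self._calls.append(now)
            return True
        return False


def pubmed_limiter(api_key: str | None, clock=time.monotonic) -> SlidingWindowLimiter:
    """Hypothetical helper: 10/sec with an NCBI API key, 3/sec without."""
    return SlidingWindowLimiter(10 if api_key else 3, 1.0, clock)
```

A caller would check `try_acquire()` before each E-utilities request and sleep briefly when it returns `False`. In an async tool that wait would be `await asyncio.sleep(...)` rather than a blocking sleep.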
|
| 75 |
-
|
| 76 |
-
**Dependencies**: `pip install limits`
|
| 77 |
-
|
| 78 |
-
### Phase 2: Full-Text Retrieval
|
| 79 |
-
|
| 80 |
-
```python
|
| 81 |
-
async def get_fulltext(pmid: str) -> str | None:
|
| 82 |
-
"""Get full text for open-access papers via BioC API."""
|
| 83 |
-
url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmid}/unicode"
|
| 84 |
-
# Only works for PMC papers (open access)
|
| 85 |
-
```
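
Once the BioC JSON has been fetched, the text still has to be pulled out of it. The sketch below assumes the payload is a collection (or a list of collections) with `documents`, each holding `passages` that carry a `text` field. That shape is an assumption to verify against a real response, and `extract_passages` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import Any


def extract_passages(bioc: Any) -> list[str]:
    """Collect non-empty passage texts from a BioC-JSON-like payload."""
    # Assumption: the payload is one collection dict or a list of them.
    collections = bioc if isinstance(bioc, list) else [bioc]
    texts: list[str] = []
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                text = passage.get("text", "").strip()
                if text:
                    texts.append(text)
    return texts


# Hand-written sample in the assumed shape, not real API output.
sample = {
    "documents": [
        {"passages": [{"text": "Title"}, {"text": "  "}, {"text": "Body text."}]}
    ]
}
```

Joining the returned list with blank lines gives a plain-text version of the paper. An empty list would mean the article is not in the open-access subset or the payload shape differs from the assumption.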
|
| 86 |
-
|
| 87 |
-
### Phase 3: PMC ID Resolution
|
| 88 |
-
|
| 89 |
-
```python
|
| 90 |
-
async def get_pmc_id(pmid: str) -> str | None:
|
| 91 |
-
"""Convert PMID to PMCID for full-text access."""
|
| 92 |
-
url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/?ids={pmid}&format=json"
|
| 93 |
-
```
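
The converter's JSON reply still needs to be reduced to a single PMCID. This sketch assumes the reply has a top-level `records` list whose entries may carry a `pmcid` key. Treat that shape as an assumption, and `pmcid_from_idconv` as a hypothetical helper name.

```python
from __future__ import annotations


def pmcid_from_idconv(payload: dict) -> str | None:
    """Return the first PMCID in an ID-converter-like payload, or None."""
    # Assumption: {"records": [{"pmid": "...", "pmcid": "PMC..."}, ...]}
    for record in payload.get("records", []):
        pmcid = record.get("pmcid")
        if pmcid:
            return pmcid
    return None


# Hand-written samples in the assumed shape (IDs are made up).
with_pmc = {"records": [{"pmid": "12345", "pmcid": "PMC67890"}]}
without_pmc = {"records": [{"pmid": "12345"}]}
```

Returning `None` for records without a PMCID lets the caller fall back to the abstract instead of attempting a full-text fetch.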
|
| 94 |
-
|
| 95 |
-
---
|
| 96 |
-
|
| 97 |
-
## Python Libraries to Consider
|
| 98 |
-
|
| 99 |
-
| Library | Purpose | Notes |
|
| 100 |
-
|---------|---------|-------|
|
| 101 |
-
| [Biopython](https://biopython.org/) | `Bio.Entrez` module | Official, well-maintained |
|
| 102 |
-
| [PyMed](https://pypi.org/project/pymed/) | PubMed wrapper | Simpler API, less control |
|
| 103 |
-
| [metapub](https://pypi.org/project/metapub/) | Full-featured | Tested on 1/3 of PubMed |
|
| 104 |
-
| [limits](https://pypi.org/project/limits/) | Rate limiting | Used by reference repo |
|
| 105 |
-
|
| 106 |
-
---
|
| 107 |
-
|
| 108 |
-
## API Endpoints Reference
|
| 109 |
-
|
| 110 |
-
| Endpoint | Purpose | Rate Limit |
|
| 111 |
-
|----------|---------|------------|
|
| 112 |
-
| `esearch.fcgi` | Search for PMIDs | 3/sec (10 with key) |
|
| 113 |
-
| `efetch.fcgi` | Fetch metadata | 3/sec (10 with key) |
|
| 114 |
-
| `esummary.fcgi` | Quick metadata | 3/sec (10 with key) |
|
| 115 |
-
| `pmcoa.cgi/BioC_json` | Full text (PMC only) | Unknown |
|
| 116 |
-
| `idconv/v1.0` | PMID → PMCID | Unknown |
|
| 117 |
-
|
| 118 |
-
---
|
| 119 |
-
|
| 120 |
-
## Sources
|
| 121 |
-
|
| 122 |
-
- [PubMed E-utilities Documentation](https://www.ncbi.nlm.nih.gov/books/NBK25501/)
|
| 123 |
-
- [NCBI BioC API](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/)
|
| 124 |
-
- [Searching PubMed with Python](https://marcobonzanini.com/2015/01/12/searching-pubmed-with-python/)
|
| 125 |
-
- [PyMed on PyPI](https://pypi.org/project/pymed/)
docs/brainstorming/archive/02_CLINICALTRIALS_IMPROVEMENTS.md
DELETED
|
@@ -1,193 +0,0 @@
|
|
| 1 |
-
# ClinicalTrials.gov Tool: Current State & Future Improvements
|
| 2 |
-
|
| 3 |
-
**Status**: Currently Implemented
|
| 4 |
-
**Priority**: High (Core Data Source for Drug Repurposing)
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## Current Implementation
|
| 9 |
-
|
| 10 |
-
### What We Have (`src/tools/clinicaltrials.py`)
|
| 11 |
-
|
| 12 |
-
- V2 API search via `clinicaltrials.gov/api/v2/studies`
|
| 13 |
-
- Filters: `INTERVENTIONAL` study type, `RECRUITING` status
|
| 14 |
-
- Returns: NCT ID, title, conditions, interventions, phase, status
|
| 15 |
-
- Query preprocessing via shared `query_utils.py`
|
| 16 |
-
|
| 17 |
-
### Current Strengths
|
| 18 |
-
|
| 19 |
-
1. **Good Filtering**: Already filtering for interventional + recruiting
|
| 20 |
-
2. **V2 API**: Using the modern API (v1 deprecated)
|
| 21 |
-
3. **Phase Info**: Extracting trial phases for drug development context
|
| 22 |
-
|
| 23 |
-
### Current Limitations
|
| 24 |
-
|
| 25 |
-
1. **No Outcome Data**: Missing primary/secondary outcomes
|
| 26 |
-
2. **No Eligibility Criteria**: Missing inclusion/exclusion details
|
| 27 |
-
3. **No Sponsor Info**: Missing who's running the trial
|
| 28 |
-
4. **No Result Data**: For completed trials, no efficacy data
|
| 29 |
-
5. **Limited Drug Mapping**: No integration with drug databases
|
| 30 |
-
|
| 31 |
-
---
|
| 32 |
-
|
| 33 |
-
## API Capabilities We're Not Using
|
| 34 |
-
|
| 35 |
-
### Fields We Could Request
|
| 36 |
-
|
| 37 |
-
```python
|
| 38 |
-
# Current fields
|
| 39 |
-
fields = ["NCTId", "BriefTitle", "Condition", "InterventionName", "Phase", "OverallStatus"]
|
| 40 |
-
|
| 41 |
-
# Additional valuable fields
|
| 42 |
-
additional_fields = [
|
| 43 |
-
"PrimaryOutcomeMeasure", # What are they measuring?
|
| 44 |
-
"SecondaryOutcomeMeasure", # Secondary endpoints
|
| 45 |
-
"EligibilityCriteria", # Who can participate?
|
| 46 |
-
"LeadSponsorName", # Who's funding?
|
| 47 |
-
"ResultsFirstPostDate", # Has results?
|
| 48 |
-
"StudyFirstPostDate", # When started?
|
| 49 |
-
"CompletionDate", # When finished?
|
| 50 |
-
"EnrollmentCount", # Sample size
|
| 51 |
-
"InterventionDescription", # Drug details
|
| 52 |
-
"ArmGroupLabel", # Treatment arms
|
| 53 |
-
"InterventionOtherName", # Drug aliases
|
| 54 |
-
]
|
| 55 |
-
```
|
| 56 |
-
|
| 57 |
-
### Filter Enhancements
|
| 58 |
-
|
| 59 |
-
```python
|
| 60 |
-
# Current
|
| 61 |
-
aggFilters = "studyType:INTERVENTIONAL,status:RECRUITING"
|
| 62 |
-
|
| 63 |
-
# Could add
|
| 64 |
-
"status:RECRUITING,ACTIVE_NOT_RECRUITING,COMPLETED" # Include completed for results
|
| 65 |
-
"phase:PHASE2,PHASE3" # Only later-stage trials
|
| 66 |
-
"resultsFirstPostDateRange:2020-01-01_" # Trials with posted results
|
| 67 |
-
```
|
| 68 |
-
|
| 69 |
-
---
|
| 70 |
-
|
| 71 |
-
## Recommended Improvements
|
| 72 |
-
|
| 73 |
-
### Phase 1: Richer Metadata
|
| 74 |
-
|
| 75 |
-
```python
|
| 76 |
-
EXTENDED_FIELDS = [
|
| 77 |
-
"NCTId",
|
| 78 |
-
"BriefTitle",
|
| 79 |
-
"OfficialTitle",
|
| 80 |
-
"Condition",
|
| 81 |
-
"InterventionName",
|
| 82 |
-
"InterventionDescription",
|
| 83 |
-
"InterventionOtherName", # Drug synonyms!
|
| 84 |
-
"Phase",
|
| 85 |
-
"OverallStatus",
|
| 86 |
-
"PrimaryOutcomeMeasure",
|
| 87 |
-
"EnrollmentCount",
|
| 88 |
-
"LeadSponsorName",
|
| 89 |
-
"StudyFirstPostDate",
|
| 90 |
-
]
|
| 91 |
-
```
|
| 92 |
-
|
| 93 |
-
### Phase 2: Results Retrieval
|
| 94 |
-
|
| 95 |
-
For completed trials, we can get actual efficacy data:
|
| 96 |
-
|
| 97 |
-
```python
|
| 98 |
-
async def get_trial_results(nct_id: str) -> dict | None:
|
| 99 |
-
"""Fetch results for completed trials."""
|
| 100 |
-
url = f"https://clinicaltrials.gov/api/v2/studies/{nct_id}"
|
| 101 |
-
params = {
|
| 102 |
-
"fields": "ResultsSection",
|
| 103 |
-
}
|
| 104 |
-
# Returns outcome measures and statistics
|
| 105 |
-
```
|
| 106 |
-
|
| 107 |
-
### Phase 3: Drug Name Normalization
|
| 108 |
-
|
| 109 |
-
Map intervention names to standard identifiers:
|
| 110 |
-
|
| 111 |
-
```python
|
| 112 |
-
# Problem: "Metformin", "Metformin HCl", "Glucophage" are the same drug
|
| 113 |
-
# Solution: Use RxNorm or DrugBank for normalization
|
| 114 |
-
|
| 115 |
-
async def normalize_drug_name(intervention: str) -> str:
|
| 116 |
-
"""Normalize drug name via RxNorm API."""
|
| 117 |
-
url = f"https://rxnav.nlm.nih.gov/REST/rxcui.json?name={intervention}"
|
| 118 |
-
# Returns standardized RxCUI
|
| 119 |
-
```
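
Intervention names often contain spaces ("Metformin HCl"), so the name should be URL-encoded before it goes into the query string. The sketch below does that and reads the first identifier out of the reply. The `idGroup` / `rxnormId` response shape is an assumption to confirm against the RxNorm documentation. `rxcui_url` and `first_rxcui` are hypothetical helper names.

```python
from __future__ import annotations

from urllib.parse import quote

RXNORM_URL = "https://rxnav.nlm.nih.gov/REST/rxcui.json"  # endpoint quoted above


def rxcui_url(intervention: str) -> str:
    """Build the lookup URL with the drug name safely percent-encoded."""
    return f"{RXNORM_URL}?name={quote(intervention.strip())}"


def first_rxcui(payload: dict) -> str | None:
    """Return the first RxCUI from a reply, or None if nothing matched."""
    # Assumption: {"idGroup": {"rxnormId": ["..."]}}; the key is absent on no match.
    ids = payload.get("idGroup", {}).get("rxnormId") or []
    return ids[0] if ids else None
```

Two intervention strings that resolve to the same RxCUI can then be treated as the same drug when merging trial records.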
|
| 120 |
-
|
| 121 |
-
---
|
| 122 |
-
|
| 123 |
-
## Integration Opportunities
|
| 124 |
-
|
| 125 |
-
### With PubMed
|
| 126 |
-
|
| 127 |
-
Cross-reference trials with publications:
|
| 128 |
-
```python
|
| 129 |
-
# ClinicalTrials.gov provides PMID links
|
| 130 |
-
# Can correlate trial results with published papers
|
| 131 |
-
```
|
| 132 |
-
|
| 133 |
-
### With DrugBank/ChEMBL
|
| 134 |
-
|
| 135 |
-
Map interventions to:
|
| 136 |
-
- Mechanism of action
|
| 137 |
-
- Known targets
|
| 138 |
-
- Adverse effects
|
| 139 |
-
- Drug-drug interactions
|
| 140 |
-
|
| 141 |
-
---
|
| 142 |
-
|
| 143 |
-
## Python Libraries to Consider
|
| 144 |
-
|
| 145 |
-
| Library | Purpose | Notes |
|
| 146 |
-
|---------|---------|-------|
|
| 147 |
-
| [pytrials](https://pypi.org/project/pytrials/) | CT.gov wrapper | V2 API support unclear |
|
| 148 |
-
| [clinicaltrials](https://github.com/ebmdatalab/clinicaltrials-act-tracker) | Data tracking | More for analysis |
|
| 149 |
-
| [drugbank-downloader](https://pypi.org/project/drugbank-downloader/) | Drug mapping | Requires license |
|
| 150 |
-
|
| 151 |
-
---
|
| 152 |
-
|
| 153 |
-
## API Quirks & Gotchas
|
| 154 |
-
|
| 155 |
-
1. **Rate Limiting**: Undocumented, be conservative
|
| 156 |
-
2. **Pagination**: Max 1000 results per request
|
| 157 |
-
3. **Field Names**: Case-sensitive, camelCase
|
| 158 |
-
4. **Empty Results**: Some fields may be null even if requested
|
| 159 |
-
5. **Status Changes**: Trials change status frequently
|
| 160 |
-
|
| 161 |
-
---
|
| 162 |
-
|
| 163 |
-
## Example Enhanced Query
|
| 164 |
-
|
| 165 |
-
```python
|
| 166 |
-
async def search_drug_repurposing_trials(
|
| 167 |
-
drug_name: str,
|
| 168 |
-
condition: str,
|
| 169 |
-
include_completed: bool = True,
|
| 170 |
-
) -> list[Evidence]:
|
| 171 |
-
"""Search for trials repurposing a drug for a new condition."""
|
| 172 |
-
|
| 173 |
-
statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
|
| 174 |
-
if include_completed:
|
| 175 |
-
statuses.append("COMPLETED")
|
| 176 |
-
|
| 177 |
-
params = {
|
| 178 |
-
"query.intr": drug_name,
|
| 179 |
-
"query.cond": condition,
|
| 180 |
-
"filter.overallStatus": ",".join(statuses),
|
| 181 |
-
"filter.studyType": "INTERVENTIONAL",
|
| 182 |
-
"fields": ",".join(EXTENDED_FIELDS),
|
| 183 |
-
"pageSize": 50,
|
| 184 |
-
}
|
| 185 |
-
```
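
The function above stops after assembling `params`. As a hedged sketch of the next step, the parameters can be turned into a request URL with the standard library alone. `build_trial_query` and its defaults are hypothetical. It reuses the query and status parameters shown above and leaves out the study-type filter, whose exact parameter name is worth checking against the API documentation.

```python
from __future__ import annotations

from urllib.parse import urlencode

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"


def build_trial_query(
    drug_name: str,
    condition: str,
    include_completed: bool = True,
    fields: tuple[str, ...] = ("NCTId", "BriefTitle"),
    page_size: int = 50,
) -> str:
    """Return a search URL for trials of `drug_name` in `condition`."""
    statuses = ["RECRUITING", "ACTIVE_NOT_RECRUITING"]
    if include_completed:
        statuses.append("COMPLETED")
    params = {
        "query.intr": drug_name,
        "query.cond": condition,
        "filter.overallStatus": ",".join(statuses),
        "fields": ",".join(fields),
        "pageSize": page_size,
    }
    # urlencode percent-encodes commas and turns spaces into "+".
    return f"{BASE_URL}?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the query logic easy to unit-test without touching the network.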
|
| 186 |
-
|
| 187 |
-
---
|
| 188 |
-
|
| 189 |
-
## Sources
|
| 190 |
-
|
| 191 |
-
- [ClinicalTrials.gov API Documentation](https://clinicaltrials.gov/data-api/api)
|
| 192 |
-
- [CT.gov Field Definitions](https://clinicaltrials.gov/data-api/about-api/study-data-structure)
|
| 193 |
-
- [RxNorm API](https://lhncbc.nlm.nih.gov/RxNav/APIs/api-RxNorm.findRxcuiByString.html)
docs/brainstorming/archive/03_EUROPEPMC_IMPROVEMENTS.md
DELETED
|
@@ -1,211 +0,0 @@
|
|
| 1 |
-
# Europe PMC Tool: Current State & Future Improvements
|
| 2 |
-
|
| 3 |
-
**Status**: Currently Implemented (Replaced bioRxiv)
|
| 4 |
-
**Priority**: High (Preprint + Open Access Source)
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## Why Europe PMC Over bioRxiv?
|
| 9 |
-
|
| 10 |
-
### bioRxiv API Limitations (Why We Abandoned It)
|
| 11 |
-
|
| 12 |
-
1. **No Search API**: Only returns papers by date range or DOI
|
| 13 |
-
2. **No Query Capability**: Cannot search for "metformin cancer"
|
| 14 |
-
3. **Workaround Required**: Would need to download ALL preprints and build local search
|
| 15 |
-
4. **Known Issue**: [Gradio Issue #8861](https://github.com/gradio-app/gradio/issues/8861) documents the limitation
|
| 16 |
-
|
| 17 |
-
### Europe PMC Advantages
|
| 18 |
-
|
| 19 |
-
1. **Full Search API**: Boolean queries, filters, facets
|
| 20 |
-
2. **Aggregates bioRxiv**: Includes bioRxiv, medRxiv content anyway
|
| 21 |
-
3. **Includes PubMed**: Also has MEDLINE content
|
| 22 |
-
4. **34 Preprint Servers**: Not just bioRxiv
|
| 23 |
-
5. **Open Access Focus**: Full-text when available
|
| 24 |
-
|
| 25 |
-
---
|
| 26 |
-
|
| 27 |
-
## Current Implementation
|
| 28 |
-
|
| 29 |
-
### What We Have (`src/tools/europepmc.py`)
|
| 30 |
-
|
| 31 |
-
- REST API search via `europepmc.org/webservices/rest/search`
|
| 32 |
-
- Preprint flagging via `firstPublicationDate` heuristics
|
| 33 |
-
- Returns: title, abstract, authors, DOI, source
|
| 34 |
-
- Marks preprints for transparency
|
| 35 |
-
|
| 36 |
-
### Current Limitations
|
| 37 |
-
|
| 38 |
-
1. **No Full-Text Retrieval**: Only metadata/abstracts
|
| 39 |
-
2. **No Citation Network**: Missing references/citations
|
| 40 |
-
3. **No Supplementary Files**: Not fetching figures/data
|
| 41 |
-
4. **Basic Preprint Detection**: Heuristic, not explicit flag
|
| 42 |
-
|
| 43 |
-
---
|
| 44 |
-
|
| 45 |
-
## Europe PMC API Capabilities
|
| 46 |
-
|
| 47 |
-
### Endpoints We Could Use
|
| 48 |
-
|
| 49 |
-
| Endpoint | Purpose | Currently Using |
|
| 50 |
-
|----------|---------|-----------------|
|
| 51 |
-
| `/search` | Query papers | Yes |
|
| 52 |
-
| `/fulltext/{ID}` | Full text (XML/JSON) | No |
|
| 53 |
-
| `/{PMCID}/supplementaryFiles` | Figures, data | No |
|
| 54 |
-
| `/citations/{ID}` | Who cited this | No |
|
| 55 |
-
| `/references/{ID}` | What this cites | No |
|
| 56 |
-
| `/annotations` | Text-mined entities | No |
|
| 57 |
-
|
| 58 |
-
### Rich Query Syntax
|
| 59 |
-
|
| 60 |
-
```python
|
| 61 |
-
# Current simple query
|
| 62 |
-
query = "metformin cancer"
|
| 63 |
-
|
| 64 |
-
# Could use advanced syntax
|
| 65 |
-
query = "(TITLE:metformin OR ABSTRACT:metformin) AND (cancer OR oncology)"
|
| 66 |
-
query += " AND (SRC:PPR)" # Only preprints
|
| 67 |
-
query += " AND (FIRST_PDATE:[2023-01-01 TO 2024-12-31])" # Date range
|
| 68 |
-
query += " AND (OPEN_ACCESS:y)" # Only open access
|
| 69 |
-
```
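
Appending clauses with `+=` is easy to get wrong once filters become optional. A small builder, sketched here with the same field names used above (`SRC:PPR`, `FIRST_PDATE`, `OPEN_ACCESS`), keeps the composition in one place. `build_europepmc_query` is a hypothetical helper, not part of the existing tool.

```python
from __future__ import annotations


def build_europepmc_query(
    term: str,
    synonyms: tuple[str, ...] | list[str] = (),
    preprints_only: bool = False,
    open_access_only: bool = False,
    date_range: tuple[str, str] | None = None,
) -> str:
    """Compose a Europe PMC query string from optional filters."""
    terms = [term, *synonyms]
    # Quote multi-word terms so they are treated as phrases.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    clauses = ["(" + " OR ".join(quoted) + ")"]
    if preprints_only:
        clauses.append("(SRC:PPR)")
    if date_range:
        start, end = date_range
        clauses.append(f"(FIRST_PDATE:[{start} TO {end}])")
    if open_access_only:
        clauses.append("(OPEN_ACCESS:y)")
    return " AND ".join(clauses)
```

For example, `build_europepmc_query("cancer", ["oncology"], preprints_only=True)` yields `(cancer OR oncology) AND (SRC:PPR)`.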
|
| 70 |
-
|
| 71 |
-
### Source Filters
|
| 72 |
-
|
| 73 |
-
```python
|
| 74 |
-
# Filter by source
|
| 75 |
-
"SRC:MED" # MEDLINE
|
| 76 |
-
"SRC:PMC" # PubMed Central
|
| 77 |
-
"SRC:PPR" # Preprints (bioRxiv, medRxiv, etc.)
|
| 78 |
-
"SRC:AGR" # Agricola
|
| 79 |
-
"SRC:CBA" # Chinese Biological Abstracts
|
| 80 |
-
```
|
| 81 |
-
|
| 82 |
-
---
|
| 83 |
-
|
| 84 |
-
## Recommended Improvements
|
| 85 |
-
|
| 86 |
-
### Phase 1: Rich Metadata
|
| 87 |
-
|
| 88 |
-
```python
|
| 89 |
-
# Add to search results
|
| 90 |
-
additional_fields = [
|
| 91 |
-
"citedByCount", # Impact indicator
|
| 92 |
-
"source", # Explicit source (MED, PMC, PPR)
|
| 93 |
-
"isOpenAccess", # Boolean flag
|
| 94 |
-
"fullTextUrlList", # URLs for full text
|
| 95 |
-
"authorAffiliations", # Institution info
|
| 96 |
-
"grantsList", # Funding info
|
| 97 |
-
]
|
| 98 |
-
```
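
With the explicit `source` field in each result, preprint flagging no longer needs a date heuristic. The sketch below assumes each search result is a dict carrying a `source` code such as `MED`, `PMC`, or `PPR`, as listed in this document. `is_preprint` and `label` are hypothetical helper names.

```python
def is_preprint(result: dict) -> bool:
    """True when the result's source code marks it as a preprint."""
    return str(result.get("source", "")).upper() == "PPR"


def label(result: dict) -> str:
    """Prefix preprint titles so readers can see they are not peer-reviewed."""
    title = result.get("title", "(untitled)")
    return f"[PREPRINT] {title}" if is_preprint(result) else title
```

Results with a missing `source` are treated as non-preprints here. A stricter tool might flag them as unknown instead.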
|
| 99 |
-
|
| 100 |
-
### Phase 2: Full-Text Retrieval
|
| 101 |
-
|
| 102 |
-
```python
|
| 103 |
-
async def get_fulltext(pmcid: str) -> str | None:
|
| 104 |
-
"""Get full text for open access papers."""
|
| 105 |
-
# XML format
|
| 106 |
-
url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextXML"
|
| 107 |
-
# Or JSON
|
| 108 |
-
url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/fullTextJSON"
|
| 109 |
-
```
|
| 110 |
-
|
| 111 |
-
### Phase 3: Citation Network
|
| 112 |
-
|
| 113 |
-
```python
|
| 114 |
-
async def get_citations(pmcid: str) -> list[str]:
|
| 115 |
-
"""Get papers that cite this one."""
|
| 116 |
-
url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/citations"
|
| 117 |
-
|
| 118 |
-
async def get_references(pmcid: str) -> list[str]:
|
| 119 |
-
"""Get papers this one cites."""
|
| 120 |
-
url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/references"
|
| 121 |
-
```
|
| 122 |
-
|
| 123 |
-
### Phase 4: Text-Mined Annotations
|
| 124 |
-
|
| 125 |
-
Europe PMC extracts entities automatically:
|
| 126 |
-
|
| 127 |
-
```python
|
| 128 |
-
async def get_annotations(pmcid: str) -> dict:
|
| 129 |
-
"""Get text-mined entities (genes, diseases, drugs)."""
|
| 130 |
-
url = f"https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
|
| 131 |
-
params = {
|
| 132 |
-
"articleIds": f"PMC:{pmcid}",
|
| 133 |
-
"type": "Gene_Proteins,Diseases,Chemicals",
|
| 134 |
-
"format": "JSON",
|
| 135 |
-
}
|
| 136 |
-
# Returns structured entity mentions with positions
|
| 137 |
-
```
|
| 138 |
-
|
| 139 |
-
---
|
| 140 |
-
|
| 141 |
-
## Supplementary File Retrieval
|
| 142 |
-
|
| 143 |
-
From reference repo (`bioinformatics_tools.py` lines 123-149):
|
| 144 |
-
|
| 145 |
-
```python
|
| 146 |
-
def get_figures(pmcid: str) -> dict[str, str]:
|
| 147 |
-
"""Download figures and supplementary files."""
|
| 148 |
-
url = f"https://www.ebi.ac.uk/europepmc/webservices/rest/{pmcid}/supplementaryFiles?includeInlineImage=true"
|
| 149 |
-
# The endpoint returns a ZIP archive; the helper returns its images base64-encoded
|
| 150 |
-
```
|
| 151 |
-
|
| 152 |
-
---
|
| 153 |
-
|
| 154 |
-
## Preprint-Specific Features
|
| 155 |
-
|
| 156 |
-
### Identify Preprint Servers
|
| 157 |
-
|
| 158 |
-
```python
|
| 159 |
-
PREPRINT_SOURCES = {
|
| 160 |
-
"PPR": "General preprints",
|
| 161 |
-
"bioRxiv": "Biology preprints",
|
| 162 |
-
"medRxiv": "Medical preprints",
|
| 163 |
-
"chemRxiv": "Chemistry preprints",
|
| 164 |
-
"Research Square": "Multi-disciplinary",
|
| 165 |
-
"Preprints.org": "MDPI preprints",
|
| 166 |
-
}
|
| 167 |
-
|
| 168 |
-
# Check if published version exists
|
| 169 |
-
async def check_published_version(preprint_doi: str) -> str | None:
|
| 170 |
-
"""Check if preprint has been peer-reviewed and published."""
|
| 171 |
-
# Europe PMC links preprints to final versions
|
| 172 |
-
```
|
| 173 |
-
|
| 174 |
-
---
|
| 175 |
-
|
| 176 |
-
## Rate Limiting
|
| 177 |
-
|
| 178 |
-
Europe PMC is more generous than NCBI:
|
| 179 |
-
|
| 180 |
-
```python
|
| 181 |
-
# No documented hard limit, but be respectful
|
| 182 |
-
# Recommend: 10-20 requests/second max
|
| 183 |
-
# Use email in User-Agent for polite pool
|
| 184 |
-
headers = {
|
| 185 |
-
"User-Agent": "DeepBoner/1.0 (mailto:your@email.com)"
|
| 186 |
-
}
|
| 187 |
-
```
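
Because no hard limit is documented, a client-side throttle is one way to stay within the recommended rate. The sketch below is an illustration only: `MinIntervalThrottle` is a hypothetical name, and the clock and sleep functions are injectable purely so the behaviour can be tested without real delays.

```python
import time


class MinIntervalThrottle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the next call is allowed; return the delay applied."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay:
            self._sleep(delay)
        # Schedule the following slot relative to when this call may proceed.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return delay
```

Calling `throttle.wait()` before each request with `MinIntervalThrottle(10)` keeps a single worker at or below 10 requests per second; concurrent workers would need a shared, locked instance.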
|
| 188 |
-
|
| 189 |
-
---
|
| 190 |
-
|
| 191 |
-
## vs. The Lens & OpenAlex
|
| 192 |
-
|
| 193 |
-
| Feature | Europe PMC | The Lens | OpenAlex |
|
| 194 |
-
|---------|------------|----------|----------|
|
| 195 |
-
| Biomedical Focus | Yes | Partial | Partial |
|
| 196 |
-
| Preprints | Yes (34 servers) | Yes | Yes |
|
| 197 |
-
| Full Text | PMC papers | Links | No |
|
| 198 |
-
| Citations | Yes | Yes | Yes |
|
| 199 |
-
| Annotations | Yes (text-mined) | No | No |
|
| 200 |
-
| Rate Limits | Generous | Moderate | Very generous |
|
| 201 |
-
| API Key | Optional | Required | Optional |
|
| 202 |
-
|
| 203 |
-
---
|
| 204 |
-
|
| 205 |
-
## Sources
|
| 206 |
-
|
| 207 |
-
- [Europe PMC REST API](https://europepmc.org/RestfulWebService)
|
| 208 |
-
- [Europe PMC Annotations API](https://europepmc.org/AnnotationsApi)
|
| 209 |
-
- [Europe PMC Articles API](https://europepmc.org/ArticlesApi)
|
| 210 |
-
- [rOpenSci medrxivr](https://docs.ropensci.org/medrxivr/)
|
| 211 |
-
- [bioRxiv TDM Resources](https://www.biorxiv.org/tdm)
|
docs/brainstorming/archive/BRAINSTORM_EMBEDDINGS_META.md
DELETED
|
@@ -1,74 +0,0 @@
|
|
| 1 |
-
# Embeddings Brainstorm - Conclusions
|
| 2 |
-
|
| 3 |
-
**Date**: November 2025
|
| 4 |
-
**Status**: CLOSED - Conclusions reached, no action needed
|
| 5 |
-
|
| 6 |
-
---
|
| 7 |
-
|
| 8 |
-
## The Question
|
| 9 |
-
|
| 10 |
-
Should DeepBoner implement:
|
| 11 |
-
1. Internal codebase embeddings/ingestion pipeline?
|
| 12 |
-
2. mGREP for internal tool selection?
|
| 13 |
-
3. Self-knowledge components for agents?
|
| 14 |
-
|
| 15 |
-
## The Answer: NO
|
| 16 |
-
|
| 17 |
-
After research and first-principles analysis, the conclusion is clear:
|
| 18 |
-
|
| 19 |
-
### Why Not Internal Embeddings/Ingestion
|
| 20 |
-
|
| 21 |
-
```text
|
| 22 |
-
DeepBoner's Core Task:
|
| 23 |
-
┌─────────────────────────────────────────────────────────┐
|
| 24 |
-
│ User Query: "Evidence for testosterone in HSDD?"        │
|
| 25 |
-
│                         ↓                               │
|
| 26 |
-
│ 1. Search PubMed, ClinicalTrials, Europe PMC            │
|
| 27 |
-
│ 2. Judge: Is evidence sufficient?                       │
|
| 28 |
-
│ 3. Synthesize: Generate report                          │
|
| 29 |
-
│                         ↓                               │
|
| 30 |
-
│ Output: Research report with citations                  │
|
| 31 |
-
└─────────────────────────────────────────────────────────┘
|
| 32 |
-
|
| 33 |
-
Does ANY step require self-knowledge of codebase? NO.
|
| 34 |
-
```
|
| 35 |
-
|
| 36 |
-
### Why Not mGREP for Tool Selection
|
| 37 |
-
|
| 38 |
-
| Approach | Complexity | Accuracy |
|
| 39 |
-
|----------|------------|----------|
|
| 40 |
-
| Embeddings + mGREP for tool selection | High | Medium (semantic similarity ≠ correct tool) |
|
| 41 |
-
| Direct prompting with tool descriptions | Low | High (LLM reasons about applicability) |
|
| 42 |
-
|
| 43 |
-
**No real agent system uses embeddings for tool selection.** All major frameworks (LangChain, OpenAI, Anthropic, Magentic) use prompt-based tool selection because:
|
| 44 |
-
1. LLMs are already doing semantic matching internally
|
| 45 |
-
2. Tool count is small (5-20) - fits easily in context
|
| 46 |
-
3. Prompts allow reasoning, not just similarity
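
As an illustration of prompt-based selection, the sketch below renders a small tool list into a single prompt for the model to reason over. `ToolSpec` and `build_tool_prompt` are hypothetical names and the prompt wording is an assumption; this is not how any particular framework formats its tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """A tool name plus the description shown to the model."""

    name: str
    description: str


def build_tool_prompt(query: str, tools: list[ToolSpec]) -> str:
    """Render every tool description into one prompt for the LLM to choose from."""
    lines = ["You can call exactly one of these tools:"]
    lines += [f"- {tool.name}: {tool.description}" for tool in tools]
    lines.append(f"User query: {query}")
    lines.append("Reply with the most applicable tool name and one sentence of reasoning.")
    return "\n".join(lines)
```

With 5-20 tools the whole list fits in context, so no retrieval step is needed before the model picks one.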
|
| 47 |
-
|
| 48 |
-
### What We Already Have
|
| 49 |
-
|
| 50 |
-
DeepBoner already uses embeddings for the **right thing**: research evidence retrieval.
|
| 51 |
-
- `src/services/embeddings.py` - ChromaDB + sentence-transformers
|
| 52 |
-
- `src/services/llamaindex_rag.py` - OpenAI embeddings for premium tier
|
| 53 |
-
|
| 54 |
-
### The Real Priority
|
| 55 |
-
|
| 56 |
-
Instead of internal embeddings/mGREP, focus on:
|
| 57 |
-
1. **Deduplication** across PubMed/Europe PMC/OpenAlex
|
| 58 |
-
2. **Outcome measures** from ClinicalTrials.gov
|
| 59 |
-
3. **Citation graph traversal** via OpenAlex
|
| 60 |
-
|
| 61 |
-
See: `TOOL_ANALYSIS_CRITICAL.md` for detailed improvement roadmap.
|
| 62 |
-
|
| 63 |
-
---
|
| 64 |
-
|
| 65 |
-
## Research Sources
|
| 66 |
-
|
| 67 |
-
- [SICA Paper (ICLR 2025)](https://arxiv.org/abs/2504.15228) - Self-improving agents
|
| 68 |
-
- [Gödel Agent (ACL 2025)](https://arxiv.org/abs/2410.04444) - Recursive self-modification
|
| 69 |
-
- [Introspection Paradox (EMNLP 2025)](https://aclanthology.org/2025.emnlp-main.352/) - Self-knowledge can hurt performance
|
| 70 |
-
- [Anthropic Introspection Research](https://www.anthropic.com/research/introspection) - ~20% accuracy on genuine introspection
|
| 71 |
-
|
| 72 |
-
---
|
| 73 |
-
|
| 74 |
-
*This document is closed. The conclusion is: don't implement internal embeddings/mGREP for this use case.*
|
docs/brainstorming/archive/UI_MODE_SELECTION_UX.md
DELETED
|
@@ -1,133 +0,0 @@
|
|
| 1 |
-
# UI/UX Brainstorm: Mode Selection & API Key Experience
|
| 2 |
-
|
| 3 |
-
**Date**: 2025-11-28
|
| 4 |
-
**Status**: IMPLEMENTED (2025-11-28)
|
| 5 |
-
**Related**: Issues #52, #53, PR #58
|
| 6 |
-
|
| 7 |
-
---
|
| 8 |
-
|
| 9 |
-
## CRITICAL FINDING: Anthropic Key is Nearly Useless
|
| 10 |
-
|
| 11 |
-
**Code verification** (2025-11-28):
|
| 12 |
-
```
|
| 13 |
-
grep -r "AnthropicChatClient" src/ → NO RESULTS
|
| 14 |
-
grep -r "OpenAIChatClient" src/ → 22 RESULTS (all Magentic agents)
|
| 15 |
-
```
|
| 16 |
-
|
| 17 |
-
The `agent-framework` package (Microsoft's Magentic) **ONLY** has `OpenAIChatClient`.
|
| 18 |
-
There is no `AnthropicChatClient`. This means:
|
| 19 |
-
|
| 20 |
-
| Feature | OpenAI Key | Anthropic Key |
|
| 21 |
-
|---------|------------|---------------|
|
| 22 |
-
| Simple mode (Judge LLM) | ✅ GPT-5.1 | ✅ Claude Sonnet 4.5 |
|
| 23 |
-
| Advanced mode (Multi-agent) | ✅ Full orchestration | ❌ **DOES NOT WORK** |
|
| 24 |
-
| Value proposition | Full access | Simple mode only |
|
| 25 |
-
|
| 26 |
-
**Decision**: Keep Anthropic support for Simple mode, but ensure UX clearly differentiates capabilities.
|
| 27 |
-
|
| 28 |
-
---
|
| 29 |
-
|
| 30 |
-
## Current State (After PR #58)
|
| 31 |
-
|
| 32 |
-
### What Users See (Screenshot 2025-11-28)
|
| 33 |
-
|
| 34 |
-
```
|
| 35 |
-
┌─────────────────────────────────────────────────────────────────────────────┐
|
| 36 |
-
│ ⚡ Examples                                                                 │
|
| 37 |
-
├──────────────────────────────────────────────────────┬──────────────────────┤
|
| 38 |
-
│                                                      │ Orchestrator Mode    │
|
| 39 |
-
├──────────────────────────────────────────────────────┼──────────────────────┤
|
| 40 |
-
│ What drugs improve female libido post-menopause?     │ simple               │
|
| 41 |
-
│ Clinical trials for erectile dysfunction altern...   │ advanced             │
|
| 42 |
-
│ Evidence for testosterone therapy in women with...   │ simple               │
|
| 43 |
-
└──────────────────────────────────────────────────────┴──────────────────────┘
|
| 44 |
-
|
| 45 |
-
┌─────────────────────────────────────────────────────────────────────────────┐
|
| 46 |
-
│ ⚙️ Mode & API Key (Free tier works!)                                    [▼] │
|
| 47 |
-
├─────────────────────────────────────────────────────────────────────────────┤
|
| 48 |
-
│                                                                             │
|
| 49 |
-
│ Orchestrator Mode                                                           │
|
| 50 |
-
│ ⚡ Simple: Fast (Free/Any Key) | 🔬 Advanced: Deep Multi-Agent (OpenAI Key Only) │
|
| 51 |
-
│ [● simple] [○ advanced]                                                     │
|
| 52 |
-
│                                                                             │
|
| 53 |
-
│ 🔑 API Key (Optional)                                                       │
|
| 54 |
-
│ Leave empty for free tier. Auto-detects provider from key prefix.           │
|
| 55 |
-
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
|
| 56 |
-
│ │ sk-... (OpenAI) or sk-ant-... (Anthropic)                               │ │
|
| 57 |
-
│ └─────────────────────────────────────────────────────────────────────────┘ │
|
| 58 |
-
└─────────────────────────────────────────────────────────────────────────────┘
|
| 59 |
-
```
|
| 60 |
-
|
| 61 |
-
### Observations from Screenshot
|
| 62 |
-
|
| 63 |
-
1. **Examples table**: 2 columns (Query + Mode) - clean, one example now shows "advanced" ✅
|
| 64 |
-
2. **One example shows "advanced"**: Improves discoverability of Advanced mode ✅
|
| 65 |
-
3. **Accordion collapsed by default**: Still collapsed, but with more inviting label ✅
|
| 66 |
-
4. **Placeholder mentions Anthropic**: Correct, but now clearly tied to Simple mode only via info text ✅
|
| 67 |
-
5. **"Advanced: Requires OpenAI key"**: Now more prominent with emojis and clearer phrasing in info text ✅
|
| 68 |
-
|
| 69 |
-
### The Two Modes
|
| 70 |
-
|
| 71 |
-
| Mode | Backend | Capabilities | Requirements |
|
| 72 |
-
|------|---------|--------------|--------------|
|
| 73 |
-
| **Simple** | Linear orchestrator | Search → Judge → Report (single pass) | None (free tier) or any API key |
|
| 74 |
-
| **Advanced** | Magentic multi-agent | SearchAgent, JudgeAgent, HypothesisAgent, ReportAgent working together with iterative refinement | **OpenAI API key only** |
|
| 75 |
-
|
| 76 |
-
---
|
| 77 |
-
|
| 78 |
-
## Problems Identified (Addressed)
|
| 79 |
-
|
| 80 |
-
### P1: Advanced Mode is Hidden → ADDRESSED
|
| 81 |
-
- **Fix**: One example now shows "advanced" mode.
|
| 82 |
-
- **Fix**: Accordion label is more descriptive.
|
| 83 |
-
|
| 84 |
-
### P2: Mode/Key Relationship is Unclear → ADDRESSED
|
| 85 |
-
- **Fix**: `gr.Radio` info text clearly states "OpenAI Key Only" for Advanced mode, using emojis for emphasis.
|
| 86 |
-
|
| 87 |
-
### P3: No Incentive to Try Advanced → PARTIALLY ADDRESSED
|
| 88 |
-
- **Fix**: Emojis and "Deep Multi-Agent" hint at the value. Further marketing/documentation still needed for full "wow" moment.
|
| 89 |
-
|
| 90 |
-
### P4: Anthropic Users Left Out → ADDRESSED (Clarified)
|
| 91 |
-
- **Fix**: Anthropic keys still work for Simple mode, and the info text clarifies the limitation for Advanced mode.
|
| 92 |
-
|
| 93 |
-
---
|
| 94 |
-
|
| 95 |
-
## Options to Consider (Decision Made)
|
| 96 |
-
|
| 97 |
-
The recommendation of **Modified Option A (Better Education + Examples)** with slight modification to accordion label was implemented.
|
| 98 |
-
|
| 99 |
-
---
|
| 100 |
-
|
| 101 |
-
## Implementation Notes (Completed)
|
| 102 |
-
|
| 103 |
-
```python
|
| 104 |
-
# From src/app.py
|
| 105 |
-
examples=[
|
| 106 |
-
["What drugs improve female libido post-menopause?", "simple"],
|
| 107 |
-
["Clinical trials for erectile dysfunction alternatives to PDE5 inhibitors?", "advanced"], # Changed
|
| 108 |
-
["Evidence for testosterone therapy in women with HSDD?", "simple"],
|
| 109 |
-
],
|
| 110 |
-
|
| 111 |
-
additional_inputs_accordion=gr.Accordion(
|
| 112 |
-
label="⚙️ Mode & API Key (Free tier works!)", # Changed
|
| 113 |
-
open=False
|
| 114 |
-
),
|
| 115 |
-
|
| 116 |
-
gr.Radio(
|
| 117 |
-
choices=["simple", "advanced"],
|
| 118 |
-
value="simple",
|
| 119 |
-
label="Orchestrator Mode",
|
| 120 |
-
info=( # Changed
|
| 121 |
-
"⚡ Simple: Fast (Free/Any Key) | "
|
| 122 |
-
"🔬 Advanced: Deep Multi-Agent (OpenAI Key Only)"
|
| 123 |
-
),
|
| 124 |
-
),
|
| 125 |
-
```
|
| 126 |
-
|
| 127 |
-
---
|
| 128 |
-
|
| 129 |
-
## Decision Log
|
| 130 |
-
|
| 131 |
-
| Date | Decision | Rationale |
|
| 132 |
-
|------|----------|-----------|
|
| 133 |
-
| 2025-11-28 | Implemented Modified Option A | Minimal changes, high impact on discoverability, graceful fallback, user-approved accordion label. |
|