Lineage Graph Accelerator - User Guide
A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.
Table of Contents
- Getting Started
- Input Formats
- Sample Lineage Examples
- Export to Data Catalogs
- MCP Server Integration
- Troubleshooting
- FAQ
Getting Started
Quick Start (3 Steps)
- Open the App: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
- Load Sample Data: Click "Load Sample" to try pre-built examples
- Extract Lineage: Click "Extract Lineage" to visualize the data flow
Interface Overview
The application has four main tabs:
| Tab | Purpose |
|---|---|
| Text/File Metadata | Paste or upload metadata directly |
| BigQuery | Connect to Google BigQuery for schema extraction |
| URL/API | Fetch metadata from REST APIs |
| Demo Gallery | One-click demos of various lineage scenarios |
Input Formats
The Lineage Graph Accelerator supports multiple metadata formats:
1. Simple JSON (Nodes & Edges)
The simplest format with explicit nodes and edges:
```json
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"},
    {"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"}
  ]
}
```
Result: A linear graph showing raw_customers → clean_customers → analytics_customers
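To see how such a payload maps onto a diagram, here is a minimal sketch of a nodes/edges-to-Mermaid converter. It is an illustration only, not the app's actual rendering code; the arrow styles follow the Edge Types Reference at the end of this guide.
```python
import json

# Illustrative only: a minimal nodes/edges -> Mermaid converter,
# not the app's internal renderer.
ARROWS = {"reference": "-.->"}  # every other edge type renders as "-->"

def to_mermaid(payload: str) -> str:
    data = json.loads(payload)
    lines = ["flowchart LR"]
    for node in data["nodes"]:
        # Node id with its display name as the label
        lines.append(f'    {node["id"]}["{node.get("name", node["id"])}"]')
    for edge in data["edges"]:
        arrow = ARROWS.get(edge.get("type", "transform"), "-->")
        lines.append(f'    {edge["from"]} {arrow} {edge["to"]}')
    return "\n".join(lines)
```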
2. dbt Manifest Format
Extract lineage from dbt's manifest.json:
```json
{
  "metadata": {
    "dbt_version": "1.7.0",
    "project_name": "my_project"
  },
  "nodes": {
    "source.my_project.raw.customers": {
      "resource_type": "source",
      "name": "customers",
      "schema": "raw"
    },
    "model.my_project.stg_customers": {
      "resource_type": "model",
      "name": "stg_customers",
      "schema": "staging",
      "depends_on": {
        "nodes": ["source.my_project.raw.customers"]
      }
    },
    "model.my_project.dim_customers": {
      "resource_type": "model",
      "name": "dim_customers",
      "schema": "marts",
      "depends_on": {
        "nodes": ["model.my_project.stg_customers"]
      }
    }
  }
}
```
Result: A graph showing the dbt model dependencies from source to staging to marts.
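A real manifest.json carries far more keys; the only parts lineage extraction needs are each node's id and its `depends_on.nodes` list. A hedged sketch of that walk, assuming the simplified structure shown above (note that a full dbt manifest also keeps sources under a separate `sources` key):
```python
def manifest_to_edges(manifest: dict) -> list[dict]:
    """Collect one edge per dependency declared in depends_on.nodes."""
    edges = []
    for node_id, node in manifest.get("nodes", {}).items():
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append({"from": upstream, "to": node_id})
    return edges
```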
3. Airflow DAG Format
Extract task dependencies from Airflow DAGs:
```json
{
  "dag_id": "etl_pipeline",
  "tasks": [
    {
      "task_id": "extract_data",
      "operator": "PythonOperator",
      "upstream_dependencies": []
    },
    {
      "task_id": "transform_data",
      "operator": "SparkSubmitOperator",
      "upstream_dependencies": ["extract_data"]
    },
    {
      "task_id": "load_data",
      "operator": "SnowflakeOperator",
      "upstream_dependencies": ["transform_data"]
    }
  ]
}
```
Result: A DAG visualization showing extract_data → transform_data → load_data
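If you are exporting from a live Airflow deployment rather than writing the JSON by hand, something like the following sketch can serialize a DAG object into this format. `dag.tasks`, `task_id`, and `upstream_task_ids` are standard Airflow attributes; the function itself is hypothetical:
```python
import json

def dag_to_lineage_json(dag) -> str:
    """Serialize an Airflow DAG into the accelerator's DAG format (a sketch)."""
    return json.dumps(
        {
            "dag_id": dag.dag_id,
            "tasks": [
                {
                    "task_id": task.task_id,
                    "operator": type(task).__name__,
                    "upstream_dependencies": sorted(task.upstream_task_ids),
                }
                for task in dag.tasks
            ],
        },
        indent=2,
    )
```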
4. Data Warehouse Lineage Format
For Snowflake, BigQuery, or other warehouse lineage:
```json
{
  "warehouse": {
    "platform": "Snowflake",
    "database": "ANALYTICS_DW"
  },
  "lineage": {
    "datasets": [
      {"id": "raw.customers", "type": "table", "schema": "RAW"},
      {"id": "staging.customers", "type": "view", "schema": "STAGING"},
      {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
    ],
    "relationships": [
      {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
      {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
    ]
  }
}
```
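Structurally this is equivalent to the simple nodes/edges format: datasets become nodes and relationships become edges. A hypothetical normalization, just to make the mapping explicit:
```python
def warehouse_to_simple(payload: dict) -> dict:
    """Map the warehouse format onto the simple nodes/edges format (a sketch)."""
    lineage = payload["lineage"]
    return {
        "nodes": [
            {"id": d["id"], "type": d["type"], "name": d["id"]}
            for d in lineage["datasets"]
        ],
        "edges": [
            {"from": r["source"], "to": r["target"], "type": r["type"]}
            for r in lineage["relationships"]
        ],
    }
```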
5. ETL Pipeline Format
For complex multi-stage ETL pipelines:
```json
{
  "pipeline": {
    "name": "customer_analytics",
    "schedule": "daily"
  },
  "stages": [
    {
      "id": "extract",
      "steps": [
        {"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
        {"id": "ext_payments", "name": "Extract Payments", "inputs": []}
      ]
    },
    {
      "id": "transform",
      "steps": [
        {"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
      ]
    },
    {
      "id": "load",
      "steps": [
        {"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
      ]
    }
  ]
}
```
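Step ids are global across stages, so the lineage falls out of each step's `inputs` list. A sketch of how the stage structure flattens into edges (illustrative, not the app's actual parser):
```python
def pipeline_to_edges(payload: dict) -> list[dict]:
    """Flatten stages/steps into edges: each input points at its step."""
    edges = []
    for stage in payload["stages"]:
        for step in stage["steps"]:
            for upstream in step["inputs"]:
                edges.append({"from": upstream, "to": step["id"]})
    return edges
```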
Sample Lineage Examples
Example 1: Simple E-Commerce Lineage
Scenario: Track data flow from raw transaction data to analytics reports.
Source Systems → Raw Layer → Staging → Data Marts → Reports
Input:
```json
{
  "nodes": [
    {"id": "shopify_api", "type": "source", "name": "Shopify API"},
    {"id": "raw_orders", "type": "table", "name": "raw.orders"},
    {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
    {"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
    {"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
  ],
  "edges": [
    {"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
    {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
    {"from": "stg_orders", "to": "fct_orders", "type": "transform"},
    {"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
  ]
}
```
Expected Output: A Mermaid diagram showing the complete data flow, with nodes color-coded by type.
Example 2: Multi-Source Customer 360
Scenario: Combine data from multiple sources to create a unified customer view.
CRM + Payments + Website → Identity Resolution → Customer 360
Input:
```json
{
  "nodes": [
    {"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
    {"id": "stripe", "type": "source", "name": "Stripe Payments"},
    {"id": "ga4", "type": "source", "name": "Google Analytics"},
    {"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
    {"id": "customer_360", "type": "dimension", "name": "Customer 360"}
  ],
  "edges": [
    {"from": "salesforce", "to": "identity_resolution"},
    {"from": "stripe", "to": "identity_resolution"},
    {"from": "ga4", "to": "identity_resolution"},
    {"from": "identity_resolution", "to": "customer_360"}
  ]
}
```
Example 3: dbt Project with Multiple Layers
Scenario: A complete dbt project with staging, intermediate, and mart layers.
Load the "dbt Manifest" sample from the dropdown to see a full example with:
- 4 source tables
- 4 staging models
- 2 intermediate models
- 3 mart tables
- 2 reporting views
Example 4: Airflow ETL Pipeline
Scenario: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.
Load the "Airflow DAG" sample to see:
- Parallel extract tasks
- Transform tasks with dependencies
- Load tasks to data warehouse
- Final notification task
Export to Data Catalogs
The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.
Supported Formats
| Format | Platform | Description |
|---|---|---|
| OpenLineage | Universal | Open standard, works with Marquez, Atlan, DataHub |
| Collibra | Collibra Data Intelligence | Enterprise data governance platform |
| Purview | Microsoft Purview | Azure native data governance |
| Alation | Alation Data Catalog | Self-service analytics catalog |
How to Export
- Enter or load your metadata in the Text/File Metadata tab
- Extract the lineage to verify it looks correct
- Expand the "Export to Data Catalog" accordion
- Select your format from the dropdown
- Click "Generate Export" to create the export file
- Copy or download the JSON output
Export Format Details
OpenLineage Export
The OpenLineage export follows the OpenLineage specification:
```json
{
  "producer": "lineage-accelerator",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
  "events": [
    {
      "eventType": "COMPLETE",
      "job": {"namespace": "...", "name": "..."},
      "inputs": [...],
      "outputs": [...]
    }
  ]
}
```
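For orientation, here is a sketch of what one fully populated event might look like when built in Python. The field names follow the OpenLineage 1-0-0 RunEvent schema; the namespace, job, and dataset values are invented for illustration:
```python
import uuid
from datetime import datetime, timezone

# Illustrative values only; field names follow the OpenLineage RunEvent schema.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "lineage-accelerator",
    "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "lineage-accelerator", "name": "stg_orders"},
    "inputs": [{"namespace": "lineage-accelerator", "name": "raw_orders"}],
    "outputs": [{"namespace": "lineage-accelerator", "name": "stg_orders"}],
}
```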
Collibra Export
Ready for Collibra's Import API:
```json
{
  "community": {"name": "Data Lineage"},
  "domain": {"name": "Physical Data Dictionary"},
  "assets": [...],
  "relations": [...]
}
```
Microsoft Purview Export
Compatible with Purview's bulk import:
```json
{
  "collection": {"referenceName": "lineage-accelerator"},
  "entities": [...],
  "processes": [...]
}
```
Alation Export
Ready for Alation's bulk upload:
```json
{
  "datasource": {"id": 1, "title": "..."},
  "tables": [...],
  "columns": [...],
  "lineage": [...],
  "dataflows": [...]
}
```
MCP Server Integration
Connect to external MCP (Model Context Protocol) servers for enhanced processing.
What is MCP?
MCP (Model Context Protocol) is an open standard for connecting AI models to external tools and data sources. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:
- Enhanced lineage extraction with AI
- Support for additional metadata formats
- Custom processing pipelines
Configuration
- Expand "MCP Server Configuration" at the top of the app
- Enter the MCP Server URL, e.g., `https://your-space.hf.space/mcp`
- Add API Key (if required)
- Click "Test Connection" to verify
Example MCP Servers
| Server | URL | Description |
|---|---|---|
| Demo Server | `http://localhost:9000/mcp` | Local testing |
| HuggingFace | `https://your-space.hf.space/mcp` | Production deployment |
Running Your Own MCP Server
See `mcp_example/server.py` for a FastAPI-based MCP server example:
```bash
cd mcp_example
uvicorn server:app --reload --port 9000
```
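If you want a starting point before reading the full example, a minimal FastAPI server with a single MCP-style endpoint looks roughly like this. It is a sketch, not the contents of `mcp_example/server.py`; the `/mcp` route and response shape are assumptions:
```python
# Minimal sketch of an MCP-style endpoint; not the actual mcp_example/server.py.
from fastapi import FastAPI

app = FastAPI()

@app.post("/mcp")
async def handle_mcp(payload: dict) -> dict:
    # A real server would dispatch on the MCP method in the payload;
    # here we just acknowledge the request so "Test Connection" succeeds.
    return {"status": "ok", "received_keys": sorted(payload.keys())}
```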
Troubleshooting
Common Issues
"No data to display"
Cause: The input metadata couldn't be parsed.
Solutions:
- Verify your JSON is valid (use a JSON validator, or the quick check below)
- Check that the format matches one of the supported types
- Try loading a sample first to see the expected format
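For the first point, a quick local alternative to a web validator is to round-trip the payload through Python's json module; a parse failure raises `json.JSONDecodeError` with the offending line and column:
```python
import json

# "metadata.json" is a placeholder path; use json.loads(...) for pasted text.
with open("metadata.json") as f:
    json.load(f)  # raises json.JSONDecodeError pointing at the bad line/column
```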
"Export functionality not available"
Cause: The exporters module isn't loaded.
Solutions:
- Ensure you're running the latest version
- Check that the `exporters/` directory exists
- Restart the application
MCP Connection Failed
Cause: Cannot reach the MCP server.
Solutions:
- Verify the URL is correct
- Check if the server is running (see the connectivity check after this list)
- Ensure network/firewall allows the connection
- Try without the API key first
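Before debugging inside the app, you can confirm the server is reachable from your machine with a direct request (this assumes the local demo server from the table above; swap in your own URL):
```python
import requests

# Assumes the local demo server is running on port 9000.
resp = requests.post("http://localhost:9000/mcp", json={"ping": True}, timeout=5)
print(resp.status_code, resp.text)
```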
Mermaid Diagram Not Rendering
Cause: JavaScript loading issue.
Solutions:
- Refresh the page
- Try a different browser
- Check browser console for errors
- Ensure JavaScript is enabled
Error Messages
| Error | Meaning | Solution |
|---|---|---|
| "JSONDecodeError" | Invalid JSON input | Fix JSON syntax |
| "KeyError" | Missing required field | Check input format |
| "Timeout" | MCP server slow/unreachable | Increase timeout or check server |
FAQ
General Questions
Q: What file formats are supported?
A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats.
Q: Can I upload files?
A: Currently, you need to paste content into the text box. File upload is planned for a future release.
Q: Is my data stored?
A: No. All processing happens in your browser session. No data is stored on servers.
Export Questions
Q: Which export format should I use?
A:
- Use OpenLineage for universal compatibility
- Use Collibra/Purview/Alation if you use those specific platforms
Q: Can I customize the export?
A: The current exports use default settings. Advanced customization is available through the API.
Technical Questions
Q: What's the maximum graph size?
A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.
Q: Can I use this programmatically?
A: Yes! See integration_example.py for API usage examples.
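As a rough sketch of what programmatic access to a Gradio Space can look like (the Space id and `api_name` below are placeholders; check `integration_example.py` or the Space's "Use via API" page for the real endpoint names):
```python
from gradio_client import Client

# Placeholder Space id and endpoint name; see the Space's "Use via API" page.
client = Client("your-org/lineage-graph-accelerator")
metadata = '{"nodes": [{"id": "a", "type": "table", "name": "a"}], "edges": []}'
result = client.predict(metadata, api_name="/extract_lineage")
print(result)
```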
Q: Is there a rate limit?
A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.
Support
- Issues: GitHub Issues
- Documentation: This guide and README.md
- Community: HuggingFace Discussions
Appendix: Complete Sample Data
E-Commerce Platform (Complex)
This sample demonstrates a complete e-commerce analytics platform with:
- 9 source systems (Shopify, Stripe, GA4, etc.)
- 50+ nodes across all data layers
- 80+ lineage relationships
- Multiple output destinations (BI tools, reverse ETL)
Load the "Complex Demo" sample to explore the full graph.
Node Types Reference
| Type | Color | Description |
|---|---|---|
| `source` | Light Blue | External data sources |
| `table` | Light Green | Database tables |
| `view` | Light Purple | Database views |
| `model` | Light Orange | Transformation models |
| `report` | Light Pink | Reports and dashboards |
| `dimension` | Cyan | Dimension tables |
| `fact` | Light Yellow | Fact tables |
| `destination` | Light Red | Output destinations |
Edge Types Reference
| Type | Arrow | Description |
|---|---|---|
| `transform` | `-->` | Data transformation |
| `reference` | `-.->` | Reference/lookup |
| `ingest` | `-->` | Data ingestion |
| `export` | `-->` | Data export |
| `join` | `-->` | Table join |
| `aggregate` | `-->` | Aggregation |
Last updated: November 2025 · Version: 1.0.0