Lineage Graph Accelerator - User Guide

A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.


Table of Contents

  1. Getting Started
  2. Input Formats
  3. Sample Lineage Examples
  4. Export to Data Catalogs
  5. MCP Server Integration
  6. Troubleshooting
  7. FAQ

Getting Started

Quick Start (3 Steps)

  1. Open the App: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
  2. Load Sample Data: Click "Load Sample" to try pre-built examples
  3. Extract Lineage: Click "Extract Lineage" to visualize the data flow

Interface Overview

The application has four main tabs:

| Tab | Purpose |
|-----|---------|
| Text/File Metadata | Paste or upload metadata directly |
| BigQuery | Connect to Google BigQuery for schema extraction |
| URL/API | Fetch metadata from REST APIs |
| Demo Gallery | One-click demos of various lineage scenarios |

Input Formats

The Lineage Graph Accelerator supports multiple metadata formats:

1. Simple JSON (Nodes & Edges)

The simplest format with explicit nodes and edges:

```json
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"},
    {"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"}
  ]
}
```

Result: A linear graph showing raw_customers → clean_customers → analytics_customers
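To make the mapping concrete, here is a hypothetical `to_mermaid` helper (not part of the app's actual code) that turns the simple format above into Mermaid flowchart text, which is what the app renders:

```python
# Hypothetical helper (illustrative only, not the app's internal code)
# mapping the simple nodes/edges format to Mermaid flowchart text.
metadata = {
    "nodes": [
        {"id": "raw_customers", "type": "table", "name": "raw_customers"},
        {"id": "clean_customers", "type": "table", "name": "clean_customers"},
        {"id": "analytics_customers", "type": "table", "name": "analytics_customers"},
    ],
    "edges": [
        {"from": "raw_customers", "to": "clean_customers"},
        {"from": "clean_customers", "to": "analytics_customers"},
    ],
}

def to_mermaid(meta: dict) -> str:
    lines = ["flowchart LR"]
    for node in meta["nodes"]:
        lines.append(f'    {node["id"]}["{node["name"]}"]')   # node with label
    for edge in meta["edges"]:
        lines.append(f'    {edge["from"]} --> {edge["to"]}')  # directed edge
    return "\n".join(lines)

print(to_mermaid(metadata))
```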


2. dbt Manifest Format

Extract lineage from dbt's manifest.json:

```json
{
  "metadata": {
    "dbt_version": "1.7.0",
    "project_name": "my_project"
  },
  "nodes": {
    "source.my_project.raw.customers": {
      "resource_type": "source",
      "name": "customers",
      "schema": "raw"
    },
    "model.my_project.stg_customers": {
      "resource_type": "model",
      "name": "stg_customers",
      "schema": "staging",
      "depends_on": {
        "nodes": ["source.my_project.raw.customers"]
      }
    },
    "model.my_project.dim_customers": {
      "resource_type": "model",
      "name": "dim_customers",
      "schema": "marts",
      "depends_on": {
        "nodes": ["model.my_project.stg_customers"]
      }
    }
  }
}
```

Result: A graph showing the dbt model dependencies from source to staging to marts.
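The core of this extraction is reading each node's `depends_on.nodes` list. A minimal sketch, using only the fields shown in the excerpt above (not dbt's full manifest schema):

```python
# Minimal sketch: derive (upstream, downstream) edges from a dbt manifest.
# Uses only the fields shown in the excerpt above, not dbt's full schema.
manifest = {
    "nodes": {
        "source.my_project.raw.customers": {"resource_type": "source", "name": "customers"},
        "model.my_project.stg_customers": {
            "resource_type": "model", "name": "stg_customers",
            "depends_on": {"nodes": ["source.my_project.raw.customers"]},
        },
        "model.my_project.dim_customers": {
            "resource_type": "model", "name": "dim_customers",
            "depends_on": {"nodes": ["model.my_project.stg_customers"]},
        },
    }
}

def dbt_edges(manifest: dict) -> list:
    edges = []
    for node_id, node in manifest["nodes"].items():
        # Sources have no depends_on; .get() keeps them edge-free.
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, node_id))
    return edges

print(dbt_edges(manifest))
```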


3. Airflow DAG Format

Extract task dependencies from Airflow DAGs:

```json
{
  "dag_id": "etl_pipeline",
  "tasks": [
    {
      "task_id": "extract_data",
      "operator": "PythonOperator",
      "upstream_dependencies": []
    },
    {
      "task_id": "transform_data",
      "operator": "SparkSubmitOperator",
      "upstream_dependencies": ["extract_data"]
    },
    {
      "task_id": "load_data",
      "operator": "SnowflakeOperator",
      "upstream_dependencies": ["transform_data"]
    }
  ]
}
```

Result: A DAG visualization showing extract_data → transform_data → load_data
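The extraction logic for this format boils down to pairing each task with its `upstream_dependencies`. A minimal sketch:

```python
# Minimal sketch: turn the Airflow DAG format into (upstream, downstream)
# edge pairs. Field names follow the sample payload above.
dag = {
    "dag_id": "etl_pipeline",
    "tasks": [
        {"task_id": "extract_data", "upstream_dependencies": []},
        {"task_id": "transform_data", "upstream_dependencies": ["extract_data"]},
        {"task_id": "load_data", "upstream_dependencies": ["transform_data"]},
    ],
}

def dag_edges(dag: dict) -> list:
    return [
        (up, task["task_id"])
        for task in dag["tasks"]
        for up in task.get("upstream_dependencies", [])
    ]

print(dag_edges(dag))  # [('extract_data', 'transform_data'), ('transform_data', 'load_data')]
```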


4. Data Warehouse Lineage Format

For Snowflake, BigQuery, or other warehouse lineage:

```json
{
  "warehouse": {
    "platform": "Snowflake",
    "database": "ANALYTICS_DW"
  },
  "lineage": {
    "datasets": [
      {"id": "raw.customers", "type": "table", "schema": "RAW"},
      {"id": "staging.customers", "type": "view", "schema": "STAGING"},
      {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
    ],
    "relationships": [
      {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
      {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
    ]
  }
}
```
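Conceptually, this format maps straight onto the simple nodes-and-edges format: `datasets` become nodes and `relationships` become edges. A hypothetical converter (illustrative only, not the app's internal code):

```python
# Hypothetical converter: warehouse lineage format -> simple nodes/edges.
payload = {
    "warehouse": {"platform": "Snowflake", "database": "ANALYTICS_DW"},
    "lineage": {
        "datasets": [
            {"id": "raw.customers", "type": "table", "schema": "RAW"},
            {"id": "staging.customers", "type": "view", "schema": "STAGING"},
            {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"},
        ],
        "relationships": [
            {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
            {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"},
        ],
    },
}

def warehouse_to_simple(payload: dict) -> dict:
    lineage = payload["lineage"]
    return {
        "nodes": [{"id": d["id"], "type": d["type"], "name": d["id"]}
                  for d in lineage["datasets"]],
        "edges": [{"from": r["source"], "to": r["target"], "type": r.get("type", "transform")}
                  for r in lineage["relationships"]],
    }

print(warehouse_to_simple(payload)["edges"][0])
```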

5. ETL Pipeline Format

For complex multi-stage ETL pipelines:

```json
{
  "pipeline": {
    "name": "customer_analytics",
    "schedule": "daily"
  },
  "stages": [
    {
      "id": "extract",
      "steps": [
        {"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
        {"id": "ext_payments", "name": "Extract Payments", "inputs": []}
      ]
    },
    {
      "id": "transform",
      "steps": [
        {"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
      ]
    },
    {
      "id": "load",
      "steps": [
        {"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
      ]
    }
  ]
}
```
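In this format, edges come from each step's `inputs` list, regardless of which stage the referenced step lives in. A minimal sketch of the flattening:

```python
# Minimal sketch: flatten the staged ETL format into (input, step) edges.
pipeline = {
    "stages": [
        {"id": "extract", "steps": [
            {"id": "ext_crm", "inputs": []},
            {"id": "ext_payments", "inputs": []},
        ]},
        {"id": "transform", "steps": [
            {"id": "tfm_customers", "inputs": ["ext_crm", "ext_payments"]},
        ]},
        {"id": "load", "steps": [
            {"id": "load_warehouse", "inputs": ["tfm_customers"]},
        ]},
    ]
}

edges = [
    (inp, step["id"])
    for stage in pipeline["stages"]   # walk every stage...
    for step in stage["steps"]        # ...and every step in it
    for inp in step.get("inputs", []) # one edge per declared input
]
print(edges)
```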

Sample Lineage Examples

Example 1: Simple E-Commerce Lineage

Scenario: Track data flow from raw transaction data to analytics reports.

Source Systems → Raw Layer → Staging → Data Marts → Reports

Input:

```json
{
  "nodes": [
    {"id": "shopify_api", "type": "source", "name": "Shopify API"},
    {"id": "raw_orders", "type": "table", "name": "raw.orders"},
    {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
    {"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
    {"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
  ],
  "edges": [
    {"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
    {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
    {"from": "stg_orders", "to": "fct_orders", "type": "transform"},
    {"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
  ]
}
```

Expected Output: A Mermaid diagram showing the complete data flow with color-coded nodes by type.


Example 2: Multi-Source Customer 360

Scenario: Combine data from multiple sources to create a unified customer view.

CRM + Payments + Website → Identity Resolution → Customer 360

Input:

```json
{
  "nodes": [
    {"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
    {"id": "stripe", "type": "source", "name": "Stripe Payments"},
    {"id": "ga4", "type": "source", "name": "Google Analytics"},
    {"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
    {"id": "customer_360", "type": "dimension", "name": "Customer 360"}
  ],
  "edges": [
    {"from": "salesforce", "to": "identity_resolution"},
    {"from": "stripe", "to": "identity_resolution"},
    {"from": "ga4", "to": "identity_resolution"},
    {"from": "identity_resolution", "to": "customer_360"}
  ]
}
```

Example 3: dbt Project with Multiple Layers

Scenario: A complete dbt project with staging, intermediate, and mart layers.

Load the "dbt Manifest" sample from the dropdown to see a full example with:

  • 4 source tables
  • 4 staging models
  • 2 intermediate models
  • 3 mart tables
  • 2 reporting views

Example 4: Airflow ETL Pipeline

Scenario: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.

Load the "Airflow DAG" sample to see:

  • Parallel extract tasks
  • Transform tasks with dependencies
  • Load tasks to data warehouse
  • Final notification task

Export to Data Catalogs

The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.

Supported Formats

| Format | Platform | Description |
|--------|----------|-------------|
| OpenLineage | Universal | Open standard; works with Marquez, Atlan, and DataHub |
| Collibra | Collibra Data Intelligence | Enterprise data governance platform |
| Purview | Microsoft Purview | Azure-native data governance |
| Alation | Alation Data Catalog | Self-service analytics catalog |

How to Export

  1. Enter or load your metadata in the Text/File Metadata tab
  2. Extract the lineage to verify it looks correct
  3. Expand "Export to Data Catalog" accordion
  4. Select your format from the dropdown
  5. Click "Generate Export" to create the export file
  6. Copy or download the JSON output

Export Format Details

OpenLineage Export

The OpenLineage export follows the OpenLineage specification:

```json
{
  "producer": "lineage-accelerator",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
  "events": [
    {
      "eventType": "COMPLETE",
      "job": {"namespace": "...", "name": "..."},
      "inputs": [...],
      "outputs": [...]
    }
  ]
}
```
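As a rough sketch of how one `COMPLETE` run event per lineage edge can be assembled (the namespace and the job-naming convention here are illustrative assumptions, not necessarily what the exporter emits):

```python
import datetime
import uuid

# Illustrative sketch only: build one OpenLineage-style run event per edge.
# The "lineage-accelerator" namespace and "src->dst" job name are assumptions.
def edge_to_event(src: str, dst: str, namespace: str = "lineage-accelerator") -> dict:
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},                 # unique run identifier
        "job": {"namespace": namespace, "name": f"{src}->{dst}"},
        "inputs": [{"namespace": namespace, "name": src}],   # upstream dataset
        "outputs": [{"namespace": namespace, "name": dst}],  # downstream dataset
        "producer": "lineage-accelerator",
        "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    }

event = edge_to_event("raw_orders", "stg_orders")
print(event["job"]["name"])  # raw_orders->stg_orders
```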

Collibra Export

Ready for Collibra's Import API:

```json
{
  "community": {"name": "Data Lineage"},
  "domain": {"name": "Physical Data Dictionary"},
  "assets": [...],
  "relations": [...]
}
```

Microsoft Purview Export

Compatible with Purview's bulk import:

```json
{
  "collection": {"referenceName": "lineage-accelerator"},
  "entities": [...],
  "processes": [...]
}
```

Alation Export

Ready for Alation's bulk upload:

```json
{
  "datasource": {"id": 1, "title": "..."},
  "tables": [...],
  "columns": [...],
  "lineage": [...],
  "dataflows": [...]
}
```

MCP Server Integration

Connect to external MCP (Model Context Protocol) servers for enhanced processing.

What is MCP?

MCP (Model Context Protocol) is an open standard for connecting applications to external tools and data sources. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:

  • Enhanced lineage extraction with AI
  • Support for additional metadata formats
  • Custom processing pipelines

Configuration

  1. Expand "MCP Server Configuration" at the top of the app
  2. Enter the MCP Server URL: e.g., https://your-space.hf.space/mcp
  3. Add API Key (if required)
  4. Click "Test Connection" to verify

Example MCP Servers

| Server | URL | Description |
|--------|-----|-------------|
| Demo Server | http://localhost:9000/mcp | Local testing |
| HuggingFace | https://your-space.hf.space/mcp | Production deployment |

Running Your Own MCP Server

See mcp_example/server.py for a FastAPI-based MCP server example:

```bash
cd mcp_example
uvicorn server:app --reload --port 9000
```

Troubleshooting

Common Issues

"No data to display"

Cause: The input metadata couldn't be parsed.

Solutions:

  1. Verify your JSON is valid (use a JSON validator)
  2. Check that the format matches one of the supported types
  3. Try loading a sample first to see the expected format
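For step 1, Python's standard library is enough to locate the first syntax error. A small hypothetical helper (not part of the app) you can run locally:

```python
import json

def check_json(text: str) -> str:
    """Return 'OK', or point at the first JSON syntax error."""
    try:
        json.loads(text)
        return "OK"
    except json.JSONDecodeError as err:
        return f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"

print(check_json('{"nodes": []}'))   # OK
print(check_json('{"nodes": [}'))    # reports the position of the stray brace
```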

"Export functionality not available"

Cause: The exporters module isn't loaded.

Solutions:

  1. Ensure you're running the latest version
  2. Check that the exporters/ directory exists
  3. Restart the application

MCP Connection Failed

Cause: Cannot reach the MCP server.

Solutions:

  1. Verify the URL is correct
  2. Check if the server is running
  3. Ensure network/firewall allows the connection
  4. Try without the API key first

Mermaid Diagram Not Rendering

Cause: JavaScript loading issue.

Solutions:

  1. Refresh the page
  2. Try a different browser
  3. Check browser console for errors
  4. Ensure JavaScript is enabled

Error Messages

| Error | Meaning | Solution |
|-------|---------|----------|
| `JSONDecodeError` | Invalid JSON input | Fix the JSON syntax |
| `KeyError` | Missing required field | Check the input format |
| `Timeout` | MCP server slow or unreachable | Increase the timeout or check the server |

FAQ

General Questions

Q: What file formats are supported?

A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats.

Q: Can I upload files?

A: Currently, you need to paste content into the text box. File upload is planned for a future release.

Q: Is my data stored?

A: No. All processing happens in your browser session. No data is stored on servers.

Export Questions

Q: Which export format should I use?

A:

  • Use OpenLineage for universal compatibility
  • Use Collibra/Purview/Alation if you use those specific platforms

Q: Can I customize the export?

A: The current exports use default settings. Advanced customization is available through the API.

Technical Questions

Q: What's the maximum graph size?

A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.

Q: Can I use this programmatically?

A: Yes! See integration_example.py for API usage examples.

Q: Is there a rate limit?

A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.


Support

  • Issues: GitHub Issues
  • Documentation: This guide and README.md
  • Community: HuggingFace Discussions

Appendix: Complete Sample Data

E-Commerce Platform (Complex)

This sample demonstrates a complete e-commerce analytics platform with:

  • 9 source systems (Shopify, Stripe, GA4, etc.)
  • 50+ nodes across all data layers
  • 80+ lineage relationships
  • Multiple output destinations (BI tools, reverse ETL)

Load the "Complex Demo" sample to explore the full graph.

Node Types Reference

| Type | Color | Description |
|------|-------|-------------|
| `source` | Light Blue | External data sources |
| `table` | Light Green | Database tables |
| `view` | Light Purple | Database views |
| `model` | Light Orange | Transformation models |
| `report` | Light Pink | Reports and dashboards |
| `dimension` | Cyan | Dimension tables |
| `fact` | Light Yellow | Fact tables |
| `destination` | Light Red | Output destinations |

Edge Types Reference

| Type | Arrow | Description |
|------|-------|-------------|
| `transform` | `-->` | Data transformation |
| `reference` | `-.->` | Reference/lookup |
| `ingest` | `-->` | Data ingestion |
| `export` | `-->` | Data export |
| `join` | `-->` | Table join |
| `aggregate` | `-->` | Aggregation |

Last updated: November 2025 · Version: 1.0.0