Lineage Graph Accelerator - User Guide

A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.


Table of Contents

  1. Getting Started
  2. Input Formats
  3. Sample Lineage Examples
  4. Export to Data Catalogs
  5. MCP Server Integration
  6. Troubleshooting
  7. FAQ

Getting Started

Quick Start (3 Steps)

  1. Open the App: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
  2. Load Sample Data: Click "Load Sample" to try pre-built examples
  3. Extract Lineage: Click "Extract Lineage" to visualize the data flow

Interface Overview

The application has four main tabs:

| Tab | Purpose |
|-----|---------|
| Text/File Metadata | Paste or upload metadata directly |
| BigQuery | Connect to Google BigQuery for schema extraction |
| URL/API | Fetch metadata from REST APIs |
| Demo Gallery | One-click demos of various lineage scenarios |

Input Formats

The Lineage Graph Accelerator supports multiple metadata formats:

1. Simple JSON (Nodes & Edges)

The simplest format with explicit nodes and edges:

```json
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"},
    {"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"}
  ]
}
```

Result: A linear graph showing raw_customers → clean_customers → analytics_customers
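To make the mapping concrete, here is a hypothetical `to_mermaid` helper (not part of the app's actual code) that turns the simple format above into Mermaid flowchart text, which is what the app renders:

```python
# Hypothetical helper (illustrative only, not the app's internal code)
# mapping the simple nodes/edges format to Mermaid flowchart text.
metadata = {
    "nodes": [
        {"id": "raw_customers", "type": "table", "name": "raw_customers"},
        {"id": "clean_customers", "type": "table", "name": "clean_customers"},
        {"id": "analytics_customers", "type": "table", "name": "analytics_customers"},
    ],
    "edges": [
        {"from": "raw_customers", "to": "clean_customers"},
        {"from": "clean_customers", "to": "analytics_customers"},
    ],
}

def to_mermaid(meta: dict) -> str:
    lines = ["flowchart LR"]
    for node in meta["nodes"]:
        lines.append(f'    {node["id"]}["{node["name"]}"]')   # node with label
    for edge in meta["edges"]:
        lines.append(f'    {edge["from"]} --> {edge["to"]}')  # directed edge
    return "\n".join(lines)

print(to_mermaid(metadata))
```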


2. dbt Manifest Format

Extract lineage from dbt's manifest.json:

```json
{
  "metadata": {
    "dbt_version": "1.7.0",
    "project_name": "my_project"
  },
  "nodes": {
    "source.my_project.raw.customers": {
      "resource_type": "source",
      "name": "customers",
      "schema": "raw"
    },
    "model.my_project.stg_customers": {
      "resource_type": "model",
      "name": "stg_customers",
      "schema": "staging",
      "depends_on": {
        "nodes": ["source.my_project.raw.customers"]
      }
    },
    "model.my_project.dim_customers": {
      "resource_type": "model",
      "name": "dim_customers",
      "schema": "marts",
      "depends_on": {
        "nodes": ["model.my_project.stg_customers"]
      }
    }
  }
}
```

Result: A graph showing the dbt model dependencies from source to staging to marts.
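The core of this extraction is reading each node's `depends_on.nodes` list. A minimal sketch, using only the fields shown in the excerpt above (not dbt's full manifest schema):

```python
# Minimal sketch: derive (upstream, downstream) edges from a dbt manifest.
# Uses only the fields shown in the excerpt above, not dbt's full schema.
manifest = {
    "nodes": {
        "source.my_project.raw.customers": {"resource_type": "source", "name": "customers"},
        "model.my_project.stg_customers": {
            "resource_type": "model", "name": "stg_customers",
            "depends_on": {"nodes": ["source.my_project.raw.customers"]},
        },
        "model.my_project.dim_customers": {
            "resource_type": "model", "name": "dim_customers",
            "depends_on": {"nodes": ["model.my_project.stg_customers"]},
        },
    }
}

def dbt_edges(manifest: dict) -> list:
    edges = []
    for node_id, node in manifest["nodes"].items():
        # Sources have no depends_on; .get() keeps them edge-free.
        for upstream in node.get("depends_on", {}).get("nodes", []):
            edges.append((upstream, node_id))
    return edges

print(dbt_edges(manifest))
```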


3. Airflow DAG Format

Extract task dependencies from Airflow DAGs:

```json
{
  "dag_id": "etl_pipeline",
  "tasks": [
    {
      "task_id": "extract_data",
      "operator": "PythonOperator",
      "upstream_dependencies": []
    },
    {
      "task_id": "transform_data",
      "operator": "SparkSubmitOperator",
      "upstream_dependencies": ["extract_data"]
    },
    {
      "task_id": "load_data",
      "operator": "SnowflakeOperator",
      "upstream_dependencies": ["transform_data"]
    }
  ]
}
```

Result: A DAG visualization showing extract_data → transform_data → load_data
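The extraction logic for this format boils down to pairing each task with its `upstream_dependencies`. A minimal sketch:

```python
# Minimal sketch: turn the Airflow DAG format into (upstream, downstream)
# edge pairs. Field names follow the sample payload above.
dag = {
    "dag_id": "etl_pipeline",
    "tasks": [
        {"task_id": "extract_data", "upstream_dependencies": []},
        {"task_id": "transform_data", "upstream_dependencies": ["extract_data"]},
        {"task_id": "load_data", "upstream_dependencies": ["transform_data"]},
    ],
}

def dag_edges(dag: dict) -> list:
    return [
        (up, task["task_id"])
        for task in dag["tasks"]
        for up in task.get("upstream_dependencies", [])
    ]

print(dag_edges(dag))  # [('extract_data', 'transform_data'), ('transform_data', 'load_data')]
```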


4. Data Warehouse Lineage Format

For Snowflake, BigQuery, or other warehouse lineage:

```json
{
  "warehouse": {
    "platform": "Snowflake",
    "database": "ANALYTICS_DW"
  },
  "lineage": {
    "datasets": [
      {"id": "raw.customers", "type": "table", "schema": "RAW"},
      {"id": "staging.customers", "type": "view", "schema": "STAGING"},
      {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
    ],
    "relationships": [
      {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
      {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
    ]
  }
}
```
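Conceptually, this format maps straight onto the simple nodes-and-edges format: `datasets` become nodes and `relationships` become edges. A hypothetical converter (illustrative only, not the app's internal code):

```python
# Hypothetical converter: warehouse lineage format -> simple nodes/edges.
payload = {
    "warehouse": {"platform": "Snowflake", "database": "ANALYTICS_DW"},
    "lineage": {
        "datasets": [
            {"id": "raw.customers", "type": "table", "schema": "RAW"},
            {"id": "staging.customers", "type": "view", "schema": "STAGING"},
            {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"},
        ],
        "relationships": [
            {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
            {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"},
        ],
    },
}

def warehouse_to_simple(payload: dict) -> dict:
    lineage = payload["lineage"]
    return {
        "nodes": [{"id": d["id"], "type": d["type"], "name": d["id"]}
                  for d in lineage["datasets"]],
        "edges": [{"from": r["source"], "to": r["target"], "type": r.get("type", "transform")}
                  for r in lineage["relationships"]],
    }

print(warehouse_to_simple(payload)["edges"][0])
```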

5. ETL Pipeline Format

For complex multi-stage ETL pipelines:

```json
{
  "pipeline": {
    "name": "customer_analytics",
    "schedule": "daily"
  },
  "stages": [
    {
      "id": "extract",
      "steps": [
        {"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
        {"id": "ext_payments", "name": "Extract Payments", "inputs": []}
      ]
    },
    {
      "id": "transform",
      "steps": [
        {"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
      ]
    },
    {
      "id": "load",
      "steps": [
        {"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
      ]
    }
  ]
}
```
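In this format, edges come from each step's `inputs` list, regardless of which stage the referenced step lives in. A minimal sketch of the flattening:

```python
# Minimal sketch: flatten the staged ETL format into (input, step) edges.
pipeline = {
    "stages": [
        {"id": "extract", "steps": [
            {"id": "ext_crm", "inputs": []},
            {"id": "ext_payments", "inputs": []},
        ]},
        {"id": "transform", "steps": [
            {"id": "tfm_customers", "inputs": ["ext_crm", "ext_payments"]},
        ]},
        {"id": "load", "steps": [
            {"id": "load_warehouse", "inputs": ["tfm_customers"]},
        ]},
    ]
}

edges = [
    (inp, step["id"])
    for stage in pipeline["stages"]   # walk every stage...
    for step in stage["steps"]        # ...and every step in it
    for inp in step.get("inputs", []) # one edge per declared input
]
print(edges)
```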

Sample Lineage Examples

Example 1: Simple E-Commerce Lineage

Scenario: Track data flow from raw transaction data to analytics reports.

Source Systems → Raw Layer → Staging → Data Marts → Reports

Input:

```json
{
  "nodes": [
    {"id": "shopify_api", "type": "source", "name": "Shopify API"},
    {"id": "raw_orders", "type": "table", "name": "raw.orders"},
    {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
    {"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
    {"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
  ],
  "edges": [
    {"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
    {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
    {"from": "stg_orders", "to": "fct_orders", "type": "transform"},
    {"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
  ]
}
```

Expected Output: A Mermaid diagram showing the complete data flow with color-coded nodes by type.


Example 2: Multi-Source Customer 360

Scenario: Combine data from multiple sources to create a unified customer view.

CRM + Payments + Website → Identity Resolution → Customer 360

Input:

```json
{
  "nodes": [
    {"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
    {"id": "stripe", "type": "source", "name": "Stripe Payments"},
    {"id": "ga4", "type": "source", "name": "Google Analytics"},
    {"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
    {"id": "customer_360", "type": "dimension", "name": "Customer 360"}
  ],
  "edges": [
    {"from": "salesforce", "to": "identity_resolution"},
    {"from": "stripe", "to": "identity_resolution"},
    {"from": "ga4", "to": "identity_resolution"},
    {"from": "identity_resolution", "to": "customer_360"}
  ]
}
```

Example 3: dbt Project with Multiple Layers

Scenario: A complete dbt project with staging, intermediate, and mart layers.

Load the "dbt Manifest" sample from the dropdown to see a full example with:

  • 4 source tables
  • 4 staging models
  • 2 intermediate models
  • 3 mart tables
  • 2 reporting views

Example 4: Airflow ETL Pipeline

Scenario: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.

Load the "Airflow DAG" sample to see:

  • Parallel extract tasks
  • Transform tasks with dependencies
  • Load tasks to data warehouse
  • Final notification task

Export to Data Catalogs

The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.

Supported Formats

| Format | Platform | Description |
|--------|----------|-------------|
| OpenLineage | Universal | Open standard; works with Marquez, Atlan, and DataHub |
| Collibra | Collibra Data Intelligence | Enterprise data governance platform |
| Purview | Microsoft Purview | Azure-native data governance |
| Alation | Alation Data Catalog | Self-service analytics catalog |

How to Export

  1. Enter or load your metadata in the Text/File Metadata tab
  2. Extract the lineage to verify it looks correct
  3. Expand "Export to Data Catalog" accordion
  4. Select your format from the dropdown
  5. Click "Generate Export" to create the export file
  6. Copy or download the JSON output

Export Format Details

OpenLineage Export

The OpenLineage export follows the OpenLineage specification:

```json
{
  "producer": "lineage-accelerator",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
  "events": [
    {
      "eventType": "COMPLETE",
      "job": {"namespace": "...", "name": "..."},
      "inputs": [...],
      "outputs": [...]
    }
  ]
}
```
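As a rough sketch of how one `COMPLETE` run event per lineage edge can be assembled (the namespace and the job-naming convention here are illustrative assumptions, not necessarily what the exporter emits):

```python
import datetime
import uuid

# Illustrative sketch only: build one OpenLineage-style run event per edge.
# The "lineage-accelerator" namespace and "src->dst" job name are assumptions.
def edge_to_event(src: str, dst: str, namespace: str = "lineage-accelerator") -> dict:
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},                 # unique run identifier
        "job": {"namespace": namespace, "name": f"{src}->{dst}"},
        "inputs": [{"namespace": namespace, "name": src}],   # upstream dataset
        "outputs": [{"namespace": namespace, "name": dst}],  # downstream dataset
        "producer": "lineage-accelerator",
        "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
    }

event = edge_to_event("raw_orders", "stg_orders")
print(event["job"]["name"])  # raw_orders->stg_orders
```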

Collibra Export

Ready for Collibra's Import API:

```json
{
  "community": {"name": "Data Lineage"},
  "domain": {"name": "Physical Data Dictionary"},
  "assets": [...],
  "relations": [...]
}
```

Microsoft Purview Export

Compatible with Purview's bulk import:

```json
{
  "collection": {"referenceName": "lineage-accelerator"},
  "entities": [...],
  "processes": [...]
}
```

Alation Export

Ready for Alation's bulk upload:

```json
{
  "datasource": {"id": 1, "title": "..."},
  "tables": [...],
  "columns": [...],
  "lineage": [...],
  "dataflows": [...]
}
```

MCP Server Integration

Connect to external MCP (Model Context Protocol) servers for enhanced processing.

What is MCP?

MCP (Model Context Protocol) is an open standard for connecting applications to external tools and data sources. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:

  • Enhanced lineage extraction with AI
  • Support for additional metadata formats
  • Custom processing pipelines

Configuration

  1. Expand "MCP Server Configuration" at the top of the app
  2. Enter the MCP Server URL: e.g., https://your-space.hf.space/mcp
  3. Add API Key (if required)
  4. Click "Test Connection" to verify

Example MCP Servers

| Server | URL | Description |
|--------|-----|-------------|
| Demo Server | http://localhost:9000/mcp | Local testing |
| HuggingFace | https://your-space.hf.space/mcp | Production deployment |

Running Your Own MCP Server

See mcp_example/server.py for a FastAPI-based MCP server example:

```bash
cd mcp_example
uvicorn server:app --reload --port 9000
```

Troubleshooting

Common Issues

"No data to display"

Cause: The input metadata couldn't be parsed.

Solutions:

  1. Verify your JSON is valid (use a JSON validator)
  2. Check that the format matches one of the supported types
  3. Try loading a sample first to see the expected format
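For step 1, Python's standard library is enough to locate the first syntax error. A small hypothetical helper (not part of the app) you can run locally:

```python
import json

def check_json(text: str) -> str:
    """Return 'OK', or point at the first JSON syntax error."""
    try:
        json.loads(text)
        return "OK"
    except json.JSONDecodeError as err:
        return f"Invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"

print(check_json('{"nodes": []}'))   # OK
print(check_json('{"nodes": [}'))    # reports the position of the stray brace
```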

"Export functionality not available"

Cause: The exporters module isn't loaded.

Solutions:

  1. Ensure you're running the latest version
  2. Check that the exporters/ directory exists
  3. Restart the application

MCP Connection Failed

Cause: Cannot reach the MCP server.

Solutions:

  1. Verify the URL is correct
  2. Check if the server is running
  3. Ensure network/firewall allows the connection
  4. Try without the API key first

Mermaid Diagram Not Rendering

Cause: JavaScript loading issue.

Solutions:

  1. Refresh the page
  2. Try a different browser
  3. Check browser console for errors
  4. Ensure JavaScript is enabled

Error Messages

| Error | Meaning | Solution |
|-------|---------|----------|
| `JSONDecodeError` | Invalid JSON input | Fix the JSON syntax |
| `KeyError` | Missing required field | Check the input format |
| `Timeout` | MCP server slow or unreachable | Increase the timeout or check the server |

FAQ

General Questions

Q: What file formats are supported?

A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats.

Q: Can I upload files?

A: Currently, you need to paste content into the text box. File upload is planned for a future release.

Q: Is my data stored?

A: No. All processing happens in your browser session. No data is stored on servers.

Export Questions

Q: Which export format should I use?

A:

  • Use OpenLineage for universal compatibility
  • Use Collibra/Purview/Alation if you use those specific platforms

Q: Can I customize the export?

A: The current exports use default settings. Advanced customization is available through the API.

Technical Questions

Q: What's the maximum graph size?

A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.

Q: Can I use this programmatically?

A: Yes! See integration_example.py for API usage examples.

Q: Is there a rate limit?

A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.


Support

  • Issues: GitHub Issues
  • Documentation: This guide and README.md
  • Community: HuggingFace Discussions

Appendix: Complete Sample Data

E-Commerce Platform (Complex)

This sample demonstrates a complete e-commerce analytics platform with:

  • 9 source systems (Shopify, Stripe, GA4, etc.)
  • 50+ nodes across all data layers
  • 80+ lineage relationships
  • Multiple output destinations (BI tools, reverse ETL)

Load the "Complex Demo" sample to explore the full graph.

Node Types Reference

| Type | Color | Description |
|------|-------|-------------|
| `source` | Light Blue | External data sources |
| `table` | Light Green | Database tables |
| `view` | Light Purple | Database views |
| `model` | Light Orange | Transformation models |
| `report` | Light Pink | Reports and dashboards |
| `dimension` | Cyan | Dimension tables |
| `fact` | Light Yellow | Fact tables |
| `destination` | Light Red | Output destinations |

Edge Types Reference

| Type | Arrow | Description |
|------|-------|-------------|
| `transform` | `-->` | Data transformation |
| `reference` | `-.->` | Reference/lookup |
| `ingest` | `-->` | Data ingestion |
| `export` | `-->` | Data export |
| `join` | `-->` | Table join |
| `aggregate` | `-->` | Aggregation |

Last updated: November 2025 · Version: 1.0.0