# Lineage Graph Accelerator - User Guide

A comprehensive guide to using the Lineage Graph Accelerator for extracting, visualizing, and exporting data lineage from your data platforms.

---

## Table of Contents

1. [Getting Started](#getting-started)
2. [Input Formats](#input-formats)
3. [Sample Lineage Examples](#sample-lineage-examples)
4. [Export to Data Catalogs](#export-to-data-catalogs)
5. [MCP Server Integration](#mcp-server-integration)
6. [Troubleshooting](#troubleshooting)
7. [FAQ](#faq)

---

## Getting Started

### Quick Start (3 Steps)

1. **Open the App**: Navigate to the Lineage Graph Accelerator on HuggingFace Spaces
2. **Load Sample Data**: Click "Load Sample" to try pre-built examples
3. **Extract Lineage**: Click "Extract Lineage" to visualize the data flow

### Interface Overview

The application has four main tabs:

| Tab | Purpose |
|-----|---------|
| **Text/File Metadata** | Paste or upload metadata directly |
| **BigQuery** | Connect to Google BigQuery for schema extraction |
| **URL/API** | Fetch metadata from REST APIs |
| **Demo Gallery** | One-click demos of various lineage scenarios |

---

## Input Formats

The Lineage Graph Accelerator supports multiple metadata formats:

### 1. Simple JSON (Nodes & Edges)

The simplest format with explicit nodes and edges:

```json
{
  "nodes": [
    {"id": "raw_customers", "type": "table", "name": "raw_customers"},
    {"id": "clean_customers", "type": "table", "name": "clean_customers"},
    {"id": "analytics_customers", "type": "table", "name": "analytics_customers"}
  ],
  "edges": [
    {"from": "raw_customers", "to": "clean_customers"},
    {"from": "clean_customers", "to": "analytics_customers"}
  ]
}
```

**Result**: A linear graph showing `raw_customers → clean_customers → analytics_customers`

---

### 2. dbt Manifest Format

Extract lineage from dbt's `manifest.json`:

```json
{
  "metadata": {
    "dbt_version": "1.7.0",
    "project_name": "my_project"
  },
  "nodes": {
    "source.my_project.raw.customers": {
      "resource_type": "source",
      "name": "customers",
      "schema": "raw"
    },
    "model.my_project.stg_customers": {
      "resource_type": "model",
      "name": "stg_customers",
      "schema": "staging",
      "depends_on": {
        "nodes": ["source.my_project.raw.customers"]
      }
    },
    "model.my_project.dim_customers": {
      "resource_type": "model",
      "name": "dim_customers",
      "schema": "marts",
      "depends_on": {
        "nodes": ["model.my_project.stg_customers"]
      }
    }
  }
}
```

**Result**: A graph showing the dbt model dependencies from source to staging to marts.

---

### 3. Airflow DAG Format

Extract task dependencies from Airflow DAGs:

```json
{
  "dag_id": "etl_pipeline",
  "tasks": [
    {
      "task_id": "extract_data",
      "operator": "PythonOperator",
      "upstream_dependencies": []
    },
    {
      "task_id": "transform_data",
      "operator": "SparkSubmitOperator",
      "upstream_dependencies": ["extract_data"]
    },
    {
      "task_id": "load_data",
      "operator": "SnowflakeOperator",
      "upstream_dependencies": ["transform_data"]
    }
  ]
}
```

**Result**: A DAG visualization showing `extract_data → transform_data → load_data`

---
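If you are unsure how a richer format relates to the simple nodes-and-edges structure from format 1, the sketch below converts the Airflow DAG example above into it. This is purely illustrative: the function name, the `"task"` type label, and the hypothetical input file name are assumptions for the example, not the app's internal logic or API.

```python
import json


def airflow_dag_to_simple(dag: dict) -> dict:
    """Map an Airflow DAG export onto the simple nodes/edges format (illustrative)."""
    nodes = [
        # "task" is an illustrative type label, not one of the app's documented node types
        {"id": t["task_id"], "type": "task", "name": t["task_id"]}
        for t in dag["tasks"]
    ]
    edges = [
        {"from": upstream, "to": t["task_id"]}
        for t in dag["tasks"]
        for upstream in t.get("upstream_dependencies", [])
    ]
    return {"nodes": nodes, "edges": edges}


if __name__ == "__main__":
    # "etl_pipeline_dag.json" is a hypothetical file holding the DAG example above
    with open("etl_pipeline_dag.json") as f:
        print(json.dumps(airflow_dag_to_simple(json.load(f)), indent=2))
```

For the example DAG above, this yields three nodes and two edges, matching the `extract_data → transform_data → load_data` chain.

---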
### 4. Data Warehouse Lineage Format

For Snowflake, BigQuery, or other warehouse lineage:

```json
{
  "warehouse": {
    "platform": "Snowflake",
    "database": "ANALYTICS_DW"
  },
  "lineage": {
    "datasets": [
      {"id": "raw.customers", "type": "table", "schema": "RAW"},
      {"id": "staging.customers", "type": "view", "schema": "STAGING"},
      {"id": "marts.dim_customer", "type": "table", "schema": "MARTS"}
    ],
    "relationships": [
      {"source": "raw.customers", "target": "staging.customers", "type": "transform"},
      {"source": "staging.customers", "target": "marts.dim_customer", "type": "transform"}
    ]
  }
}
```

---

### 5. ETL Pipeline Format

For complex multi-stage ETL pipelines:

```json
{
  "pipeline": {
    "name": "customer_analytics",
    "schedule": "daily"
  },
  "stages": [
    {
      "id": "extract",
      "steps": [
        {"id": "ext_crm", "name": "Extract CRM Data", "inputs": []},
        {"id": "ext_payments", "name": "Extract Payments", "inputs": []}
      ]
    },
    {
      "id": "transform",
      "steps": [
        {"id": "tfm_customers", "name": "Transform Customers", "inputs": ["ext_crm", "ext_payments"]}
      ]
    },
    {
      "id": "load",
      "steps": [
        {"id": "load_warehouse", "name": "Load to Warehouse", "inputs": ["tfm_customers"]}
      ]
    }
  ]
}
```

---

## Sample Lineage Examples

### Example 1: Simple E-Commerce Lineage

**Scenario**: Track data flow from raw transaction data to analytics reports.

```
Source Systems → Raw Layer → Staging → Data Marts → Reports
```

**Input**:

```json
{
  "nodes": [
    {"id": "shopify_api", "type": "source", "name": "Shopify API"},
    {"id": "raw_orders", "type": "table", "name": "raw.orders"},
    {"id": "stg_orders", "type": "model", "name": "staging.stg_orders"},
    {"id": "fct_orders", "type": "fact", "name": "marts.fct_orders"},
    {"id": "rpt_daily_sales", "type": "report", "name": "Daily Sales Report"}
  ],
  "edges": [
    {"from": "shopify_api", "to": "raw_orders", "type": "ingest"},
    {"from": "raw_orders", "to": "stg_orders", "type": "transform"},
    {"from": "stg_orders", "to": "fct_orders", "type": "transform"},
    {"from": "fct_orders", "to": "rpt_daily_sales", "type": "aggregate"}
  ]
}
```

**Expected Output**: A Mermaid diagram showing the complete data flow with color-coded nodes by type.

---
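To get a feel for that output, the sketch below builds a plain Mermaid flowchart from the same nodes-and-edges JSON. It is only an approximation of what the app renders: the actual renderer also applies the per-type colors and edge styles listed in the appendix, and the function name here is illustrative, not part of the app's API.

```python
def to_mermaid(lineage: dict) -> str:
    """Build a rough Mermaid flowchart from nodes/edges JSON (approximation of the app's output)."""
    lines = ["graph LR"]
    for node in lineage["nodes"]:
        # Square brackets draw a rectangular node labelled with its human-readable name.
        lines.append(f'    {node["id"]}["{node.get("name", node["id"])}"]')
    for edge in lineage["edges"]:
        # Every edge is drawn as a plain arrow here; the app varies arrow style by edge type.
        lines.append(f'    {edge["from"]} --> {edge["to"]}')
    return "\n".join(lines)


# Minimal usage with two nodes and one edge:
print(to_mermaid({
    "nodes": [{"id": "a", "name": "raw.orders"}, {"id": "b", "name": "stg_orders"}],
    "edges": [{"from": "a", "to": "b"}],
}))
```

Running Example 1's input through this produces a five-node, left-to-right chain from `shopify_api` through to `rpt_daily_sales`.

---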
### Example 2: Multi-Source Customer 360

**Scenario**: Combine data from multiple sources to create a unified customer view.

```
CRM + Payments + Website → Identity Resolution → Customer 360
```

**Input**:

```json
{
  "nodes": [
    {"id": "salesforce", "type": "source", "name": "Salesforce CRM"},
    {"id": "stripe", "type": "source", "name": "Stripe Payments"},
    {"id": "ga4", "type": "source", "name": "Google Analytics"},
    {"id": "identity_resolution", "type": "model", "name": "Identity Resolution"},
    {"id": "customer_360", "type": "dimension", "name": "Customer 360"}
  ],
  "edges": [
    {"from": "salesforce", "to": "identity_resolution"},
    {"from": "stripe", "to": "identity_resolution"},
    {"from": "ga4", "to": "identity_resolution"},
    {"from": "identity_resolution", "to": "customer_360"}
  ]
}
```

---

### Example 3: dbt Project with Multiple Layers

**Scenario**: A complete dbt project with staging, intermediate, and mart layers.

Load the "dbt Manifest" sample from the dropdown to see a full example with:

- 4 source tables
- 4 staging models
- 2 intermediate models
- 3 mart tables
- 2 reporting views

---

### Example 4: Airflow ETL Pipeline

**Scenario**: A daily ETL pipeline with parallel extraction, sequential transformation, and loading.

Load the "Airflow DAG" sample to see:

- Parallel extract tasks
- Transform tasks with dependencies
- Load tasks to data warehouse
- Final notification task

---

## Export to Data Catalogs

The Lineage Graph Accelerator can export lineage to major enterprise data catalogs.

### Supported Formats

| Format | Platform | Description |
|--------|----------|-------------|
| **OpenLineage** | Universal | Open standard; works with Marquez, Atlan, DataHub |
| **Collibra** | Collibra Data Intelligence | Enterprise data governance platform |
| **Purview** | Microsoft Purview | Azure-native data governance |
| **Alation** | Alation Data Catalog | Self-service analytics catalog |

### How to Export

1. **Enter or load your metadata** in the Text/File Metadata tab
2. **Extract the lineage** to verify it looks correct
3. **Expand the "Export to Data Catalog"** accordion
4. **Select your format** from the dropdown
5. **Click "Generate Export"** to create the export file
6. **Copy or download** the JSON output

### Export Format Details

#### OpenLineage Export

The OpenLineage export follows the [OpenLineage specification](https://openlineage.io/):

```json
{
  "producer": "lineage-accelerator",
  "schemaURL": "https://openlineage.io/spec/1-0-0/OpenLineage.json",
  "events": [
    {
      "eventType": "COMPLETE",
      "job": {"namespace": "...", "name": "..."},
      "inputs": [...],
      "outputs": [...]
    }
  ]
}
```

#### Collibra Export

Ready for Collibra's Import API:

```json
{
  "community": {"name": "Data Lineage"},
  "domain": {"name": "Physical Data Dictionary"},
  "assets": [...],
  "relations": [...]
}
```

#### Microsoft Purview Export

Compatible with Purview's bulk import:

```json
{
  "collection": {"referenceName": "lineage-accelerator"},
  "entities": [...],
  "processes": [...]
}
```

#### Alation Export

Ready for Alation's bulk upload:

```json
{
  "datasource": {"id": 1, "title": "..."},
  "tables": [...],
  "columns": [...],
  "lineage": [...],
  "dataflows": [...]
}
```

---

## MCP Server Integration

Connect to external MCP (Model Context Protocol) servers for enhanced processing.

### What is MCP?

MCP (Model Context Protocol) is an open standard for connecting applications to AI models and external tools. The Lineage Graph Accelerator can connect to MCP servers hosted on HuggingFace Spaces for:

- Enhanced lineage extraction with AI
- Support for additional metadata formats
- Custom processing pipelines

### Configuration

1. **Expand "MCP Server Configuration"** at the top of the app
2. **Enter the MCP Server URL**: e.g., `https://your-space.hf.space/mcp`
3. **Add API Key** (if required)
4. **Click "Test Connection"** to verify

### Example MCP Servers

| Server | URL | Description |
|--------|-----|-------------|
| Demo Server | `http://localhost:9000/mcp` | Local testing |
| HuggingFace | `https://your-space.hf.space/mcp` | Production deployment |

### Running Your Own MCP Server

See `mcp_example/server.py` for a FastAPI-based MCP server example:

```bash
cd mcp_example
uvicorn server:app --reload --port 9000
```

---
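The bundled example is not reproduced in this guide, but a minimal FastAPI service in the same spirit might look like the sketch below. Treat the `/mcp` route and the request/response schema as assumptions made for illustration; the actual `mcp_example/server.py` may define different endpoints and payloads.

```python
# Minimal sketch of an MCP-style lineage service (illustrative only --
# the bundled mcp_example/server.py may use different routes and schemas).
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Lineage MCP demo server")


class ExtractRequest(BaseModel):
    metadata: str                      # raw metadata text sent by the app (assumed field)
    format_hint: Optional[str] = None  # e.g. "dbt" or "airflow" (assumed field)


@app.get("/mcp")
def health() -> dict:
    # Gives the app's "Test Connection" button something to hit.
    return {"status": "ok"}


@app.post("/mcp")
def extract(req: ExtractRequest) -> dict:
    # A real server would run enhanced extraction here; this stub just returns
    # an empty graph in the simple nodes/edges format.
    return {"nodes": [], "edges": [], "format_hint": req.format_hint}
```

Run it with the `uvicorn` command shown above and point the MCP Server URL at `http://localhost:9000/mcp`.

---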
## Troubleshooting

### Common Issues

#### "No data to display"

**Cause**: The input metadata couldn't be parsed.

**Solutions**:

1. Verify your JSON is valid (use a JSON validator)
2. Check that the format matches one of the supported types
3. Try loading a sample first to see the expected format

#### "Export functionality not available"

**Cause**: The exporters module isn't loaded.

**Solutions**:

1. Ensure you're running the latest version
2. Check that the `exporters/` directory exists
3. Restart the application

#### MCP Connection Failed

**Cause**: Cannot reach the MCP server.

**Solutions**:

1. Verify the URL is correct
2. Check if the server is running
3. Ensure network/firewall allows the connection
4. Try without the API key first

#### Mermaid Diagram Not Rendering

**Cause**: JavaScript loading issue.

**Solutions**:

1. Refresh the page
2. Try a different browser
3. Check browser console for errors
4. Ensure JavaScript is enabled

### Error Messages

| Error | Meaning | Solution |
|-------|---------|----------|
| "JSONDecodeError" | Invalid JSON input | Fix JSON syntax |
| "KeyError" | Missing required field | Check input format |
| "Timeout" | MCP server slow/unreachable | Increase timeout or check server |

---

## FAQ

### General Questions

**Q: What file formats are supported?**

A: JSON is the primary format. We also support SQL DDL (with limitations) and can parse dbt manifests, Airflow DAGs, and custom formats.

**Q: Can I upload files?**

A: Currently, you need to paste content into the text box. File upload is planned for a future release.

**Q: Is my data stored?**

A: No. All processing happens in your browser session. No data is stored on servers.

### Export Questions

**Q: Which export format should I use?**

A:
- Use **OpenLineage** for universal compatibility
- Use **Collibra/Purview/Alation** if you use those specific platforms

**Q: Can I customize the export?**

A: The current exports use default settings. Advanced customization is available through the API.

### Technical Questions

**Q: What's the maximum graph size?**

A: The UI handles graphs up to ~500 nodes smoothly. Larger graphs may be slow to render.

**Q: Can I use this programmatically?**

A: Yes! See `integration_example.py` for API usage examples.

**Q: Is there a rate limit?**

A: The HuggingFace Space has standard rate limits. For heavy usage, deploy your own instance.

---

## Support

- **Issues**: [GitHub Issues](https://github.com/your-repo/issues)
- **Documentation**: This guide and README.md
- **Community**: HuggingFace Discussions

---

## Appendix: Complete Sample Data

### E-Commerce Platform (Complex)

This sample demonstrates a complete e-commerce analytics platform with:

- 9 source systems (Shopify, Stripe, GA4, etc.)
- 50+ nodes across all data layers
- 80+ lineage relationships
- Multiple output destinations (BI tools, reverse ETL)

Load the "Complex Demo" sample to explore the full graph.

### Node Types Reference

| Type | Color | Description |
|------|-------|-------------|
| `source` | Light Blue | External data sources |
| `table` | Light Green | Database tables |
| `view` | Light Purple | Database views |
| `model` | Light Orange | Transformation models |
| `report` | Light Pink | Reports and dashboards |
| `dimension` | Cyan | Dimension tables |
| `fact` | Light Yellow | Fact tables |
| `destination` | Light Red | Output destinations |

### Edge Types Reference

| Type | Arrow | Description |
|------|-------|-------------|
| `transform` | `-->` | Data transformation |
| `reference` | `-.->` | Reference/lookup |
| `ingest` | `-->` | Data ingestion |
| `export` | `-->` | Data export |
| `join` | `-->` | Table join |
| `aggregate` | `-->` | Aggregation |

---

*Last updated: November 2025*
*Version: 1.0.0*