title: Lineage Graph Accelerator
emoji: 🔥
colorFrom: purple
colorTo: blue
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: true
license: mit
short_description: AI data lineage extraction & export to data catalogs
tags:
  - data-lineage
  - mcp
  - gradio
  - data-governance
  - dbt
  - airflow
  - etl
  - mcp-in-action-track-productivity
  - hackathon

Lineage Graph Accelerator 🔥

AI-powered data lineage extraction and visualization for modern data platforms

HuggingFace Space License: MIT Gradio

🎉 Built for the Gradio Agents & MCP Hackathon - Winter 2025 🎉

Celebrating MCP's 1st Birthday! This project demonstrates the power of MCP integration for enterprise data governance.


🌟 What is Lineage Graph Accelerator?

Lineage Graph Accelerator is an AI-powered tool that helps data teams:

  • Extract data lineage from dbt, Airflow, BigQuery, Snowflake, and more
  • Visualize complex data dependencies with interactive Mermaid diagrams
  • Export lineage to enterprise data catalogs (Collibra, Microsoft Purview, Alation)
  • Integrate with MCP servers for enhanced AI-powered processing

Why Data Lineage Matters

Understanding where your data comes from and where it goes is critical for:

  • Data Quality: Track data transformations and identify issues
  • Compliance: Document data flows for GDPR, CCPA, and other regulations
  • Impact Analysis: Understand downstream effects of schema changes
  • Data Discovery: Help analysts find and trust data assets

🎯 Key Features

Multi-Source Support

| Source | Status | Description |
|---|---|---|
| dbt Manifest | ✅ | Parse dbt's manifest.json for model dependencies |
| Airflow DAG | ✅ | Extract task dependencies from DAG definitions |
| SQL DDL | ✅ | Parse CREATE statements for table lineage |
| BigQuery | ✅ | Query INFORMATION_SCHEMA for metadata |
| Custom JSON | ✅ | Flexible node/edge format for any source |
| Snowflake | 🔄 | Coming via MCP integration |
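
As an illustration of the dbt row, the sketch below reads model dependencies out of a parsed `manifest.json`. It relies on the documented manifest layout, where each entry under the top-level `nodes` map lists its parents in `depends_on.nodes`. The function name `dbt_edges` and the sample project `shop` are hypothetical, and this is not necessarily how the app's own parser works.

```python
def dbt_edges(manifest):
    """Return from/to edges for every dependency in a parsed dbt manifest."""
    edges = []
    for unique_id, node in manifest.get("nodes", {}).items():
        # depends_on.nodes holds the unique_ids of upstream models and sources.
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append({"from": parent, "to": unique_id})
    return edges


# Hypothetical, heavily trimmed manifest for illustration only.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {
            "depends_on": {"nodes": ["source.shop.raw.orders"]}
        },
        "model.shop.fct_orders": {
            "depends_on": {"nodes": ["model.shop.stg_orders"]}
        },
    }
}

for edge in dbt_edges(manifest):
    print(edge["from"], "->", edge["to"])
```

Parents that are dbt sources are defined under the manifest's separate `sources` map, so a complete parser would need to read that map as well to label them.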

Export to Data Catalogs

| Catalog | Status | Format |
|---|---|---|
| OpenLineage | ✅ | Universal open standard |
| Collibra | ✅ | Data Intelligence Platform |
| Microsoft Purview | ✅ | Azure Data Governance |
| Alation | ✅ | Data Catalog |
| Apache Atlas | 🔄 | Coming soon |

Visualization Options

  • Mermaid Diagrams: Interactive, client-side rendering
  • Subgraph Grouping: Organize by data layer (raw, staging, marts)
  • Color-Coded Nodes: Distinguish sources, tables, models, reports
  • Edge Labels: Show transformation types
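
The options above map naturally onto Mermaid flowchart syntax: `subgraph` blocks for layers, `classDef` for colors, and `-->|label|` for edge labels. The following is a rough sketch of such a generator, not the app's actual one. The `layer` and `label` fields, the function name `to_mermaid`, and the colors are assumptions made for illustration.

```python
def to_mermaid(nodes, edges):
    """Render nodes/edges as a Mermaid flowchart grouped by data layer."""
    lines = ["flowchart TD"]

    # Group nodes by their (assumed) "layer" field.
    layers = {}
    for node in nodes:
        layers.setdefault(node.get("layer", "other"), []).append(node)

    for layer, members in layers.items():
        # Prefix the subgraph id so it can never collide with a node id.
        lines.append(f'    subgraph layer_{layer}["{layer}"]')
        for node in members:
            lines.append(f'        {node["id"]}["{node["name"]}"]:::{node["type"]}')
        lines.append("    end")

    for edge in edges:
        label = f'|{edge["label"]}|' if edge.get("label") else ""
        lines.append(f'    {edge["from"]} -->{label} {edge["to"]}')

    # One style class per node type; colors are arbitrary.
    lines.append("    classDef source fill:#e1bee7,stroke:#6a1b9a")
    lines.append("    classDef table fill:#bbdefb,stroke:#0d47a1")
    return "\n".join(lines)


nodes = [
    {"id": "source_db", "type": "source", "name": "Source Database", "layer": "raw"},
    {"id": "stg_orders", "type": "table", "name": "Staging Orders", "layer": "staging"},
]
edges = [{"from": "source_db", "to": "stg_orders", "label": "load"}]
print(to_mermaid(nodes, edges))
```

The subgraph id is prefixed because Mermaid rejects a subgraph that shares an id with one of the nodes it contains.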

🚀 Quick Start

Try Online (HuggingFace Space)

  1. Visit Lineage Graph Accelerator on HuggingFace
  2. Click "Load Sample" to load example data
  3. Click "Extract Lineage" to see the visualization
  4. Explore the Demo Gallery for more examples

Run Locally

# Clone the repository
git clone https://github.com/YOUR_REPO/lineage-graph-accelerator.git
cd lineage-graph-accelerator

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py

Open http://127.0.0.1:7860 in your browser.


📖 Usage Guide

1. Text/File Metadata Tab

Paste your metadata directly:

{
  "nodes": [
    {"id": "source_db", "type": "source", "name": "Source Database"},
    {"id": "staging", "type": "table", "name": "Staging Table"},
    {"id": "analytics", "type": "table", "name": "Analytics Table"}
  ],
  "edges": [
    {"from": "source_db", "to": "staging"},
    {"from": "staging", "to": "analytics"}
  ]
}
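
Before pasting metadata, it can help to check that every edge points at a declared node. The sketch below does that for the node/edge format shown above. The function name `validate_lineage` and its messages are hypothetical, and the app may apply different or additional checks.

```python
import json


def validate_lineage(text):
    """Parse node/edge JSON and return a list of problems (empty if none)."""
    data = json.loads(text)
    ids = [node["id"] for node in data.get("nodes", [])]
    problems = []

    # Report each node id that appears more than once.
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        problems.append(f"duplicate node id: {dup}")

    # Every edge endpoint must name a declared node.
    known = set(ids)
    for edge in data.get("edges", []):
        for end in ("from", "to"):
            if edge.get(end) not in known:
                problems.append(
                    f"edge {end!r} references unknown node: {edge.get(end)}"
                )
    return problems


sample = """
{
  "nodes": [
    {"id": "source_db", "type": "source", "name": "Source Database"},
    {"id": "staging", "type": "table", "name": "Staging Table"}
  ],
  "edges": [
    {"from": "source_db", "to": "staging"},
    {"from": "staging", "to": "analytics"}
  ]
}
"""
print(validate_lineage(sample))
```

Here the second edge is flagged because `analytics` is not declared under `nodes`.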

2. Sample Data

Load pre-built samples to explore different scenarios:

  • Simple JSON: Basic node/edge lineage
  • dbt Manifest: Full dbt project with 15+ models
  • Airflow DAG: ETL pipeline with 15 tasks
  • Data Warehouse: Snowflake-style multi-layer architecture
  • ETL Pipeline: Complex multi-source pipeline
  • Complex Demo: 50+ node e-commerce platform

3. Export to Data Catalogs

  1. Extract lineage from your metadata
  2. Expand "Export to Data Catalog"
  3. Select format (OpenLineage, Collibra, Purview, Alation)
  4. Click "Generate Export"
  5. Copy the JSON for import into your catalog
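
For orientation, this is roughly what an OpenLineage-style export of a single edge could look like. The field names follow the OpenLineage `RunEvent` structure (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). Everything else is an assumption: the function name, the job naming scheme, the namespace, and the producer URL are hypothetical. The schema URL should be checked against the spec version you target. The JSON generated by the app may be structured differently.

```python
import json
import uuid
from datetime import datetime, timezone


def edge_to_openlineage(edge, namespace, producer, schema_url):
    """Build one OpenLineage-style run event describing a single edge."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        # Hypothetical naming scheme: one job per edge.
        "job": {"namespace": namespace, "name": f'{edge["from"]}_to_{edge["to"]}'},
        "inputs": [{"namespace": namespace, "name": edge["from"]}],
        "outputs": [{"namespace": namespace, "name": edge["to"]}],
        "producer": producer,
        "schemaURL": schema_url,
    }


event = edge_to_openlineage(
    {"from": "source_db", "to": "staging"},
    namespace="demo",  # hypothetical namespace
    producer="https://example.com/lineage-graph-accelerator",  # placeholder
    schema_url="https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
)
print(json.dumps(event, indent=2))
```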

🔌 MCP Integration

Connect to MCP (Model Context Protocol) servers for enhanced processing:

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Lineage Graph  │────▶│   MCP Server    │────▶│   AI Model      │
│   Accelerator   │     │  (HuggingFace)  │     │   (Claude)      │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Configuration

  1. Expand "MCP Server Configuration" in the UI
  2. Enter your MCP server URL
  3. Add API key (if required)
  4. Click "Test Connection"

Run Local MCP Server

uvicorn mcp_example.server:app --reload --port 9000

Then use http://localhost:9000/mcp as your server URL.
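
If you want to probe the server outside the UI, the sketch below builds an MCP `initialize` request without sending it. It assumes the endpoint accepts standard MCP JSON-RPC 2.0 messages over HTTP POST and that an API key, if any, is passed as a bearer token. Neither assumption is confirmed by this README, and the bundled example server may expose a simpler interface. The function name and client info are hypothetical.

```python
import json
import urllib.request


def build_initialize_request(url, api_key=None):
    """Build (but do not send) a JSON-RPC 2.0 `initialize` request."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity.
            "clientInfo": {"name": "lineage-graph-accelerator", "version": "0.0.0"},
        },
    }
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json, text/event-stream",
    }
    if api_key:
        # Assumption: the server expects a bearer token.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_initialize_request("http://localhost:9000/mcp", api_key="secret")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```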


🏗️ Architecture

flowchart TD
    A[User Interface - Gradio] --> B[Input Parser]
    B --> C{Source Type}
    C -->|dbt| D[dbt Parser]
    C -->|Airflow| E[Airflow Parser]
    C -->|SQL| F[SQL Parser]
    C -->|JSON| G[JSON Parser]
    D & E & F & G --> H[LineageGraph]
    H --> I[Mermaid Generator]
    H --> J[Export Engine]
    I --> K[Visualization]
    J --> L[OpenLineage]
    J --> M[Collibra]
    J --> N[Purview]
    J --> O[Alation]

    subgraph Optional
        P[MCP Server] --> H
    end
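
The first half of the flowchart (input, source-type decision, parser, `LineageGraph`) can be sketched as follows. Only the name `LineageGraph` comes from the diagram. Its fields, the detection heuristics, and the helper names are assumptions for illustration and may not match `app.py`.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Assumed minimal shape: node attributes by id, plus directed edges."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)


def detect_source_type(text):
    """Guess the source type of pasted metadata with simple heuristics."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return "sql" if "create" in text.lower() else "unknown"
    if isinstance(data, dict) and isinstance(data.get("nodes"), dict):
        return "dbt"   # dbt manifests key nodes by unique_id
    if isinstance(data, dict) and isinstance(data.get("nodes"), list):
        return "json"  # custom node/edge format uses a list
    return "unknown"


def parse_custom_json(data):
    """Convert the custom node/edge format into a LineageGraph."""
    graph = LineageGraph()
    for node in data.get("nodes", []):
        graph.nodes[node["id"]] = node
    for edge in data.get("edges", []):
        graph.edges.append((edge["from"], edge["to"]))
    return graph


text = '{"nodes": [{"id": "a"}, {"id": "b"}], "edges": [{"from": "a", "to": "b"}]}'
if detect_source_type(text) == "json":
    graph = parse_custom_json(json.loads(text))
    print(sorted(graph.nodes), graph.edges)
```

Keeping every parser's output in one graph type is what lets the Mermaid generator and the exporters stay independent of the input format.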

Project Structure

lineage-graph-accelerator/
├── app.py                 # Main Gradio application
├── exporters/             # Data catalog exporters
│   ├── __init__.py
│   ├── base.py           # Base classes
│   ├── openlineage.py    # OpenLineage format
│   ├── collibra.py       # Collibra format
│   ├── purview.py        # Microsoft Purview format
│   └── alation.py        # Alation format
├── samples/               # Sample data files
│   ├── sample_metadata.json
│   ├── dbt_manifest_sample.json
│   ├── airflow_dag_sample.json
│   ├── sql_ddl_sample.sql
│   ├── warehouse_lineage_sample.json
│   ├── etl_pipeline_sample.json
│   └── complex_lineage_demo.json
├── mcp_example/          # Example MCP server
│   └── server.py
├── tests/                # Unit tests
│   └── test_app.py
├── memories/             # Agent configuration
├── USER_GUIDE.md         # Comprehensive user guide
├── BUILD_PLAN.md         # Development roadmap
└── requirements.txt

🧪 Testing

# Activate virtual environment
source .venv/bin/activate

# Run unit tests
python -m unittest tests.test_app -v

# Run setup validation
python test_setup.py

📋 Requirements

  • Python 3.9+
  • Gradio 5.49.1+
  • See requirements.txt for full dependencies

🎖️ Competition Submission

Track: Track 2 - MCP in Action (Productivity)

Team Members:

Judging Criteria Alignment

| Criteria | Implementation |
|---|---|
| UI/UX Design | Clean, professional interface with tabs, accordions, and color-coded visualizations |
| Functionality | Full MCP integration, multiple input formats, 5 export formats |
| Creativity | Novel approach to data lineage visualization with AI-powered parsing |
| Documentation | Comprehensive README, USER_GUIDE.md, inline comments |
| Real-world Impact | Solves critical enterprise need for data governance and compliance |

Demo Video

📺 YouTube: Watch the Demo
🎥 Loom: Alternative Link

Highlights:

  • AI Assistant with Google Gemini generating lineage from natural language
  • MCP Integration with Local Demo server
  • Demo Gallery with 50+ node complex pipelines
  • Export to Collibra, Purview, and Apache Atlas
  • Interactive Mermaid visualizations with zoom and download

Social Media Post

📱 LinkedIn: View the announcement post


🔜 Roadmap

  • Gradio 6 upgrade for enhanced UI components
  • Agentic chatbot for natural language queries (Google Gemini)
  • Apache Atlas export support
  • File upload functionality
  • Graph export as PNG/SVG
  • Batch processing API
  • Column-level lineage

🀝 Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Submit a pull request

See CONTRIBUTING.md for guidelines.


📄 License

MIT License - see LICENSE for details.


🙏 Acknowledgments

  • Anthropic - MCP Protocol and Claude
  • Gradio Team - Amazing UI framework
  • HuggingFace - Hosting and community
  • dbt Labs - Inspiration for metadata standards
  • OpenLineage - Open lineage specification

📞 Support


Built with ❤️ by Aaman Lamba for the Gradio Agents & MCP Hackathon - Winter 2025
Celebrating MCP's 1st Birthday! 🎂