An open-source metadata platform for data discovery, governance, and observability across your entire data and AI stack.
DataHub is an open-source metadata platform that acts as a centralized catalog for discovering, governing, and observing data across complex data and AI stacks. It solves the challenge of fragmented data tools by creating a unified, real-time metadata graph that connects warehouses, lakes, BI platforms, and ML systems.
DataHub is aimed at data engineers, data platform teams, and data stewards at organizations with complex, multi-tool data ecosystems who need to manage metadata at scale.
Developers choose DataHub for its battle-tested scalability from LinkedIn, real-time streaming architecture, extensive connector ecosystem, and native AI agent support via MCP, all available under a permissive Apache 2.0 license.
The Metadata Platform for your Data and AI Stack
Metadata changes stream through Kafka, so they propagate in seconds rather than hours, enabling live observability and immediate impact analysis, in line with the README's emphasis on a streaming-first design.
With 80+ production-grade connectors, it extracts deep metadata like column lineage and usage stats from tools like Snowflake and dbt, reducing integration overhead for complex data stacks.
A built-in Model Context Protocol (MCP) server lets AI coding assistants (e.g., Claude, Cursor) query metadata in natural language, directly supporting the README's focus on AI-ready context.
Authentication, authorization, and audit trails hardened in production at LinkedIn support compliance and policy enforcement for sensitive data.
Proven to handle millions of data assets and billions of relationships at hyperscale, as demonstrated in production at LinkedIn and other large organizations.
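The impact analysis mentioned above comes down to walking a lineage graph downstream from a changed asset. The sketch below illustrates that idea on a toy in-memory graph; it does not use DataHub's API, and every dataset name in it is made up for illustration.

```python
from collections import deque

# Toy lineage graph: each key feeds the assets in its list (upstream -> downstream).
# All asset names are hypothetical.
LINEAGE = {
    "snowflake.raw.orders": ["dbt.stg_orders"],
    "dbt.stg_orders": ["dbt.fct_revenue", "dbt.dim_customers"],
    "dbt.fct_revenue": ["looker.revenue_dashboard"],
    "dbt.dim_customers": [],
    "looker.revenue_dashboard": [],
}


def downstream_impact(graph, start):
    """Return every asset reachable downstream of `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


impacted = downstream_impact(LINEAGE, "dbt.stg_orders")
print(impacted)
# ['dbt.fct_revenue', 'dbt.dim_customers', 'looker.revenue_dashboard']
```

In a real deployment the graph would come from the catalog itself rather than a hard-coded dict; the traversal logic is the part this sketch is meant to show.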
A local deployment requires Docker with 8GB+ RAM and several backing services (Kafka, Elasticsearch, MySQL), which makes it non-trivial and resource-intensive for smaller teams or development environments.
Customizing ingestion recipes and managing the metadata model via YAML/API configurations demands deep data ecosystem knowledge, which can slow onboarding for new users.
Although the project is open source under Apache 2.0, key features and integrations are driven by Acryl Data, which may prioritize its commercial offering over community-driven enhancements.
The React UI is functional but may require significant frontend work for brand-specific adjustments or advanced visualizations, as the README focuses more on backend extensibility.
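To give a sense of what the ingestion recipes mentioned above look like, here is a minimal sketch that builds one as a Python dict with the source/sink layout that recipes follow. The `build_recipe` helper is hypothetical, the account and credentials are placeholders, and the Snowflake config field names are assumptions that should be checked against the connector documentation for the installed version.

```python
import json


def build_recipe(source_type, source_config, server="http://localhost:8080"):
    """Build a recipe dict with a `source` block and a `datahub-rest` sink.

    `build_recipe` is a hypothetical helper for illustration, not part of DataHub.
    """
    if not source_type:
        raise ValueError("source_type must be non-empty")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


recipe = build_recipe(
    "snowflake",
    {
        # Placeholder values; field names are assumed, verify against connector docs.
        "account_id": "my_account",
        "username": "${SNOWFLAKE_USER}",
        "password": "${SNOWFLAKE_PASSWORD}",
    },
)

# Print the recipe; the same structure written as YAML is what a recipe file contains.
print(json.dumps(recipe, indent=2))
```

The equivalent YAML file would typically be run with `datahub ingest -c <recipe file>`. Keeping credentials as `${...}` environment variable references rather than literal values avoids committing secrets alongside the recipe.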