Data Discovery
What is it?
Data Discovery is the capability that allows technical and business users to find, understand, and trust the data available within an organization. It functions as a searchable catalog or map of the data landscape.
It involves collecting technical metadata (schemas, tables, column types), operational metadata (for example, when a dataset was last refreshed), and business metadata (definitions, owners). Modern discovery tools often include “Data Lineage,” a visual view of where data comes from and where it flows to.
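To make the three kinds of metadata concrete, here is a minimal sketch of a catalog entry and a keyword search over it. The `CatalogEntry` class, the `search` function, and the table names are hypothetical and are not taken from any particular discovery tool. Real catalogs use richer models and a proper search index.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """One dataset in a toy catalog (hypothetical model, for illustration only)."""
    name: str
    columns: dict[str, str]          # technical metadata: column name -> type
    last_refreshed: datetime         # operational metadata
    description: str                 # business metadata: definition
    owner: str                       # business metadata: owner
    upstream: list[str] = field(default_factory=list)  # lineage: source tables


def search(catalog: list[CatalogEntry], term: str) -> list[str]:
    """Return names of entries whose name, description, or columns contain term."""
    term = term.lower()
    return [
        entry.name
        for entry in catalog
        if term in entry.name.lower()
        or term in entry.description.lower()
        or any(term in column.lower() for column in entry.columns)
    ]


catalog = [
    CatalogEntry(
        name="raw.orders",
        columns={"order_id": "BIGINT", "customer_id": "BIGINT", "amount": "DECIMAL(10,2)"},
        last_refreshed=datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc),
        description="Orders ingested from the checkout service",
        owner="data-platform",
    ),
    CatalogEntry(
        name="analytics.daily_revenue",
        columns={"day": "DATE", "revenue": "DECIMAL(12,2)"},
        last_refreshed=datetime(2024, 1, 15, 7, 0, tzinfo=timezone.utc),
        description="Revenue aggregated per day",
        owner="finance-analytics",
        upstream=["raw.orders"],
    ),
]

print(search(catalog, "revenue"))      # ['analytics.daily_revenue']
print(search(catalog, "customer_id"))  # ['raw.orders']
```

Keeping all three kinds of metadata on one record is what lets a single search answer "what is it, who owns it, and how fresh is it?" in one place.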
Why is it Important?
- Self-Service: It prevents data teams from being bottlenecks by allowing analysts to find datasets themselves.
- Governance & Compliance: You cannot protect or govern data you don’t know exists. Discovery is a prerequisite for GDPR/CCPA compliance, which depends on knowing where personal data is stored.
- Trust: By showing lineage and quality scores, discovery tools let users judge whether they can rely on a specific table for decision-making.
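As a rough illustration of how lineage supports trust, the sketch below walks a table's upstream dependencies and flags any that have not been refreshed recently. The lineage map, timestamps, function names, and the 24-hour threshold are all assumptions made up for this example. Actual tools expose lineage through their own APIs and compute quality scores differently.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lineage: each table maps to the tables it reads from.
LINEAGE = {
    "analytics.daily_revenue": ["staging.orders_clean"],
    "staging.orders_clean": ["raw.orders", "raw.customers"],
    "raw.orders": [],
    "raw.customers": [],
}

# Hypothetical operational metadata: last refresh time per table.
LAST_REFRESHED = {
    "analytics.daily_revenue": datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc),
    "staging.orders_clean": datetime(2024, 1, 15, 7, 0, tzinfo=timezone.utc),
    "raw.orders": datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc),
    "raw.customers": datetime(2024, 1, 12, 6, 0, tzinfo=timezone.utc),
}


def upstream_tables(table: str, lineage: dict[str, list[str]]) -> set[str]:
    """Collect every direct and indirect source of table (depth-first walk)."""
    seen: set[str] = set()
    stack = list(lineage.get(table, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(lineage.get(current, []))
    return seen


def stale_upstreams(
    table: str,
    lineage: dict[str, list[str]],
    last_refreshed: dict[str, datetime],
    now: datetime,
    max_age: timedelta,
) -> list[str]:
    """Return upstream tables whose last refresh is older than max_age."""
    return sorted(
        name
        for name in upstream_tables(table, lineage)
        if now - last_refreshed[name] > max_age
    )


now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
print(stale_upstreams("analytics.daily_revenue", LINEAGE, LAST_REFRESHED,
                      now, timedelta(hours=24)))  # ['raw.customers']
```

Here the revenue table itself looks fresh, but one of its indirect sources is three days old, which is the kind of problem lineage makes visible before a decision is based on the table.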
Well-known Solutions
- Open Source: Amundsen (originally from Lyft), DataHub (originally from LinkedIn), Apache Atlas.
- Enterprise: Alation, Collibra.
- Cloud Native: AWS Glue Data Catalog, Google Cloud Data Catalog.