When you plan your data governance strategy, understanding how data lineage works in Knowledge Catalog can help you make better architectural decisions. Keep the following considerations in mind:
- Project-level activation: When you enable the Data Lineage API, lineage tracking is active across the entire project by default, and supported systems automatically report lineage according to their product-level controls. You can also control lineage ingestion hierarchically for specific services.
- Regulatory compliance: You get clear visibility into data movement while understanding exactly what metadata is recorded and how it's secured.
- Cost management: You can proactively review and manage the billing impact of lineage tracking across your projects.
Product-level lineage controls
When the Data Lineage API is enabled, supported systems report lineage according to their product-level controls:
| System | Available lineage controls |
|---|---|
| BigQuery, Cloud Data Fusion | You can't restrict lineage tracking to only Cloud Data Fusion or BigQuery when the Data Lineage API is enabled in a project. |
| Managed Service for Apache Airflow | Managed Airflow uses environment-level data lineage integration control. Data lineage is automatically enabled for all new Managed Airflow environments that meet the requirements. See Data lineage with Managed Airflow for more information. For existing environments, use the environment settings to enable or disable data lineage integration. |
| Dataflow | You can capture lineage events with Dataflow jobs and publish them to the Data Lineage API. See Use data lineage in Dataflow for more information. |
| Managed Service for Apache Spark | You can capture lineage events with Managed Service for Apache Spark jobs and publish them to the Data Lineage API. See Using Spark data lineage for more information. |
| Looker (Google Cloud core) (Preview) | You can use data lineage to visualize Looker (Google Cloud core) metadata from BigQuery sources. Data lineage must be enabled both at the Looker (Google Cloud core) resource level and at the data lineage service level. See Track data lineage with Knowledge Catalog for more information. |
| Vertex AI | Data lineage is automatically enabled for Vertex AI pipelines, tracking input artifacts and execution parameters (such as models, datasets, and components), as well as downstream derived assets. See Track the lineage of pipeline artifacts for more information. |
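
The table above can be read as a small decision rule. The following Python sketch is only an illustration of that reading, not part of any Google Cloud client library: the `Control` enum, the `CONTROLS` mapping, and the `reports_lineage` function are hypothetical names, and the single `product_opt_in` flag is a simplification that stands in for whatever environment, job, or resource setting the product actually uses.

```python
from enum import Enum


class Control(Enum):
    """Illustrative labels for the kinds of control listed in the table."""
    PROJECT_API_ONLY = "reports whenever the Data Lineage API is enabled"
    ENVIRONMENT_SETTING = "per-environment integration setting"
    JOB_LEVEL = "captured and published per job"
    RESOURCE_AND_SERVICE = "enabled on the resource and on the lineage service"
    AUTOMATIC = "enabled automatically for pipelines"


# Hypothetical mapping that mirrors the table rows; keys are display names only.
CONTROLS = {
    "BigQuery": Control.PROJECT_API_ONLY,
    "Cloud Data Fusion": Control.PROJECT_API_ONLY,
    "Managed Service for Apache Airflow": Control.ENVIRONMENT_SETTING,
    "Dataflow": Control.JOB_LEVEL,
    "Managed Service for Apache Spark": Control.JOB_LEVEL,
    "Looker (Google Cloud core)": Control.RESOURCE_AND_SERVICE,
    "Vertex AI": Control.AUTOMATIC,
}


def reports_lineage(system: str, api_enabled: bool, product_opt_in: bool = False) -> bool:
    """Return True if, under this simplified model, the system reports lineage.

    product_opt_in is an assumption: one flag representing the product-level
    setting (environment, job, or resource) being switched on.
    """
    if not api_enabled:
        # Nothing is reported when the Data Lineage API is disabled in the project.
        return False
    control = CONTROLS[system]
    if control in (Control.PROJECT_API_ONLY, Control.AUTOMATIC):
        # No separate product-level switch is modeled for these systems.
        return True
    return product_opt_in
```

For example, `reports_lineage("BigQuery", api_enabled=True)` is true in this model regardless of any opt-in, which matches the row stating that tracking can't be restricted for BigQuery or Cloud Data Fusion, while `reports_lineage("Dataflow", api_enabled=True)` stays false until `product_opt_in=True` is passed.
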
Billing impact
The Data Lineage API is enabled on a per-project basis, so enabling it can affect billing for every supported system in the project (see Product-level lineage controls). Review the impact on your billing charges before you enable the API. For more information about how data lineage is charged, see Knowledge Catalog pricing.
For BigQuery Omni, lineage processing is distributed to specific regions, and costs depend on the regions where the processing is performed.
Data lineage compliance
- Data lineage records metadata about data movement but doesn't capture the data itself. For details about which fields the metadata includes, see the data lineage information model and the Data Lineage API reference.
- Data lineage, as part of Knowledge Catalog, supports VPC Service Controls (VPC-SC).
- Knowledge Catalog doesn't support customer-managed encryption keys (CMEK) for protecting the harvested lineage metadata.
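
To make the first point concrete, the sketch below models a lineage event as plain Python dataclasses. It is a simplified illustration, not the client library: the class and field names (`EntityReference`, `EventLink`, `LineageEvent`, `fully_qualified_name`) are assumed to loosely follow the information model, and the project, dataset, and table names are made up. Check the Data Lineage API reference for the authoritative fields.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Set, Tuple


@dataclass(frozen=True)
class EntityReference:
    # A name that points at an asset; the asset's rows are never included.
    fully_qualified_name: str


@dataclass(frozen=True)
class EventLink:
    source: EntityReference
    target: EntityReference


@dataclass(frozen=True)
class LineageEvent:
    start_time: datetime
    end_time: datetime
    links: Tuple[EventLink, ...]


# Hypothetical event: one table feeds another. All names are illustrative.
event = LineageEvent(
    start_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    end_time=datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
    links=(
        EventLink(
            source=EntityReference("bigquery:example-project.sales.raw_orders"),
            target=EntityReference("bigquery:example-project.sales.daily_totals"),
        ),
    ),
)


def referenced_assets(ev: LineageEvent) -> Set[str]:
    """Collect every asset name the event mentions (sources and targets)."""
    names: Set[str] = set()
    for link in ev.links:
        names.add(link.source.fully_qualified_name)
        names.add(link.target.fully_qualified_name)
    return names
```

In this model, the only things a reviewer could learn from the event are which assets were involved, in which direction data moved, and when. That is the kind of information to account for in a compliance review, since asset names themselves can be sensitive.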
Data lineage limitations
When you select a node in the lineage graph, the node details side panel is empty in the following cases:
- The resource is located in another organization.
- The user isn't a member of the organization hosting the resource.