Status: Draft
Authors: [email protected]
Stakeholders: List of people whose comments are desired
Last updated: 2022-12-13
As we move towards data pipelines, manually tracking data provenance will become a serious challenge. This document captures various thoughts regarding the use of OpenLineage to automate this process.
Currently, Bacalhau allows you to execute single compute jobs whose outputs are stored on IPFS/Filecoin. From a job-spec perspective, it is straightforward to see which artifacts a job generates. Unfortunately, the other way around is not so easy: given some data on IPFS, can you tell which job generated it? As we move towards pipelining jobs, tracing data provenance across chained compute runs will become very hard.
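To make the asymmetry concrete, here is a minimal sketch of the forward direction (job → outputs). The CLI flags shown are illustrative and may differ between Bacalhau versions:

```python
# Submit a trivial Bacalhau job and inspect its outputs.
import subprocess

# Submit a job and capture its job ID (--wait/--id-only are assumed flags).
job_id = subprocess.check_output(
    ["bacalhau", "docker", "run", "--wait", "--id-only", "ubuntu", "echo", "hello"],
    text=True,
).strip()

# The job description records which artifacts were published (e.g. IPFS CIDs)...
subprocess.run(["bacalhau", "describe", job_id], check=True)
# ...but there is no built-in reverse lookup from a CID back to the job that produced it.
```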
OpenLineage is an open standard for lineage data collection. It tracks metadata about datasets, jobs, and runs. It’s also well integrated with Apache Airflow, the DAG orchestrator we’re using for the Bacalhau pipelines. More info on the Airflow integration can be found here.
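As a rough illustration of what the standard captures, the sketch below emits a single OpenLineage run event with the openlineage-python client. The namespace, job name, dataset names, and endpoint URL are all placeholder assumptions:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Point the client at an OpenLineage-compatible backend (e.g. Marquez).
client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="bacalhau-demo", name="resize-images")

# Emit a COMPLETE event declaring which dataset this run consumed and produced.
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/bacalhau-lineage",  # placeholder producer URI
        inputs=[Dataset(namespace="ipfs", name="QmExampleInputCid")],
        outputs=[Dataset(namespace="ipfs", name="QmExampleOutputCid")],
    )
)
```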
While OpenLineage is a standard, Marquez is its reference Java implementation: it collects lineage events in a metadata repository, exposes them over an HTTP API, and ships a web UI for exploring the resulting lineage graphs.
The Airflow integration uses the openlineage-airflow library, which sends OpenLineage events to Marquez.
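In practice the DAG itself needs no lineage-specific code; with openlineage-airflow installed and the OPENLINEAGE_URL environment variable pointing at Marquez, run metadata is emitted automatically. The DAG below is an assumed minimal example, and the namespace name is a placeholder:

```python
# Prerequisite environment (values are assumptions for illustration):
#   export OPENLINEAGE_URL=http://localhost:5000     # Marquez endpoint
#   export OPENLINEAGE_NAMESPACE=bacalhau-pipelines  # logical namespace
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bacalhau_pipeline",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,
) as dag:
    # Each task run is reported to Marquez as an OpenLineage job/run pair.
    submit = BashOperator(
        task_id="submit_job",
        bash_command="bacalhau docker run --wait ubuntu echo hello",
    )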
The collected DAG metadata can answer questions such as “What are the upstream dependencies of a DAG?”
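For example, Marquez exposes a lineage endpoint that can be walked to find a node’s upstream dependencies. The sketch below is based on the Marquez HTTP API at the time of writing; the namespace and job names are placeholders:

```python
import requests

MARQUEZ = "http://localhost:5000"
node_id = "job:bacalhau-demo:resize-images"  # format: "job:<namespace>:<name>"

# Fetch the lineage graph centered on our job.
resp = requests.get(f"{MARQUEZ}/api/v1/lineage", params={"nodeId": node_id})
resp.raise_for_status()
graph = resp.json()["graph"]

# Edges pointing into our node are its upstream dependencies.
for node in graph:
    if node["id"] == node_id:
        for edge in node.get("inEdges", []):
            print("upstream:", edge["origin"])
```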