Status: Draft
Authors: [email protected]
Stakeholders: List of people whose comments are desired
Last updated: 2022-12-13
As we move towards data pipelines, manually tracking data provenance will become a serious challenge. This document captures various thoughts regarding the use of OpenLineage to automate this process.
Currently, Bacalhau allows you to execute single compute jobs whose outputs are stored on IPFS/Filecoin. From a job-spec perspective, it is straightforward to see which artifacts a job generates. Unfortunately, the other way around is not so easy: given some data on IPFS, can you tell which job generated it? As we move towards pipelining jobs, tracing data provenance across chained compute runs will become very hard.
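To make the asymmetry concrete, here is a minimal sketch of the forward direction (job → outputs). The CLI flags shown are illustrative and may differ between Bacalhau versions:

```python
# Submit a trivial Bacalhau job and inspect its outputs.
import subprocess

# Submit a job and capture its job ID (--wait/--id-only are assumed flags).
job_id = subprocess.check_output(
    ["bacalhau", "docker", "run", "--wait", "--id-only", "ubuntu", "echo", "hello"],
    text=True,
).strip()

# The job description records which artifacts were published (e.g. IPFS CIDs)...
subprocess.run(["bacalhau", "describe", job_id], check=True)
# ...but there is no built-in reverse lookup from a CID back to the job that produced it.
```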
OpenLineage is an open standard for lineage data collection. It tracks metadata about datasets, jobs, and runs. It’s also well integrated with Apache Airflow, the DAG orchestrator we’re using for the Bacalhau pipelines. More info on the Airflow integration can be found here.
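As a rough illustration of what the standard captures, the sketch below emits a single OpenLineage run event with the openlineage-python client. The namespace, job name, dataset names, and endpoint URL are all placeholder assumptions:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Point the client at an OpenLineage-compatible backend (e.g. Marquez).
client = OpenLineageClient(url="http://localhost:5000")

run = Run(runId=str(uuid4()))
job = Job(namespace="bacalhau-demo", name="resize-images")

# Emit a COMPLETE event declaring which dataset this run consumed and produced.
client.emit(
    RunEvent(
        eventType=RunState.COMPLETE,
        eventTime=datetime.now(timezone.utc).isoformat(),
        run=run,
        job=job,
        producer="https://example.com/bacalhau-lineage",  # placeholder producer URI
        inputs=[Dataset(namespace="ipfs", name="QmExampleInputCid")],
        outputs=[Dataset(namespace="ipfs", name="QmExampleOutputCid")],
    )
)
```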
While OpenLineage is a standard, Marquez is its reference Java implementation: it collects lineage events in a metadata repository, exposes them over an HTTP API, and ships a web UI for exploring the resulting lineage graphs.
The Airflow integration uses the openlineage-airflow library, which sends OpenLineage events to Marquez.
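In practice the DAG itself needs no lineage-specific code; with openlineage-airflow installed and the OPENLINEAGE_URL environment variable pointing at Marquez, run metadata is emitted automatically. The DAG below is an assumed minimal example, and the namespace name is a placeholder:

```python
# Prerequisite environment (values are assumptions for illustration):
#   export OPENLINEAGE_URL=http://localhost:5000     # Marquez endpoint
#   export OPENLINEAGE_NAMESPACE=bacalhau-pipelines  # logical namespace
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="bacalhau_pipeline",
    start_date=datetime(2022, 12, 1),
    schedule_interval=None,
) as dag:
    # Each task run is reported to Marquez as an OpenLineage job/run pair.
    submit = BashOperator(
        task_id="submit_job",
        bash_command="bacalhau docker run --wait ubuntu echo hello",
    )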
The collected DAG metadata can answer questions such as “What are the upstream dependencies of a DAG?”
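For example, Marquez exposes a lineage endpoint that can be walked to find a node’s upstream dependencies. The sketch below is based on the Marquez HTTP API at the time of writing; the namespace and job names are placeholders:

```python
import requests

MARQUEZ = "http://localhost:5000"
node_id = "job:bacalhau-demo:resize-images"  # format: "job:<namespace>:<name>"

# Fetch the lineage graph centered on our job.
resp = requests.get(f"{MARQUEZ}/api/v1/lineage", params={"nodeId": node_id})
resp.raise_for_status()
graph = resp.json()["graph"]

# Edges pointing into our node are its upstream dependencies.
for node in graph:
    if node["id"] == node_id:
        for edge in node.get("inEdges", []):
            print("upstream:", edge["origin"])
```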