Enrico, 2022-11-15

📣 UPDATE: the milestones below are tracking the implementation of this plan 👇

M1: Python SDK for easy interaction with Bacalhau from Python · Issue #1183 · filecoin-project/bacalhau

M2: Production-ready Airflow integration · Issue #1176 · filecoin-project/bacalhau

Context

We’ve moved past the 🐟 CoD Summit in Lisbon, where the experimental Bacalhau Airflow Provider was demoed live (watch the YouTube video here 🎦). That was the very first prototype of a Bacalhau pipeline running on Apache Airflow, hooray 🥳! For reference, here’s the doc with all the thinking on how we got there: Initial design doc (Oct.22).

Now we’re committed to delivering full DAG support by June 2023, and this document lays out the plan for the next iteration cycle. New design docs will be linked here as we create them.

High-Level Goals

Next steps

There are two phases ahead.

In Phase 1 we capitalize on the learnings from the first Airflow prototype by making it production-ready, adding a substantial feature such as data lineage, and providing a simplified, hands-free way to host Airflow (relieving users of the burden of running it locally).
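To make that concrete, here’s a minimal sketch of a two-step pipeline built on the provider. The operator name, its import path, and the `job_spec` parameter are illustrative assumptions about what the production API could look like, not a committed interface:

```python
# Illustrative only: operator name, import path, and parameters are assumptions.
from datetime import datetime

from airflow import DAG
from bacalhau_airflow.operators import BacalhauSubmitJobOperator  # hypothetical

with DAG(
    dag_id="bacalhau_word_count",
    start_date=datetime(2022, 11, 15),
    schedule_interval=None,  # triggered manually
) as dag:
    # Each task submits one job to the Bacalhau network; Airflow handles
    # ordering, retries, and (in Phase 1) data lineage between tasks.
    ingest = BacalhauSubmitJobOperator(
        task_id="ingest",
        job_spec={
            "engine": "Docker",
            "docker": {"image": "ubuntu", "entrypoint": ["echo", "hello"]},
        },
    )
    count = BacalhauSubmitJobOperator(
        task_id="count",
        job_spec={
            "engine": "Docker",
            "docker": {"image": "ubuntu", "entrypoint": ["wc", "-l"]},
        },
    )
    ingest >> count  # count consumes ingest's output (e.g. a CID passed via XCom)
```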

In Phase 2 we aim to make DAGs as generic as possible (i.e. orchestrator-agnostic) by simplifying the way users interact with them. All they should need to do is write a standardized pipeline spec and submit it to the Bacalhau network, or possibly even to any CoD platform.
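No such schema exists yet, but here’s a hedged sketch of what a standardized, orchestrator-agnostic spec could contain (every field name below is an assumption for illustration):

```python
# Hypothetical pipeline spec, shown as plain data; field names are assumptions,
# not a defined schema.
pipeline_spec = {
    "name": "word-count",
    "tasks": [
        {
            "id": "fetch",
            "image": "ubuntu",
            "command": ["sh", "-c", "echo hello > /outputs/data.txt"],
        },
        {
            "id": "count",
            "image": "ubuntu",
            "command": ["wc", "-l", "/inputs/data.txt"],
            "depends_on": ["fetch"],  # edges between tasks define the DAG
        },
    ],
}
```

Note the spec deliberately says nothing about which orchestrator runs it; translating it into a backend’s native format would be a separate layer’s job.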

In both phases, Bacalhau core doesn’t reinvent the wheel: it uses an external DAG orchestrator! Ideally, it’d ship with the primitives to support one or more orchestrators.
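One way to read “primitives to support one or more orchestrators” is a thin adapter interface inside core. The sketch below is an assumption about that shape, with all names invented for illustration:

```python
# Sketch of an orchestrator abstraction: core compiles a generic pipeline spec
# into whichever backend is available. All names here are hypothetical.
from abc import ABC, abstractmethod
from typing import Any


class DAGOrchestrator(ABC):
    """Adapter boundary that keeps Bacalhau core orchestrator-agnostic."""

    @abstractmethod
    def compile(self, pipeline_spec: dict) -> Any:
        """Translate the generic spec into the backend's native DAG format."""

    @abstractmethod
    def run(self, dag: Any) -> str:
        """Execute the compiled DAG and return a run identifier."""


class AirflowOrchestrator(DAGOrchestrator):
    """First concrete backend, matching the Phase 1 work."""

    def compile(self, pipeline_spec: dict) -> Any:
        ...  # emit an Airflow DAG whose tasks submit Bacalhau jobs

    def run(self, dag: Any) -> str:
        ...  # trigger a run, e.g. via the Airflow REST API
```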

Phase 1 (Nov.22-Jan.23)

Make the Bacalhau Airflow Provider stable and publish it to Airflow’s official Community Providers list (alongside k8s, Docker, etc.). This requires a stable Python SDK for Bacalhau. It’d also make sense to make Bacalhau itself aware that a given job (say, job-x) is part of a pipeline.
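For a feel of what the SDK (tracked in M1 above) could enable, here’s a hedged sketch; the `bacalhau_sdk` package name, the `Client` class, and the idea of pipeline annotations are all assumptions, not a published API:

```python
# Hypothetical SDK usage; package, class, and method names are assumptions.
from bacalhau_sdk import Client  # assumed entry point

client = Client()

job = client.submit(
    image="ubuntu",
    entrypoint=["echo", "hello"],
    # Annotating the job with pipeline metadata is one way to make Bacalhau
    # aware that "job-x" belongs to a larger DAG.
    annotations={"pipeline_id": "word-count", "task_id": "ingest"},
)

print(client.get_state(job.id))  # poll the job's status on the network
```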