Author: @Walid Baruni
Status: Discussion
Last Updated Date: 2023-01-16
Introduction
The purpose of this document is to define and propose the scope and capabilities of Bacalhau jobs that are executed by the compute nodes in the network. Such a definition can help in making informed decisions around currently active projects, such as the API Spec, Job Spec, DAGs, and persistent job state.
Problem Statement
There have been multiple, parallel discussions on the capabilities of Bacalhau Jobs, which will influence the job spec design and add more responsibilities to the compute nodes, including:
- Cron Jobs: the ability to run jobs periodically, which will require nodes to maintain state about the jobs, allow updating the job spec, and support querying previous executions
- Daemon Jobs: jobs that are expected to always run on certain compute nodes, such as for edge log transformation and vending. This will require compute nodes to retry the job on failure and to restart it on node startup
- DAGs and Pipelines: a job is defined as a DAG with multiple tasks, where the outputs of earlier tasks act as inputs to later tasks. This requires a DAG orchestrator that maintains the state of the DAG and knows when and how to trigger the remaining tasks.
- Job Chaining: similar to DAGs but mentioned separately due to differences in implementation details. There is no need for a DAG orchestrator; instead, jobs carry links to, or specs of, other jobs to execute on completion, for example. This will require compute nodes to act as requester nodes, discover other compute nodes capable of executing the remaining stages, and hand over the execution.
- Long Running Jobs: jobs that process logs or consume events from a streaming source, such as Kafka, Kinesis, or a queue. A couple of approaches to achieve this were discussed. One is based on the daemon jobs mentioned above, which retry on failure and on startup. Another is based on a dedicated service that maintains the state and triggers time-bounded jobs on Bacalhau whenever the previous job fails or times out. (ref: Streaming Service: a user can register a job to consume from Kafka/Kinesis. The service will trigger multiple tasks in the network to consume from each of the Kafka partitions or Kinesis shards, keep track of the checkpoint offsets, make sure that there is always an active and healthy consumer running, and retry if the consumer times out or becomes unhealthy. Another implementation would be for the Streaming Service to actually read and buffer from Kafka/Kinesis, store the data in IPFS/S3, and then trigger a Bacalhau task to process the uploaded file. Each approach has its own tradeoffs, and both can co-exist.) Rough sketches of how these capabilities might surface in a job spec, and of the dedicated streaming service approach, follow this list.
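To make the list above more concrete, the following is a minimal, hypothetical sketch of how some of these capabilities might surface in a job spec if they were pushed onto compute nodes. None of these fields exist in the current Bacalhau spec; the names (Schedule, Daemon, OnCompletion) are assumptions for illustration only.

```go
// Hypothetical sketch only: none of these fields exist in the current
// Bacalhau job spec. Names and shapes are illustrative of how the
// capabilities above might be expressed if the spec absorbed them.
package spec

import "time"

// JobSpec extends the notion of a job with the capabilities discussed above.
type JobSpec struct {
	Engine string            // e.g. docker, wasm
	Image  string            // container image to run
	Env    map[string]string // environment for the task

	// Cron Jobs: run the job periodically; requires the owner to keep
	// state about previous executions. (hypothetical field)
	Schedule string // e.g. "0 * * * *"

	// Daemon Jobs: keep the job running on matching nodes, restarting it
	// on failure and on node startup. (hypothetical field)
	Daemon bool

	// Long Running Jobs: bound each execution so nodes can time out.
	Timeout time.Duration

	// DAGs / Job Chaining: jobs to trigger when this one completes, with
	// this job's outputs wired in as their inputs. (hypothetical field)
	OnCompletion []JobSpec
}
```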
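The dedicated Streaming Service mentioned in the last bullet could look roughly like the sketch below: the service, not the compute node, owns the checkpoint offsets and the retry loop, and only triggers short, time-bounded Bacalhau tasks. submitTimeBoundedTask and the checkpoint type are hypothetical placeholders, not real Bacalhau APIs.

```go
// Hypothetical sketch of the dedicated streaming service approach. The
// service owns the checkpoint state and the retry loop; compute nodes only
// run short, time-bounded tasks.
package streaming

import (
	"context"
	"log"
	"time"
)

// checkpoint tracks how far a consumer has read in one Kafka partition
// (or Kinesis shard).
type checkpoint struct {
	Partition int
	Offset    int64
}

// submitTimeBoundedTask is a placeholder for submitting a short, time-bounded
// Bacalhau task that consumes from the given partition starting at the given
// offset, and returns the new offset once the task completes or times out.
func submitTimeBoundedTask(ctx context.Context, cp checkpoint, timeout time.Duration) (int64, error) {
	// ... submit the task to Bacalhau, wait for completion, return new offset ...
	return cp.Offset, nil
}

// supervisePartition keeps exactly one healthy consumer running per partition.
func supervisePartition(ctx context.Context, cp checkpoint) {
	for ctx.Err() == nil {
		newOffset, err := submitTimeBoundedTask(ctx, cp, 10*time.Minute)
		if err != nil {
			// The task failed or timed out: retry from the last checkpoint.
			log.Printf("partition %d: retrying from offset %d: %v", cp.Partition, cp.Offset, err)
			continue
		}
		cp.Offset = newOffset // advance the checkpoint only on success
	}
}
```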
These are all great features that will add more value to the Bacalhau network and enable more use cases. However, if not done properly, these features (and more in the future) can add significant complexity to Bacalhau compute nodes, making the network unmaintainable, difficult to operate, debug, and reason about, and not as flexible or general-purpose a compute network as we intend it to be.
Design Goals
- Ownership: to improve debuggability, traceability, and potentially incentive handling of jobs, a submitted job must have a clear owner that is held accountable for the job's execution and that must either fail fast if no progress is possible or retry the execution.
- As an example, if a job was submitted directly to a compute node (possible, but not supported today), then the compute node is the job owner.
- On the other hand, if the job was submitted to a requester service, then the requester is the job owner, not the selected compute nodes.
- Time bounded: jobs executed on compute nodes must be time bounded, meaning a node must either complete the execution or eventually time out. A user, or a requester node, should not assume that compute nodes are highly available, that they can execute long-running jobs without interruption, or that they will not suddenly disappear!
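As a rough illustration of how these two goals compose, the sketch below shows an owner (e.g. a requester) bounding every attempt with a timeout and deciding whether to retry or fail fast. runOnComputeNode, maxAttempts, and the overall shape are assumptions for illustration, not an existing Bacalhau implementation.

```go
// Minimal sketch of the two design goals from the owner's side: every
// attempt is time bounded, and the owner (not the compute node) decides
// whether to retry or fail fast.
package owner

import (
	"context"
	"errors"
	"time"
)

// runOnComputeNode is a placeholder for handing an execution to a selected
// compute node and waiting for its result within the bounds of ctx.
func runOnComputeNode(ctx context.Context, jobID string) error {
	// ... dispatch to a compute node and wait for completion or ctx expiry ...
	return nil
}

// Execute is the owner's responsibility: bound each attempt with a timeout,
// retry while progress still seems possible, and fail fast otherwise.
func Execute(jobID string, timeout time.Duration, maxAttempts int) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		err := runOnComputeNode(ctx, jobID)
		cancel()
		if err == nil {
			return nil // completed within the time bound
		}
		// Timed out or the node disappeared: the owner retries on its own terms.
	}
	return errors.New("no progress possible: failing fast after max attempts")
}
```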