Author: @Walid Baruni

Status: Finalized

Last Updated Date: 10-31-2022

Introduction

Bacalhau is a Compute over Data network when users can submit jobs that get executed on the network by a set of participating trust-less compute nodes. Bacalhau is still in the early stages of development and there are some areas that needs to be disambiguated, including how to orchestrate jobs reliably, how to scale out as more nodes join the network and how to store state about submitted jobs.

Note: This document might cover too much ground and context than what might be needed. However, I believe such context is helpful for reviewers outside of Bacalhau to give better feedback, and can be good for new hires to get up to speed.

Current Architecture

sequenceDiagram
	actor U as User
	participant R as RequesterNode
	participant C1 as ComputeNode1
	participant C2 as ComputeNode2
	U->>+R: SubmitJob
	R->>-U: ok (jobID)
	Note over R,C2: Job Publishing
  R-->>C1: JobCreated
	C1-->>C2: JobCreated
	Note over R,C2: Bidding Process
	C1-->>R: C1.Bid
	C2-->>C1: C2.Bid
	C1-->>R: C2.Bid
	R-->>C1: C1.BidAccepted
	R-->>C1: C2.BidRejected
	C1-->>C2: C2.BidRejected
	Note over R,C2: Result Verification & Pub
	C1-->>R: ResultProposal
	C1-->>C2: ResultProposal
	R-->>C1: ResultVerified
	C1-->>C2: ResultVerified
	C1-->>R: ResultPublished
	C1-->>C2: ResultPublished
	U->>+R: Get(jobID)
	R->>-U: Result

Network Communication

This is oversimplified view of what's is happening behind the scenes, and what to clarify in the diagram is that requester node and compute nodes communicate with each other through pubsub using GossipSub protocol (dotted arrows) and not through direct communication. This means that all messages are broadcasted to the whole network, even to nodes that are not interested in hearing about job updates.

Each node will receive a copy of the message, it will decrypt and deserialize it, but only process it if the NodeID matches the TargetNodeID encoded in the message.

Sharding and Concurrency

In addition to that, a single job can be executed on multiple compute nodes and a requester node can accept multiple bids if the following parameters are defined in the job spec:

  1. Sharding: The user can define sharding logic which allows splitting the input data across multiple shards, where each shard can be executed on separate nodes and go through its own lifecycle of bidding, verification and result publishing.
  2. Concurrency: The user can define how many different nodes to run the exact same job for verification purposes, which enables verification of deterministic jobs. Note: A more accurate name could’ve been Replication since the exact same job and input is executed on multiple nodes, and has nothing to do with parallelism or concurrent execution.