This document is a collection of notes and ideas for modifications to Bacalhau to allow the management of compute jobs on baremetal, including the running of a completely content-addressable operating system stack.
Whilst this document specifically targets baremetal servers, the proposed design/implementation also works for virtual machines (with any hypervisor), as well as web2 public cloud instances (for testing and early adoption).
Were Bacalhau, or a complementary component, able to manage the whole software stack and compute jobs from baremetal up, we would see many benefits. If the operating system and core services (Docker for example) were all content-addressable, stateless images, the value would be even more profound. At a high level, these benefits might include
To help illustrate the value, the following user stories show potential workflows.
The above user stories show how straightforward it is to execute a job and have the whole software stack content-addressable and ephemeral. They also show how easy it is for a user to change the role of a node from a Docker compute node to a Python compute node.
Bacalhau today is capable of taking self-contained userspace jobs, such as Docker containers, distributing them across multiple nodes for execution, and gathering the output.
Bacalhau does this by having a running agent on each participating node. A user can submit a job, which is then passed to the node to be executed. On completion, the output of the job is then retrieved.
This model works for many applications and use cases, including other container runtimes (OCI for example), WASM, standalone binaries or Python scripts.
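The submit, dispatch, and collect flow described above can be sketched as a toy model. This is only an illustration under stated assumptions: the names `Job`, `NodeAgent`, and `Requester` are hypothetical and are not Bacalhau's actual API, the "nodes" are plain in-process objects, and round-robin selection stands in for whatever scheduling Bacalhau really performs.

```python
from __future__ import annotations

import subprocess
import sys
from dataclasses import dataclass


@dataclass
class Job:
    """A self-contained userspace job; here, simply a command line."""
    command: list[str]


@dataclass
class NodeAgent:
    """Hypothetical stand-in for the agent running on a participating node."""
    name: str

    def execute(self, job: Job) -> str:
        # Run the job locally and capture its standard output.
        result = subprocess.run(
            job.command, capture_output=True, text=True, check=True
        )
        return result.stdout


@dataclass
class Requester:
    """Hypothetical front end that accepts jobs and passes them to nodes."""
    nodes: list[NodeAgent]
    _next: int = 0

    def submit(self, job: Job) -> tuple[str, str]:
        # Assumption: simple round-robin node selection, for illustration only.
        node = self.nodes[self._next % len(self.nodes)]
        self._next += 1
        # The job is passed to the node, executed, and its output retrieved.
        return node.name, node.execute(job)


requester = Requester(nodes=[NodeAgent("node-a"), NodeAgent("node-b")])
job = Job(command=[sys.executable, "-c", "print('hello from a job')"])
node_name, output = requester.submit(job)
print(node_name, output.strip())
```

Note that the sketch takes for granted that each node already has an operating system and a runtime able to execute the job, which is the assumption the rest of this document questions.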
This approach assumes that the nodes are already deployed and running an operating system, and that the upfront deployment and ongoing operational maintenance are handled elsewhere. In various forums, questions have been raised about whether Bacalhau could also manage the bootstrap and runtime of jobs on baremetal.