Welcome to the Bacalhau Project, a new way to compute, manage, and use data generated anywhere.
Bacalhau is a new platform for distributed computing that helps you manage your parallel processing jobs. Some benefits of using Bacalhau for your compute-over-data workloads include:
- Fast job processing: jobs run where the data was created (so there is no ingress/egress), and all jobs are parallel by default
- Low cost: jobs reuse the compute hardware that produced the data in the first place, and you avoid the ingress/egress fees you would otherwise pay.
- More secure: data is not collected in a central location before processing, meaning all scrubbing and security can be applied at the point of collection.
Bacalhau offers all of this while letting you use your existing tools (such as Python, JavaScript, R, and Rust) and take advantage of the latest cutting-edge technology, such as WASM and GPU support.
Bacalhau is designed to be (mostly) self-managing. Whether you're running locally, on private clusters, in a data center, or across the distributed web, the system should feel the same. Just write the code to process the data and let Bacalhau take care of the rest!
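For example, submitting a containerized job is a single CLI call. The sketch below is illustrative only; exact subcommands and flags vary between Bacalhau versions, and the job ID shown is a placeholder:

```bash
# Submit a job that runs inside a Docker container on the network;
# Bacalhau schedules it on a node near the data it needs.
bacalhau docker run ubuntu:latest -- echo "hello, world"

# List your jobs and fetch the results of a finished one
# (use the job ID printed by the submit command above).
bacalhau list
bacalhau get <job-id>
```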
Long-Term Vision
While we started the project last February, we're already in live production. You can read more on how to use Bacalhau in our official documentation. And while we are just getting started, in the near term we plan to deliver:
- A simplified job dashboard that lets you see all your jobs in flight
- A rich SDK for Python, JavaScript, and Rust
- Job execution pipelines fully compatible with Airflow
- A job zoo that enables you to pick up existing pipelines from the community
- Automatic metadata/lineage wrapping and transformation for known file types (columnar, video, audio, etc.)
- An on-premises deployment option for private and custom hardware
- Internode networking for multi-tier applications
- A standard data store that automatically records job data and lineage information
In the long term, our goal is to deliver a complete system that achieves the following:
- A fully distributed data processing system that can run on any device, anywhere
- A declarative pipeline that can both run the data processing and record the lineage of the data