Author: @Walid Baruni
Status: Discussion
Last Updated Date: 2022-12-30
Discussion of Concerns
- How real-time the logs need to be is a requirement that significantly influences the design decisions and costs.
- That said, usually not all collected data needs to be delivered in real time, and for cost reduction it can be worth having the edge hosts publish logs twice (see the sketch after this list):
- One stream of reduced and filtered events for real-time processing, such as DDoS prevention, with relaxed data-loss, duplication, and durability guarantees. Kafka or Kinesis can be good endpoints for delivering this type of logs.
- Another for buffered and compressed logs with stronger accuracy and durability guarantees, which can be used for archival, auditing, or batch processing. Object storage, such as S3, is the best endpoint for this type of logs.
- As for processing the logs, a groupBy operation is almost certainly required: the logs arrive grouped by the source/edge host that generated them, which is rarely the grouping that is actually useful.
- Users will need to colocate logs from all edge hosts by a different partition key that depends on their application, such as customerId or geographic region.
- This can be done with stream/batch processing frameworks, such as Spark and Flink, or by utilizing Kafka partitions and Kinesis shards.
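As a rough illustration of the dual-publish and partition-key ideas above, here is a minimal Python sketch of what an edge host might do with boto3. The stream and bucket names, the severity-based filter, and the customerId field are all assumptions for the example, not part of any existing setup:

```python
import gzip
import json
import time

import boto3  # assumed to be available on the edge host

kinesis = boto3.client("kinesis")
s3 = boto3.client("s3")

REALTIME_STREAM = "edge-logs-realtime"   # hypothetical Kinesis stream
ARCHIVE_BUCKET = "edge-logs-archive"     # hypothetical S3 bucket
EDGE_HOST_ID = "edge-host-001"


def publish(log_events: list) -> None:
    """Publish one batch of log events twice: filtered to Kinesis, full to S3."""
    # 1) Reduced/filtered events for real-time processing (e.g. DDoS prevention).
    #    Keying by customerId means all records for the same customer land on
    #    the same shard, which colocates them for downstream consumers.
    for event in log_events:
        if event.get("severity") == "security":  # example filter only
            kinesis.put_record(
                StreamName=REALTIME_STREAM,
                Data=json.dumps(event).encode(),
                PartitionKey=event["customerId"],
            )

    # 2) Buffered, compressed full batch for archival, auditing, or batch processing.
    body = gzip.compress("\n".join(json.dumps(e) for e in log_events).encode())
    key = f"raw/{EDGE_HOST_ID}/{int(time.time())}.json.gz"
    s3.put_object(Bucket=ARCHIVE_BUCKET, Key=key, Body=body)
```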
How Bacalhau Could Be Useful for Log Vending
First, Bacalhau jobs are currently, and should always be, stateless, transient, and short-running, with a defined timeout period, inputs, and outputs, similar to Lambda functions. It may be better to name these tasks or functions instead of jobs.
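As a sketch only (this is not the actual Bacalhau job spec or API), a Lambda-like task along these lines could be described by something like the following, where the container image, input URL, and field names are all hypothetical:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    """Hypothetical shape of a stateless, transient, short-running task."""
    image: str             # container to run
    entrypoint: List[str]  # command to execute inside the container
    inputs: List[str]      # e.g. s3:// or ipfs:// URLs mounted read-only
    outputs: List[str]     # paths collected and published when the task exits
    timeout_seconds: int   # the task is killed if it runs longer than this


# Example: a one-shot task that aggregates a single archived log file.
example = Task(
    image="ghcr.io/example/log-aggregator:latest",  # hypothetical image
    entrypoint=["python", "aggregate.py", "/inputs/logs.json.gz"],
    inputs=["s3://edge-logs-archive/raw/edge-host-001/1672403958.json.gz"],
    outputs=["/outputs/"],
    timeout_seconds=600,
)
```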
To support long-running jobs, other types of services will own triggering the execution of those tasks, keeping track of state/progress, delegating required permissions, and retrying if necessary.
These services will be market makers and connectors for different input sources. For example:
- Cron Service: A user can register a cron job with this service, which will own finding and executing a periodic task on one of the compute nodes in the network, retrying if necessary. In the log processing use case, a periodic job can query S3 for new log files for further processing (sketched after this list).
- Streaming Service: A user can register a job to consume from Kafka/Kinesis. The service will trigger multiple tasks in the network to consume from each of the Kafka partitions or Kinesis shards, keep track of the checkpoint offsets, make sure that there is always an active and healthy consumer running, and retry if the consumer times out or becomes unhealthy (also sketched after this list). Another implementation would be for the Streaming Service to read and buffer from Kafka/Kinesis itself, store the data in IPFS/S3, and then trigger a Bacalhau task to process the uploaded file. Each approach has its own tradeoffs, and both can co-exist.
- Queueing Service: Similar to the previous one, but consumes from queueing services like SQS and SNS. Implementation-wise, the Queueing Service will read a few messages and trigger a Bacalhau task with the message values as input (also sketched after this list).
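Below are rough sketches of the three connectors in Python with boto3. Bucket, stream, and queue names are made up, and `trigger_task` is only a placeholder for whatever mechanism ends up submitting a Bacalhau task. First, the Cron Service's periodic S3 poll:

```python
import time

import boto3

s3 = boto3.client("s3")

ARCHIVE_BUCKET = "edge-logs-archive"   # hypothetical bucket of raw edge logs
POLL_INTERVAL_SECONDS = 300


def trigger_task(input_url: str) -> None:
    """Placeholder for submitting a Bacalhau task with the given input."""
    print(f"would submit a processing task for {input_url}")


def poll_once(start_after: str) -> str:
    """List log objects added since the last run and trigger one task per file."""
    kwargs = {"Bucket": ARCHIVE_BUCKET, "Prefix": "raw/"}
    if start_after:
        kwargs["StartAfter"] = start_after
    last_key = start_after
    for obj in s3.list_objects_v2(**kwargs).get("Contents", []):
        trigger_task(f"s3://{ARCHIVE_BUCKET}/{obj['Key']}")
        last_key = obj["Key"]
    return last_key


if __name__ == "__main__":
    checkpoint = ""  # the real service would persist this to survive restarts
    while True:
        checkpoint = poll_once(checkpoint)
        time.sleep(POLL_INTERVAL_SECONDS)
```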
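Next, one possible shape of the Streaming Service's per-shard consumer; the service would run one of these per Kinesis shard (or Kafka partition), supervise its health, and persist the checkpoint so a replacement consumer can resume where a failed one stopped:

```python
import boto3

kinesis = boto3.client("kinesis")

STREAM_NAME = "edge-logs-realtime"   # hypothetical stream fed by the edge hosts


def trigger_task(records: list) -> None:
    """Placeholder for submitting a Bacalhau task over a batch of records."""
    print(f"would submit a task over {len(records)} records")


def consume_shard(shard_id: str, checkpoint: str = "") -> None:
    """Consume one shard, triggering a task per batch and tracking the offset."""
    if checkpoint:
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM_NAME, ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=checkpoint,
        )["ShardIterator"]
    else:
        iterator = kinesis.get_shard_iterator(
            StreamName=STREAM_NAME, ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]

    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=500)
        if resp["Records"]:
            trigger_task(resp["Records"])
            checkpoint = resp["Records"][-1]["SequenceNumber"]  # persist this
        iterator = resp.get("NextShardIterator")
```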
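Finally, the Queueing Service variant, which reads a few messages from SQS, hands their bodies to a task, and deletes them only once the batch has been handed off:

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URL; substitute the real one for your account.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/edge-log-events"


def trigger_task(message_bodies: list) -> None:
    """Placeholder for submitting a Bacalhau task with the messages as input."""
    print(f"would submit a task over {len(message_bodies)} messages")


def poll_once() -> None:
    """Read a small batch of messages and trigger a task over their values."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    messages = resp.get("Messages", [])
    if not messages:
        return
    trigger_task([m["Body"] for m in messages])
    # Delete only after the task has been handed off; otherwise the messages
    # become visible again and the batch is retried by a later poll.
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```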
These are **difficult services to implement and operate**, but anyone can be an orchestration service provider and participate in the network with a new connector and source type. Network access will be required, but I think we are moving towards that already.