We want to build a Public Spark Dashboard showing aggregated data about retrieval measurements performed by Spark checker nodes.
We agreed to aggregate the data with one-day granularity.
What data do we want to show?
- Step 1: retrieval success rate (RSR) - a single percentage
- One number per day
- Step 2: RSR per Storage Provider (minerId)
- N numbers per day, where N is ~600. We expect this number to grow with no upper bound.
- Later: sky is the limit. See Retrieval Bot Dashboard.
- We may want to show data for each retrieval task performed, where a retrieval task is (payload CID + the deal it belongs to)
- We may want to show the individual measurements reported by the checker nodes, grouped by provider
- RSR per Station instance
- RSR per geo-region where Stations are running
- RSR per SP per geo-region of Stations
- Breakdown of retrieval error types - overall + per-SP
- Breakdowns per clientId from the FIL deal
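Steps 1 and 2 are plain ratio aggregations over the day's measurements. A minimal sketch of both, assuming each measurement carries a `minerId` and a boolean `retrievalSucceeded` flag (both field names are illustrative, not the actual spark-evaluate schema):

```javascript
// Step 1: overall Retrieval Success Rate for one day, as a percentage.
const dailyRsr = (measurements) => {
  if (measurements.length === 0) return 0
  const successful = measurements.filter((m) => m.retrievalSucceeded).length
  return (100 * successful) / measurements.length
}

// Step 2: the same rate broken down per Storage Provider (minerId).
const dailyRsrPerMiner = (measurements) => {
  const byMiner = new Map()
  for (const { minerId, retrievalSucceeded } of measurements) {
    const stats = byMiner.get(minerId) ?? { total: 0, successful: 0 }
    stats.total += 1
    if (retrievalSucceeded) stats.successful += 1
    byMiner.set(minerId, stats)
  }
  return new Map(
    [...byMiner].map(([minerId, s]) => [minerId, (100 * s.successful) / s.total])
  )
}
```

The later breakdowns (per Station, per geo-region, per error type) are the same shape of aggregation with a different grouping key.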
What database to use
Prior art:
- RetrievalBot uses MongoDB to store the data. It produces multiple orders of magnitude fewer measurements than SPARK does.
- We used to use Postgres in SPARK, but reached the 500 GiB limit imposed by Fly.io after less than 3 months.
- spark-evaluate aggregates data into InfluxDB now. It's not clear whether InfluxDB Cloud is cheaper or more expensive than Postgres-on-Fly.
Known issues:
- InfluxDB Cloud sometimes needs 30-100 minutes to make new data available for reading. Aggregation queries building per-day data from per-hour data must take this into account.
InfluxDB docs for downsampling: Downsampling with InfluxDB v2.0
- InfluxDB does not support versioning task queries/scripts in git. We could use an infrastructure-as-code tool like Terraform, but that adds a lot of complexity.
- InfluxDB indexes measurements and tags but does not index field values. Tags should not store values from a growing/unbounded set, such as miner IDs. As a result, we cannot get efficient index-based queries for per-miner stats. (Learn more in the InfluxDB docs: Data Layout and Schema Design Best Practices for InfluxDB.)
- If we wanted to expose Postgres to Grafana, we would have to open it up to the entire internet. We had this setup in the past and saw attackers trying to guess our admin password and compromise our DB. To stay safe, I prefer to keep the DB hidden and expose the data via a REST API - but that requires more work on our side.
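One way to keep the DB hidden is a thin read-only API that maps each public endpoint to a fixed query, so no SQL ever crosses the network boundary. A sketch of the idea; the endpoint paths, table names, and columns below are made up for illustration:

```javascript
// Each public endpoint maps to one fixed SQL statement. Clients can only
// pick an endpoint, never supply SQL, so the DB stays unreachable from the
// internet. All names here are hypothetical.
const PUBLIC_QUERIES = {
  '/retrieval-success-rate': `
    SELECT day, successful::float / total AS rsr
    FROM daily_stats
    ORDER BY day
  `,
  '/miner-retrieval-success-rate': `
    SELECT day, miner_id, successful::float / total AS rsr
    FROM daily_miner_stats
    ORDER BY day, miner_id
  `
}

// Resolve a request path to a whitelisted query; anything else is a 404.
const resolveQuery = (path) => PUBLIC_QUERIES[path] ?? null
```

The HTTP layer then just runs the resolved query against the private Postgres instance and returns the rows as JSON.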
Proposal
I am proposing the following architecture. It reuses existing building blocks we are already familiar with.
- Store the public data in a Postgres database.
- This allows us to store rich objects with many properties and define indexes to get performant queries.
- Modify the existing spark-evaluate service to update the public data every round, similarly to how it already publishes aggregated per-round metrics to InfluxDB.
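Since multiple rounds fall into the same day, the per-round update can be an idempotent UPSERT that folds the round's totals into that day's row. A sketch under assumed names (the `daily_stats` table and its columns are hypothetical, not an existing schema):

```javascript
// Build the parameterised UPSERT that adds one round's totals to the
// per-day public stats row. Returned shape matches what query clients
// like node-postgres accept ({ text, values }).
const buildDailyStatsUpdate = (roundStartedAt, total, successful) => ({
  text: `
    INSERT INTO daily_stats (day, total, successful)
    VALUES ($1::date, $2, $3)
    ON CONFLICT (day) DO UPDATE SET
      total = daily_stats.total + EXCLUDED.total,
      successful = daily_stats.successful + EXCLUDED.successful
  `,
  values: [roundStartedAt.toISOString().slice(0, 10), total, successful]
})
```

Because `ON CONFLICT (day)` accumulates into the existing row, spark-evaluate can run this after every round without any separate end-of-day aggregation job.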