We decided to build SP Retrieval Checker as the first Zinnia module and the first module paying rewards to Station operators. You can find the design discussions in the "Station Module: SP Retrieval Checker (Spark)" document.
The milestones below outline the engineering work. In parallel to the technical work, we need to answer the business & product questions, especially the most important one: Who is going to fund the FIL pool for paying out rewards?
We need to start working on rewards early on, in parallel with the technical work; see M3 Oct 5th: Rewards Alpha.
M1 June 13th: Reward-less Walking Skeleton
[19 days = 5 weeks of work]
Build a walking skeleton covering several functional areas. Implement as little functionality as possible while still delivering a meaningful system.
- Spark API - job scheduling (web2-style) [5 days of work]
- Web2-style, let’s get to rewards as quickly as possible.
- A cloud-based Orchestrator service that assigns checker jobs to Station instances. In the future, we should replace this with a smart contract.
- The simplest job selection possible: embed a hard-coded list of `(CID, providerEndpoint, protocol)` tuples into the deployed Orchestrator service and pick a random item whenever we need to schedule a job. (We will improve this in the following iterations.)
- The API implementation writes the randomly selected job into our internal database, together with a timestamp. It generates a unique `JobId` field, stores it in the DB, and includes it in the response. (A minimal sketch of this route follows this list.)
- Let’s use Fly.io because it offers both server hosting and database hosting.
- We should implement schema migrations using an automated tool (e.g. postgrator or something else).
- The Station/Zinnia team will operate this orchestrator.
- Set up CI, tests, and linting. Set up automatic deployments (CD) on push or tag.
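To make the shape of this service concrete, here is a minimal sketch of the job-scheduling route. It assumes a Node.js monolith using Express and node-postgres; the route path, table layout, and the task list values are illustrative assumptions, not decisions.

```js
// Hypothetical sketch only. Assumes an Express app backed by node-postgres
// and a table like: jobs(id uuid, cid text, provider_endpoint text,
// protocol text, scheduled_at timestamptz). All names are placeholders.
import express from 'express'
import pg from 'pg'
import { randomUUID } from 'node:crypto'

export const db = new pg.Pool({ connectionString: process.env.DATABASE_URL })
export const app = express()

// Hard-coded list of (CID, providerEndpoint, protocol) tuples.
// The values below are fake placeholders.
const retrievalTasks = [
  { cid: 'bafy-placeholder-1', providerEndpoint: 'https://sp-one.example', protocol: 'http' },
  { cid: 'bafy-placeholder-2', providerEndpoint: 'https://sp-two.example', protocol: 'graphsync' },
]

app.post('/jobs', async (_req, res) => {
  // Pick a random task and mint a unique JobId for it
  const task = retrievalTasks[Math.floor(Math.random() * retrievalTasks.length)]
  const jobId = randomUUID()

  // Record the scheduled job together with a timestamp
  await db.query(
    'INSERT INTO jobs (id, cid, provider_endpoint, protocol, scheduled_at) VALUES ($1, $2, $3, $4, now())',
    [jobId, task.cid, task.providerEndpoint, task.protocol],
  )

  // Include the JobId in the response so the Station can report it back
  res.json({ jobId, ...task })
})
```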
- A Zinnia module to perform the retrieval checks. [5 days of work]
- We can use an HTTP API to ask the Orchestrator for the job to perform. (A rough sketch of the whole module follows this list.)
- Retrieve the CID using the IPFS Gateway or Saturn CDN, making HTTP requests via the Fetch API. (We will rework this in the following iterations.)
- Submit retrieval logs & metrics to the Ingester
- Look at what Saturn and Rhea/Lassie are including in their logs.
- TTFB, TTLB, error rates, retry count, download speed, etc.
- Talk to Will and Lauren about what data will be useful to them. Maybe we can make this dynamically configurable?
- Let’s start with the easy metrics we can get from the Fetch API alone.
- When a Station reports the job outcome to the Ingester, it includes `JobId` and `WalletAddress` in the payload.
- We will need to mock/stub Orchestrator and Ingester for testing.
- No verification of the retrieval content, we will add that via Lassie later.
- The module will be deployed to Filecoin Stations, where it will replace the placeholder peer-checker module.
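A rough sketch of the whole module loop, under the same assumptions as the Orchestrator sketch above: the endpoint URLs and payload fields are placeholders, `Zinnia.walletAddress` is the wallet address provided by the Zinnia runtime, and iterating over `res.body` assumes the runtime's ReadableStream is async-iterable (as in Deno).

```js
// Hypothetical sketch of the checker loop; URLs are placeholders.
const ORCHESTRATOR = 'https://orchestrator.example'
const INGESTER = 'https://ingester.example'

while (true) {
  // Ask the Orchestrator for the next job: { jobId, cid, ... }
  const job = await (await fetch(`${ORCHESTRATOR}/jobs`, { method: 'POST' })).json()

  // Retrieve the CID via the IPFS Gateway, collecting the easy metrics
  // the Fetch API gives us (TTFB, TTLB, download size, errors)
  const started = Date.now()
  let ttfb = null
  let ttlb = null
  let byteLength = 0
  let error = null
  try {
    const res = await fetch(`https://ipfs.io/ipfs/${job.cid}`)
    ttfb = Date.now() - started
    if (res.ok) {
      for await (const chunk of res.body) byteLength += chunk.length
      ttlb = Date.now() - started
    } else {
      error = `HTTP ${res.status}`
    }
  } catch (err) {
    error = String(err)
  }

  // Report the outcome to the Ingester, including JobId and WalletAddress
  await fetch(`${INGESTER}/retrievals`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify({
      jobId: job.jobId,
      walletAddress: Zinnia.walletAddress,
      ttfb,
      ttlb,
      byteLength,
      error,
    }),
  })
}
```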
- Improve Zinnia DX for building Station modules [6 days]
- Spark API - Ingester [2 days]
- Add a new HTTP route to our backend monolith app to receive reports about completed jobs. This will share the same DB with Orchestrator/Job Scheduler.
- When Ingester receives a job report from a Station, it attaches the current timestamp to the record stored. This will help us troubleshoot and may be useful later for fraud detection.
- The JobId in the report allows us to verify that Stations performed jobs our Orchestrator actually scheduled: the Ingester accepts a report only if it provides a valid JobId.
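Continuing the hypothetical Express sketch from the Orchestrator section (same `app` and `db`, since both routes live in the same monolith and share one database), the Ingester route could look roughly like this; the `retrieval_reports` table and its columns are assumptions.

```js
// Hypothetical sketch; assumes a table like retrieval_reports(job_id uuid,
// wallet_address text, measurements jsonb, received_at timestamptz).
// `app` and `db` come from the Orchestrator sketch above.
app.post('/retrievals', express.json(), async (req, res) => {
  const { jobId, walletAddress, ...measurements } = req.body

  // Accept only reports for jobs our Orchestrator actually scheduled
  const { rowCount } = await db.query('SELECT 1 FROM jobs WHERE id = $1', [jobId])
  if (rowCount === 0) return res.status(404).json({ error: 'unknown JobId' })

  // Attach a server-side timestamp to the stored record; this helps
  // troubleshooting and may be useful later for fraud detection
  await db.query(
    'INSERT INTO retrieval_reports (job_id, wallet_address, measurements, received_at) VALUES ($1, $2, $3, now())',
    [jobId, walletAddress, measurements],
  )
  res.status(201).end()
})
```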
- DevOps & Monitoring [1 day]
- We should build observability into our systems from the beginning. Make sure we are collecting the right data needed to understand what’s happening in the system, and that we are visualising this data in a way that enables that understanding. I have heard good things about https://www.honeycomb.io, but since PL is heavily invested in Grafana, we can use Grafana too.
- Error monitoring → Integrate Sentry. This may require improvements in Zinnia APIs and the Zinnia Runtime. (A minimal setup sketch follows this list.)
- Do we have a strong enough Fly.io instance to handle the load? → We can use the Fly.io dashboard to see the stats.
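On the backend side, the Sentry setup is small; a minimal sketch, assuming the `@sentry/node` SDK (error reporting from inside the Zinnia module itself is the part that may need runtime work):

```js
// Hypothetical sketch: initialise Sentry in the backend monolith.
// The DSN comes from the Sentry project settings, via an env var here.
import * as Sentry from '@sentry/node'

Sentry.init({
  dsn: process.env.SENTRY_DSN,
  // Sample a fraction of transactions for performance monitoring
  tracesSampleRate: 0.1,
})
```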
- Security [already covered]
The trouble: Station installations are permissionless and anonymous. Anybody can run a Station, and it’s easy to run thousands or even millions of Station instances concurrently. The code of Station Modules is open source, so attackers can inspect which HTTP APIs we are calling and call them in an automated way. It’s easy to flood our backend services.
- We need to set up reasonable spending limits on the infrastructure running the Orchestrator and Ingester, so that a flood of requests causes a DoS but not an astronomical bill to pay.
- This should already be covered; we will rely on the spending limits provided by Fly.io.
M2 Jun 30th (+/- 1 week): Lassie Retrievals
[12 days + 6 days for unknown unknowns = 3-5 weeks]
Replace the code making HTTP requests to IPFS Gateway with a retrieval client like Lassie.
Important: Retrieval requests from this module should be indistinguishable from “legit” requests made by other actors in the network (e.g. Saturn). Otherwise SPs can prioritise checker requests over regular traffic.
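As a sketch of what the module-side change could look like: assuming a Lassie HTTP daemon running next to the Station (started with something like `lassie daemon --port 53123`; the port and the deployment model are open questions, not decisions), the module would fetch through Lassie and receive a CAR file, which also paves the way for content verification.

```js
// Hypothetical sketch: retrieve through a local Lassie daemon instead of
// calling the IPFS Gateway directly. Lassie performs a real retrieval, so
// the request reaches the SP through the same protocols other clients use.
const res = await fetch(`http://127.0.0.1:53123/ipfs/${job.cid}`, {
  headers: { accept: 'application/vnd.ipld.car' },
})
const car = new Uint8Array(await res.arrayBuffer())
// Later milestone: verify the blocks in the CAR against the requested CID
```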