CID Sampling for SPARK

The first step in the SPARK retrieval check workflow is selecting the (cid, address) pair that a given checker should test.

We have the following requirements:

The distribution should be uniform.
- There should be equal probability that any given (cid, address) is chosen
- There should be equal probability that any given checker will be assigned to test the given (cid, address) job.
  - There is an exception: we need to build an honest majority of checkers retrieving the same cid from the same address in a single measurement epoch. The sampling algorithm must be able to account for this.
The sampling must not be predictable. SPs must not be able to predict which CIDs will be checked in the next round.
Neither SPs nor Checkers can influence which (cid, address, checker) triple is sampled.
We want to use IPv4 address blocks as a scarce resource preventing any single party from spinning up a large number of nodes and controlling a large portion of the network.
- Nodes register themselves with the orchestrator; that’s how we find their IPv4.
  - Later (smart-contracts) - create some sort of an oracle that provides node’s IP.
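
A minimal sketch of how IPv4 address blocks could act as the scarce resource, assuming the orchestrator sees each node's IPv4 address at registration. The /24 prefix length, the limit of one node per block, and the function names are illustrative assumptions, not decisions recorded in this document.

```python
import ipaddress
from collections import defaultdict

# Assumptions for illustration only: block size and per-block limit.
PREFIX_LENGTH = 24
MAX_NODES_PER_BLOCK = 1


def ipv4_block(ip: str, prefix_length: int = PREFIX_LENGTH) -> str:
    """Return the address block (e.g. '203.0.113.0/24') that contains `ip`."""
    network = ipaddress.ip_network(f"{ip}/{prefix_length}", strict=False)
    return str(network)


def eligible_nodes(registrations: list[tuple[str, str]]) -> list[str]:
    """Keep at most MAX_NODES_PER_BLOCK node ids per IPv4 block.

    `registrations` is a list of (node_id, ipv4) pairs in registration order.
    """
    seen = defaultdict(int)
    kept = []
    for node_id, ip in registrations:
        block = ipv4_block(ip)
        if seen[block] < MAX_NODES_PER_BLOCK:
            seen[block] += 1
            kept.append(node_id)
    return kept
```

With this rule, a party that controls many addresses inside one block still counts as a single node.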
The system must allow 3rd parties to verify that (CID, address) samples were chosen correctly.
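
One way to read the requirements above is as a deterministic function of public inputs. The sketch below assumes a per-round random seed (for example from a randomness beacon) and a public list of candidate pairs; `round_tasks`, `task_for_checker` and `tasks_per_round` are hypothetical names. It covers determinism and verifiability only. It does not stop a checker from choosing its own id, which would need a separate mechanism.

```python
import hashlib

Pair = tuple[str, str]  # (cid, address)


def _digest(*parts: bytes) -> int:
    """Hash length-prefixed parts with SHA-256 and return the result as an integer."""
    h = hashlib.sha256()
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return int.from_bytes(h.digest(), "big")


def round_tasks(seed: bytes, candidates: list[Pair], tasks_per_round: int) -> list[Pair]:
    """Pick the pairs with the lowest seed-dependent hash.

    Every pair has the same chance of being picked, and the result does not
    depend on the order of `candidates`.
    """
    ranked = sorted(candidates, key=lambda p: _digest(seed, p[0].encode(), p[1].encode()))
    return ranked[:tasks_per_round]


def task_for_checker(seed: bytes, checker_id: str, tasks: list[Pair]) -> Pair:
    """Map a checker to one of the round's tasks.

    The modulo bias is negligible because the digest has 256 bits.
    """
    index = _digest(seed, b"checker", checker_id.encode()) % len(tasks)
    return tasks[index]
```

Keeping `tasks_per_round` small relative to the number of checkers means many checkers land on the same pair in one epoch, which is what an honest majority needs. Anyone holding the seed and the candidate list can re-run both functions to verify an assignment.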

Additional thoughts:

If Content Producers are paying for retrieval checks for the set of CIDs they are publishing, then we need to limit the set of CIDs only to the CIDs reported by this particular Content Producer, as opposed to doing a random walk of IPNI advertisements and/or Boost deals.
Retrieval Bot uses the following approach for sampling FIL+ deals: “We actually end up tracking active deals in a db and run sampling based on what’s in there periodically, looking to spread across as many cid/sp id pairs as possible, rather than generating new tests every epoch.” Source: https://github.com/data-preservation-programs/RetrievalBot/blob/1bf7e9520f2445ccf9a98a033a08fd4e6f6701f6/filplus.md See also https://medium.com/filecoin-plus/retrieval-bot-is-live-ea577b61f7d3 Repository: https://github.com/data-preservation-programs/RetrievalBot

Possible solutions

❌ ~~A hard-coded list of (CID, address) to pick from. This list must be private to SPARK Orchestrator (SPs must not be able to access it.)~~
- ~~Upsides: Easy to implement in a centralised service. We already have this.~~
- ~~Downsides:~~
  - ~~This cannot be implemented as a smart contract. A smart contract will need the hard-coded list to be public, thus SPs would be able to predict checks.~~
  - ~~We need to periodically update the list. (Extra maintenance cost.)~~
  - ~~Not useful to the Reputation group; they want a source of CIDs that the community can trust.~~
A random walk of IPNI advertisements, using DRAND as a source of randomness.
1. Problem: there is only one IPNI instance → a single point of failure. There won’t be another consistent instance up & running by LabWeek.
2. This will be the solution we want to use in the long term; we will be able to trust a consortium of IPNI nodes to be correct.
What would our IPNI query look like?
- Inputs: DRAND seed, Deal ID
- Output: a CID randomly selected using DRAND seed
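
A rough sketch of what such a query could compute. This is not an existing IPNI API; `select_cid` and `uniform_index` are hypothetical names. It assumes the indexer can enumerate the CIDs belonging to the deal, and it uses rejection sampling so that every CID is equally likely.

```python
import hashlib


def uniform_index(seed: bytes, deal_id: int, n: int) -> int:
    """Map (seed, deal_id) to an index in range(n) without modulo bias."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Largest multiple of n not exceeding 2**64; values at or above it are rejected.
    limit = (1 << 64) - ((1 << 64) % n)
    counter = 0
    while True:
        data = seed + deal_id.to_bytes(8, "big") + counter.to_bytes(4, "big")
        value = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        if value < limit:
            return value % n
        counter += 1


def select_cid(seed: bytes, deal_id: int, deal_cids: list[str]) -> str:
    """Return one CID of the deal, chosen deterministically from the seed."""
    ordered = sorted(deal_cids)  # canonical order, so every caller agrees
    return ordered[uniform_index(seed, deal_id, len(ordered))]
```

Because the output depends only on the seed, the deal ID and the deal's CID set, a third party with the same index data can recompute the choice.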
Can the IPNI team build & ship this API in time for us?

There is no verification that SPs are submitting all CIDs to IPNI, and such verification won’t be in place by LabWeek.

→ Propose the new API - open a new GH issue in https://github.com/ipni/storetheindex/issues
A random walk of Filecoin storage deals
- Discussed here: ‣
- The expectation is that FIL+ data must be retrievable. There is a flag we can use to determine whether a deal is FIL+ or not.
- We can obtain SP Boost Worker HTTP endpoints by a query. We should run this query as part of job scheduling.
Algo:
- FIL state tree gives us a set of SPs (miners), and for each SP, we get the primary worker endpoint ⇒ that gives us about 700 active miners we can retrieve from.
- We have a list of all deals for each SP on the chain. Just a list of deals.
  - There is a catch: the output of the StateMarketDeals method is over 3GB compressed and over 23GB decompressed. SPARK nodes cannot work with a dataset this large.
- We can use the Boost-provided indexer API to ask the SP which CIDs they have, or we can fetch that information from IPNI instead.
  - If the SP does not provide any indexes - we can flag the SP as unreliable.
  - If the index data is corrupted - ??
  - We can also download the full Piece, re-index it, and verify that the SP is advertising correct indexes.
- Ask for a random offset inside the piece identified by PieceID and look for a CID block at that offset.
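
The walk above can be sketched as follows, assuming the miner endpoints and per-miner deal lists have already been fetched and reduced to something small enough to hold in memory. The `Deal` and `Sample` types, `sample_retrieval` and the input shapes are hypothetical. Resolving the offset to a CID (via Boost or IPNI) is left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Deal:
    deal_id: int
    piece_cid: str
    piece_size: int  # bytes, must be positive
    verified: bool   # True for FIL+ deals


@dataclass(frozen=True)
class Sample:
    miner_id: str
    endpoint: str
    deal_id: int
    piece_cid: str
    offset: int


def pick(seed: bytes, label: str, n: int) -> int:
    """Derive an index in range(n); modulo bias is negligible for a 256-bit digest."""
    digest = hashlib.sha256(seed + b"/" + label.encode()).digest()
    return int.from_bytes(digest, "big") % n


def sample_retrieval(seed: bytes, endpoints: dict[str, str],
                     deals: dict[str, list[Deal]]) -> Sample:
    """Pick a miner, then one of its FIL+ deals, then an offset inside the piece."""
    # Only miners with a known endpoint and at least one FIL+ deal are candidates.
    miners = sorted(m for m in endpoints if any(d.verified for d in deals.get(m, [])))
    if not miners:
        raise ValueError("no miner with an endpoint and a FIL+ deal")
    miner = miners[pick(seed, "miner", len(miners))]
    verified = sorted((d for d in deals[miner] if d.verified), key=lambda d: d.deal_id)
    deal = verified[pick(seed, "deal", len(verified))]
    offset = pick(seed, "offset", deal.piece_size)
    return Sample(miner, endpoints[miner], deal.deal_id, deal.piece_cid, offset)
```

Note that this walk is uniform over miners, not over deals or CIDs: a deal held by a miner with few deals is more likely to be picked than a deal held by a miner with many. Meeting the uniformity requirement would need weighting by deal count or piece size.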
❌ For each SPARK deal, the party paying for the retrieval provides a public list of CIDs & addresses to check. (This will be presumably based on Filecoin storage deals made by the paying party.)
- This list can be directly linked to an instance of MERidian smart-contract governing the work and rewards.
- The sampling can be driven by DRAND randomness (later) or a centralised service (initially?).
- We don’t mind if SPs prioritise retrievals for CIDs on this public list over other retrievals, because the client paying SPARK wants to get good retrievals, right? Anybody can pay for their own SPARK retrieval contract to get better retrieval performance for their content.

Meeting notes

2023-09-11

Will’s doc

We don’t know that the advertisement is honest