<aside> 💡 The Production Engineering team merged with ProbeLab on 23rd January 2023. This page and those under it are to be considered historical and will eventually be archived.

</aside>

The KPI targeted for this project is the 7-day mean of the 95th percentile time to first byte (TTFB), as reported by all Gateway NGINX servers.

TTFB encapsulates the time taken for go-ipfs to resolve and read a requested block. The block may be the root of a file DAG that must be retrieved to fulfill the entire request, but this additional retrieval time may not be part of the TTFB metric if go-ipfs is able to begin streaming to the client immediately.
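To make the KPI definition concrete, here is a minimal sketch of how a 7-day mean of 95th percentile TTFB could be computed. It assumes the percentile is taken once per day and that the daily values are then averaged; the page does not state the aggregation granularity, and the function names and the nearest-rank percentile method are illustrative choices rather than what the dashboards actually do.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ttfb_kpi(daily_ttfb_samples, pct=95, window_days=7):
    """Mean of the per-day percentile over the most recent window.

    daily_ttfb_samples is a list with one entry per day, oldest first;
    each entry is a list of TTFB measurements in seconds (assumed layout).
    """
    recent = daily_ttfb_samples[-window_days:]
    return mean(percentile(day, pct) for day in recent)
```

Averaging daily percentiles is not the same as taking one percentile over the pooled week of samples, so the choice of aggregation matters when comparing numbers across dashboards.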

Quick Links

🔒 indicates private/internal resource

🔒 ProdEng Primary Gateway KPIs (grafana)
🔒 Gateway TTFB Breakdown (grafana)

Current Focus: Increase data locality

GitHub Epic: Increase data locality in Gateway

The overarching hypothesis is that TTFB can be improved by ensuring that more requests are served from the local blockstore instead of waiting on network block discovery. Data that is popular and expensive to fetch should be held locally by gateways for longer, while still respecting limits on disk usage.

Initially we are investigating better, cache-aware garbage collection strategies for go-ipfs (GitHub issue). We theorize that blocks are being deleted from the blockstore by garbage collection only to be re-requested at a later date, at which point they must be fetched from the network again, which is much slower than a local read. Garbage collection is required to keep within disk space limits, but the current implementation is unaware of the usage patterns of unpinned blocks.

The default go-ipfs garbage collection strategy is mark and sweep:

mark all:
- recursively pinned blocks, plus all of their descendants
- bestEffortRoots, plus all of their descendants (recursively)
- directly pinned blocks
- blocks utilized internally by the pinner
then iterate over every block in the blockstore and delete any block not found in the marked set

(bestEffortRoots, by default, only includes the MFS root)
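The mark and sweep steps above can be sketched as follows. This is an illustrative model only, not the go-ipfs implementation: the blockstore is a plain dict keyed by CID, `links` is a hypothetical mapping from a CID to its child CIDs, and the function and parameter names are invented for the example.

```python
def descendants(root, links):
    """Return the root plus every block reachable from it via links."""
    seen = set()
    stack = [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        # Blocks with no entry in links are treated as leaves.
        stack.extend(links.get(cid, ()))
    return seen


def mark_and_sweep(blockstore, links, recursive_pins,
                   best_effort_roots, direct_pins, internal_pins):
    """Delete every unmarked block from blockstore and return the removed CIDs."""
    # Mark phase: build the set of blocks that must be kept.
    marked = set()
    for root in recursive_pins:
        marked |= descendants(root, links)
    for root in best_effort_roots:
        marked |= descendants(root, links)
    marked |= set(direct_pins)
    marked |= set(internal_pins)

    # Sweep phase: iterate over every block and delete anything unmarked.
    removed = [cid for cid in list(blockstore) if cid not in marked]
    for cid in removed:
        del blockstore[cid]
    return removed
```

Note that nothing in this procedure looks at how often, or how recently, an unpinned block was requested, which is the gap the proposed strategy aims to close.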

We think improving retention of valuable blocks is a tractable problem to solve for the current small ProdEng team (1.5 people). We plan to:

- Extend go-ipfs to allow different garbage collection strategies to be implemented.
- Implement a new garbage collection strategy tuned to Gateway use cases. This algorithm would assign each block a cost derived from its frequency of access and the time taken to fetch it from peers. The garbage collector would then delete the blocks with the lowest cost until sufficient space has been freed. More detail here.
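As a rough sketch of the cost-based strategy described above: the exact cost function is not specified on this page, so the example assumes cost is simply request rate multiplied by mean fetch time. The `BlockStats` type, its field names, and `select_for_eviction` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    cid: str
    size_bytes: int
    requests_per_hour: float
    mean_fetch_seconds: float


def retention_cost(stats):
    """Assumed cost: expected seconds of network fetching per hour if evicted."""
    return stats.requests_per_hour * stats.mean_fetch_seconds


def select_for_eviction(blocks, bytes_to_free):
    """Pick the lowest-cost blocks until at least bytes_to_free would be freed."""
    victims = []
    freed = 0
    for stats in sorted(blocks, key=retention_cost):
        if freed >= bytes_to_free:
            break
        victims.append(stats.cid)
        freed += stats.size_bytes
    return victims
```

The caller is assumed to pass only unpinned blocks, since pinned blocks must be retained regardless of cost.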

Outline of proposed approach:

Phase 1 (validate whether current GC is inefficient)
- [Status: ~~In progress~~ Done; Est. Effort: 1 week; ETA: 08 Jun 2022] Roughly instrument the go-ipfs block system to record the following metrics:
  - mean request rate of a block,
  - mean time to request the block from the network
- [Status: ~~Not Started~~ In Progress; Est. Effort: 1 week; ETA: 16 Jun 2022] Use current request patterns (replaying live traffic derived from ipfs.io) to evaluate the metrics and determine whether to proceed to Phase 2
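The two Phase 1 metrics could be recorded along these lines. This is a simplified in-memory sketch for illustration, not the instrumentation added to go-ipfs; the class and method names are hypothetical, and the request rate is assumed to be a plain count divided by the observation window.

```python
from collections import defaultdict


class BlockMetrics:
    """Records per-block request counts and network fetch durations."""

    def __init__(self):
        self.request_counts = defaultdict(int)
        self.fetch_durations = defaultdict(list)

    def record_request(self, cid):
        """Call once for every request for the block, local hit or not."""
        self.request_counts[cid] += 1

    def record_network_fetch(self, cid, seconds):
        """Call when the block had to be fetched from the network."""
        self.fetch_durations[cid].append(seconds)

    def mean_request_rate(self, cid, window_seconds):
        """Requests per second for the block over the observation window."""
        return self.request_counts.get(cid, 0) / window_seconds

    def mean_fetch_time(self, cid):
        """Mean network fetch time in seconds, or None if never fetched."""
        durations = self.fetch_durations.get(cid)
        if not durations:
            return None
        return sum(durations) / len(durations)
```

Replaying a traffic sample through such a recorder would give, for each block, the two inputs needed to judge whether garbage collection is evicting blocks that are both frequently requested and slow to refetch.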