<aside>
📅 This document was last given an update pass on July 19, 2023
</aside>
This page is a living document laying out the approach, the major issues and workstreams, and the plan to see the Rhea project through to completion. It does not replace the top-level Rhea project page; see that page for additional context and background, since we won’t extensively explain or link out in the remainder of this document.
Since we’re focused on M1 right now, this document covers M1. We’re not (yet) getting to this level of detail for later milestones.
Where we are (narrative)
- All of the system is built, with the exception of the “Bifrost Gateway Incremental Verification” work: a rewrite of part of bifrost-gateway to address a number of bugs and race conditions
- bifrost-gateway, Saturn L1, and Lassie all pass the Conformance Tests
- We are currently mirroring 30% of production IPFS Gateway traffic to Rhea
- We are still working on correctness and latency issues
- We discovered a blocking problem with measuring response sizes (a correctness metric) at the nginx load balancers, which prompted the idea of switching to arc network measurements to validate response sizes
- Once we started using arc network data, we found it is actually a much richer way of measuring and assessing correctness and latency: we can now directly compare, side by side, the same request sent to both systems, and we can do this at scale across millions of such tests.
- We also found that we could combine this with data from other tables in the database to derive much richer insights into system behavior than we have had to date. (We’ve also started keeping a comprehensive doc on the various data sources available.)
- We have made modifications to arc to race non-car requests against the old and new gateways; this included some re-architecting of arc to support a notion of “tests”, so that in the future we can more easily spin up new comparisons for arc to run.
- We now have a trace id that is threaded through all the components in the stack and is also accessible in arc, letting us tie a race to a complete record of that request’s flow through the system. And we can easily aggregate over all the traces we have.
- The biggest problem with arc right now is that we aren’t actually running representative query sets through it. The requests we use come from Saturn and/or Lassie logs, and for a number of reasons that traffic doesn’t end up looking like what the gateway itself sees.
- To build confidence in the arc data and use it for the launch decision, we need to work with the bifrost team to do actual representative sampling from the gateway’s nginx logs.
- We believe we are now hitting our goal on response size validation, pending re-evaluation with a truly representative traffic sample (details)
- We are very close to hitting our goal on response code validation, again pending re-evaluation with a truly representative traffic sample (details)
- Latency as measured by arc is looking surprisingly good as well, and we’re starting to dig in more here
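The request racing described above can be pictured with a small sketch. This is not arc’s actual code: the `race`/`fetch` interface, the gateway URLs, and the comparison fields (status code, body size, latency) are all illustrative assumptions.

```python
import concurrent.futures
from dataclasses import dataclass

@dataclass
class RaceResult:
    status_match: bool  # did both gateways return the same status code?
    size_match: bool    # did both bodies have the same length?
    old_ms: float       # latency of the old gateway
    new_ms: float       # latency of the new gateway

def race(fetch, path, old_base, new_base):
    # Send the identical request to both gateways in parallel and
    # compare the responses. `fetch(url)` is assumed to return a
    # (status_code, body_bytes, elapsed_ms) tuple.
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        old_fut = pool.submit(fetch, old_base + path)
        new_fut = pool.submit(fetch, new_base + path)
        old_status, old_body, old_ms = old_fut.result()
        new_status, new_body, new_ms = new_fut.result()
    return RaceResult(
        status_match=old_status == new_status,
        size_match=len(old_body) == len(new_body),
        old_ms=old_ms,
        new_ms=new_ms,
    )

# Stubbed fetch for illustration; a real run would issue HTTP requests.
def fake_fetch(url):
    return (200, b"x" * 100, 12.5)

result = race(fake_fetch, "/ipfs/example-cid",
              "https://old.example", "https://new.example")
print(result.status_match, result.size_match)  # True True
```

Running each “test” as a pure function of a `fetch` callable is one way the “tests” notion could stay pluggable: a new comparison is just a new function over the same raced responses.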
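Because each race carries a trace id, race outcomes can be joined against trace records to aggregate behavior across the stack. A minimal sketch of that join follows; the field names (`trace_id`, `status_match`, `backend`) are invented for illustration and are not the actual schema.

```python
from collections import defaultdict

def mismatch_rate_by_backend(races, traces):
    # races:  list of {"trace_id": str, "status_match": bool} race records
    # traces: {trace_id: {"backend": str, ...}} from the trace store
    # Returns the fraction of mismatched status codes per backend.
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for r in races:
        backend = traces.get(r["trace_id"], {}).get("backend", "unknown")
        totals[backend] += 1
        if not r["status_match"]:
            mismatches[backend] += 1
    return {b: mismatches[b] / totals[b] for b in totals}

races = [
    {"trace_id": "t1", "status_match": True},
    {"trace_id": "t2", "status_match": False},
    {"trace_id": "t3", "status_match": True},
]
traces = {
    "t1": {"backend": "saturn-l1"},
    "t2": {"backend": "saturn-l1"},
    "t3": {"backend": "lassie"},
}
print(mismatch_rate_by_backend(races, traces))
# {'saturn-l1': 0.5, 'lassie': 0.0}
```

In practice this aggregation would be a SQL join over the database tables rather than an in-memory loop, but the shape of the question is the same: group races by some trace attribute and compute a correctness rate per group.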