(keeping the changes section at the top; jump to "how we are measuring" for the current setup)

changes we want to make

  1. The arc network needs a better way of doing races, and should not use CAR files for every request. Proposal: there are a bunch of possible approaches here, but we may ultimately need to split the arc logic so it can run multiple "tests" or "experiments" at the same time. Right now we logically have one experiment running: a single list of requests (all for CAR files) run against rhea, ipfs.io, and saturn. Instead, we could have two active experiments: one testing CAR file requests comparing Saturn and ipfs.io (which may not add a lot of value other than generating traffic to saturn nodes?), and another racing requests against ipfs.io and rhea, where the requests would be the top N requests against ipfs.io, so we would expect most to be format=raw (Slack thread)
  2. We should investigate other log-store solutions for Saturn. A SQL DB is not a great fit for this, and we will incur maintenance costs that we could lessen by using a more off-the-shelf solution built for logging. We have not determined when to pull the trigger on this work.
  3. Logging full request bodies. It has been suggested that we could have a logging mode that infrequently logs the entire response body, possibly even at each layer of the stack. This could be useful for very deep-dive debugging of specific requests with behaviors that are not easily reproducible. We believe this may be worth doing, although it is not something we immediately need for our investigations.
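As a sketch of the experiment split proposed in item 1 (the structure and all names here are made up for illustration, not the actual arc code), the single hard-coded run could become a registry of independent experiments:

```python
from dataclasses import dataclass

@dataclass
class Experiment:
    """One independent arc 'experiment': a request set raced across backends."""
    name: str
    backends: list        # services every request in the set is raced against
    request_format: str   # "car" or "raw"

# Hypothetical split of today's single experiment into two:
EXPERIMENTS = [
    # CAR traffic comparing Saturn and ipfs.io (mostly generates L1 traffic)
    Experiment("car-saturn-vs-ipfs", ["saturn", "ipfs.io"], "car"),
    # top-N ipfs.io requests raced against rhea, mostly format=raw
    Experiment("raw-rhea-vs-ipfs", ["ipfs.io", "rhea"], "raw"),
]

def experiments_for(backend: str):
    """Return the experiments that will send traffic to a given backend."""
    return [e for e in EXPERIMENTS if backend in e.backends]
```

With a shape like this, adding or retiring an experiment is a registry change rather than a rewrite of the race loop.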

how we are measuring

bifrost load balancer (nginx) prometheus metrics

We measure at the nginx load balancers. This monitoring data goes into Prometheus and can be visualized in Grafana. It gives us a (mostly) apples-to-apples look at the full request execution for both the rhea and old gateway implementations.

This shows us the metrics at the point in our service closest to the user. We are currently using this data for the primary project Rhea metrics.

However, because the measurement happens as part of the infra itself, we run into issues using these metrics, particularly in our test configuration. For instance, the way traffic mirroring is configured means we cannot rely on response-size data from here.
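Assuming the nginx metrics include the usual Prometheus-style latency histograms, the apples-to-apples comparison comes down to estimating quantiles per backend from cumulative buckets; a minimal sketch of that math (the bucket data is invented), interpolating within a bucket the way PromQL's histogram_quantile() does:

```python
def quantile_from_buckets(buckets, q):
    """Estimate a quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count)
    pairs sorted by bound, as a Prometheus latency histogram exports.
    Uses linear interpolation within the bucket the target falls in.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if count == prev_count:
                return bound
            # interpolate the target's position inside this bucket
            frac = (target - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# e.g. one backend's request-duration buckets (made-up numbers)
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (5.0, 100)]
p95 = quantile_from_buckets(buckets, 0.95)
```

Running the same estimate over the rhea and old-gateway histograms gives the per-backend latency comparison the Grafana dashboards visualize.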

saturn prod db

This is a postgres database that the Saturn team maintains and the primary place where all logs of requests to L1s go. There’s also a table of log lines written by Caboose. In addition, Saturn is stripping headers from Lassie requests and stuffing them in a table called “lassie_logs”. There’s now a consistent ID used across requests, and we’re building code to make it easy to pull together the full view of all logs for a request. We’re documenting the details of this data store here.
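A minimal sketch of what pulling together the full view could look like, assuming each table’s rows carry the shared traceparent ID (all other column names and values here are invented for illustration):

```python
from collections import defaultdict

def assemble_request_view(*log_tables):
    """Group log rows from several tables by their shared trace ID.

    Each table is a list of row dicts carrying a 'traceparent' column
    (the consistent ID shared across the Saturn prod DB tables).
    """
    by_trace = defaultdict(list)
    for table in log_tables:
        for row in table:
            by_trace[row["traceparent"]].append(row)
    return dict(by_trace)

# invented rows standing in for the L1, Caboose, and lassie_logs tables
l1_rows = [{"traceparent": "t1", "layer": "l1", "status": 200}]
caboose_rows = [{"traceparent": "t1", "layer": "caboose", "status": 200}]
lassie_rows = [{"traceparent": "t1", "layer": "lassie", "status": 200}]

full_view = assemble_request_view(l1_rows, caboose_rows, lassie_rows)
```

In practice this grouping would be a SQL join on the trace ID rather than an in-memory pass, but the shape of the result is the same: one record per request, with a row from each layer that logged it.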

arc network

The arc network can be configured to send different sets of requests to different services. It’s currently running a single “experiment” in which a set of requests (the top requests to ipfs.io) is sent to ipfs.io, rhea, and saturn L1s directly (DNS binding). Currently these requests are all for CAR format; that needs to be fixed.

Database: The results from arc are stored in a table in the Saturn prod DB called “race_requests”. Arc now has the same consistent tracing ID (“traceparent”) as the other tables in saturn_prod_db.
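As a sketch of how the race results could be summarized once rows share a traceparent (the duration and backend columns here are invented for illustration, not the real race_requests schema):

```python
from collections import Counter

def win_counts(race_rows):
    """Tally which backend returned first in each race.

    Each row is one backend's result for one request; rows sharing a
    'traceparent' belong to the same race.
    """
    best = {}  # traceparent -> (duration_ms, backend)
    for row in race_rows:
        key = row["traceparent"]
        candidate = (row["duration_ms"], row["backend"])
        if key not in best or candidate < best[key]:
            best[key] = candidate
    return Counter(backend for _, backend in best.values())

# invented rows: two races, each with two backends
rows = [
    {"traceparent": "t1", "backend": "rhea", "duration_ms": 120},
    {"traceparent": "t1", "backend": "ipfs.io", "duration_ms": 90},
    {"traceparent": "t2", "backend": "rhea", "duration_ms": 80},
    {"traceparent": "t2", "backend": "ipfs.io", "duration_ms": 95},
]
counts = win_counts(rows)
```

The same aggregation is a GROUP BY over race_requests in practice; the point is just that the shared trace ID is what makes the per-race comparison possible.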

Prometheus: Arc data is also going into Prometheus/Grafana.