As a reminder: Thunderdome is a way to compare the performance and behaviour of different versions of IPFS gateways using real traffic in a controlled environment. It’s being developed by the Production Engineering team, currently Ian Davis (@iand) and Tommy Hall (@thattommyhall).

In our last update Tommy demonstrated how we could define an experiment, stand it up and have traffic streamed to it within minutes. Since then we have been refining that process and improving the metrics we report and the dashboards we produce.

We’re now in a position to run several experiments concurrently. We have set up and run four different experiments for Kubo, streaming live Gateway traffic to each of them for over 24 hours. This update lists some key findings and dives a bit deeper into the experiments we are running.

Key Findings

We have only been running these experiments for 1-2 days, so it is still early to be drawing conclusions. But each of the three significant experiments we have run has uncovered some interesting data or behaviour, so we’re really excited that Thunderdome is proving to be a useful tool to make available for everyone to use.

Each experiment is discussed in detail below but if you just want to dive into the metrics, these are the main charts we use on Grafana:

Experiment One: tweedles

tweedles is our null experiment. We run two identical instances of Kubo (dee and dum) and send the same traffic to each of them. We do this to give us more confidence that dealgood, the component that sends the stream of requests and measures response times, is behaving well. Every instance in an experiment gets sent the same requests by dealgood so they should respond similarly.
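
To make that fan-out idea concrete, here is a minimal Go sketch of sending the same gateway request to every target and recording how long each takes. The target names and URLs are placeholders and this is not dealgood's actual code, just an illustration of the principle under those assumptions.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical targets standing in for the two identical Kubo instances;
// in the real experiment the instances are provisioned by Thunderdome.
var targets = map[string]string{
	"dee": "http://dee.example.internal:8080",
	"dum": "http://dum.example.internal:8080",
}

// sendToAll issues the same gateway request path to every target and
// records how long each response takes.
func sendToAll(path string) {
	for name, base := range targets {
		start := time.Now()
		resp, err := http.Get(base + path)
		if err != nil {
			fmt.Printf("%s: error: %v\n", name, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: %s in %v\n", name, resp.Status, time.Since(start))
	}
}

func main() {
	sendToAll("/ipfs/bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi")
}
```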

In the tweedles experiment the TTFB metric shows almost no difference at any of the quantiles, which indicates that we're running a fair test.
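
As a rough illustration of how per-target latency samples can be reduced to quantiles, here is a small Go sketch using a simple sorted-index method with made-up numbers. The real charts are built from the experiment's metrics in Grafana, not from code like this.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// quantile returns an approximate q-th quantile (0 <= q <= 1) of a set of
// durations by indexing into the sorted samples.
func quantile(samples []time.Duration, q float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := int(q * float64(len(sorted)-1))
	return sorted[idx]
}

func main() {
	// Hypothetical TTFB samples collected from one target.
	samples := []time.Duration{
		30 * time.Millisecond, 45 * time.Millisecond, 50 * time.Millisecond,
		80 * time.Millisecond, 120 * time.Millisecond, 900 * time.Millisecond,
	}
	for _, q := range []float64{0.5, 0.9, 0.99} {
		fmt.Printf("p%.0f TTFB: %v\n", q*100, quantile(samples, q))
	}
}
```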

target metrics (tweedles experiment)

Resource utilization is similar too, with both instances averaging about 27k goroutines and using 5-6 GiB of heap.
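
For context, goroutine count and heap usage are values the Go runtime exposes directly. The sketch below shows one way to read them from inside a process; the experiments gather these figures from the instances' metrics rather than with code like this.

```go
package main

import (
	"fmt"
	"runtime"
)

// reportResources prints the current goroutine count and heap-in-use size
// for this process, the same kinds of numbers shown on the dashboards.
func reportResources() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("goroutines: %d\n", runtime.NumGoroutine())
	fmt.Printf("heap in use: %.2f GiB\n", float64(m.HeapInuse)/(1<<30))
}

func main() {
	reportResources()
}
```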

In terms of request handling, both instances are receiving about 15 requests per second and there are almost zero drops. Drops indicate that an instance can't keep up with the number of requests that dealgood is sending: a request is dropped if there are too many in flight at any one time. The number of concurrent requests and the maximum rate at which they are sent are configured per experiment. In the future we plan to adapt these dynamically as part of warming up the instances, so we can make sure the instances are fully utilized.
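
The dropping behaviour can be sketched with a simple in-flight cap: if no slot is free when a request arrives, it is counted as dropped rather than queued. This is only an illustration of the mechanism described above under that assumption, not dealgood's actual implementation; the cap corresponds to the per-experiment concurrency setting.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Loader runs work against a target while capping the number of requests
// in flight. When the cap is reached, new requests are dropped and counted.
type Loader struct {
	sem     chan struct{}
	dropped int64
}

func NewLoader(maxInFlight int) *Loader {
	return &Loader{sem: make(chan struct{}, maxInFlight)}
}

// Do runs fn if an in-flight slot is free, otherwise records a drop.
func (l *Loader) Do(fn func()) {
	select {
	case l.sem <- struct{}{}:
		defer func() { <-l.sem }()
		fn()
	default:
		atomic.AddInt64(&l.dropped, 1)
	}
}

func main() {
	l := NewLoader(2) // allow at most 2 requests in flight

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.Do(func() { time.Sleep(50 * time.Millisecond) }) // simulated request
		}()
	}
	wg.Wait()
	fmt.Println("dropped:", atomic.LoadInt64(&l.dropped))
}
```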