Closed-Loop vs Open-Loop Benchmarking
I have been working on benchmarking lately, especially since I started tinkering with AI, and I came back to an interesting topic: closed-loop vs open-loop benchmarking. To make it easier to explain, I built a simple HTTP server to simulate a controlled environment.
This simple HTTP server responds in about 20ms most of the time, but 5% of requests take 200ms, simulating occasional slowness. Real systems have this kind of blip for various reasons: GC pauses, segment merges during indexing, cache misses, the occasional lock contention, and so on.
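I won't reproduce my full server here, but a minimal sketch of it takes only the standard library. The port, handler name, and `serve` helper are my choices for illustration; the 5% slow fraction and the 20ms/200ms values are just the numbers described above:

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SLOW_FRACTION = 0.05      # 5% of requests hit the slow path
FAST_MS, SLOW_MS = 20, 200

def service_time_ms(rng=random):
    """Draw one request's service time: usually 20ms, occasionally 200ms."""
    return SLOW_MS if rng.random() < SLOW_FRACTION else FAST_MS

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(service_time_ms() / 1000.0)   # simulate the work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def serve(port=8080):
    """Run the simulated endpoint (blocks forever)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```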
I wrote two simple scripts. Both hit the same endpoint on the same server, but with different approaches. One point to remember: both clients are single-threaded, just to keep it simple.
The Two Approaches
Closed-loop: sequential requests, one at a time. Fire a request and wait for it to complete before firing the next one. Open-loop: fire requests on a fixed schedule. For example, at 20 QPS, one request is sent every 50ms.
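These are not my exact scripts, but a stripped-down sketch of the two loops. `send` stands in for the blocking HTTP call (e.g. `session.get(url)`); everything else is just the bookkeeping described above:

```python
import time

def closed_loop(send, duration_s):
    """Closed loop: fire a request, wait for it, only then fire the next."""
    latencies_ms = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        send()                                   # blocking call, e.g. session.get(url)
        latencies_ms.append((time.monotonic() - t0) * 1000)
    return latencies_ms

def open_loop(send, duration_s, qps):
    """Open loop: fire on a fixed schedule; also track sojourn (slot -> done)."""
    interval = 1.0 / qps
    start = time.monotonic()
    scheduled = start
    latencies_ms, sojourns_ms = [], []
    while scheduled < start + duration_s:
        now = time.monotonic()
        if now < scheduled:
            time.sleep(scheduled - now)          # wait for our slot
        t0 = time.monotonic()
        send()
        done = time.monotonic()
        latencies_ms.append((done - t0) * 1000)
        sojourns_ms.append((done - scheduled) * 1000)
        scheduled += interval                    # the schedule advances regardless
    return latencies_ms, sojourns_ms
```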
Here is what I observed.
Closed-loop script output
Throughput: ~18 req/s (self-regulated: 1 / 0.055s mean latency)
p50: 44ms
p95: 223ms
p99: 245ms
Mean: 55ms
Open-loop at 20 QPS output
Achieved QPS: 19 (target 20, keeping up)
Client latency
p50: 43ms
p99: 227ms (similar to closed loop)
Sojourn
p50: 1068ms
p99: 1717ms ← users waiting ~1s even at "healthy" load
Queue wait mean: 906ms
A quick glance tells you that the client throughput and latencies are similar. But look at the sojourn times: p50 = 1068ms. That seems high, right?
Let's now understand what is happening in each.
What's happening in closed loop
Closed loop regulates itself. It fires a query only when the previous one has completed, so it never pushes the system; throughput simply drops. Here is a quick formula:
throughput = concurrency / mean_latency
= 1 / 0.055s
= ~18 req/s
This, though, is not a property of the server. It is a constraint of the test method: the client, in a way, is doing the throttling, not the server. We just have one car on the road, and hence no traffic jam.
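A quick sanity check of that arithmetic. The multi-worker lines are an extrapolation of the same formula, assuming the server itself is not yet the bottleneck:

```python
mean_latency_s = 0.055       # measured mean from the closed-loop run above

# A closed-loop client with N workers tops out at N / mean_latency req/s:
for concurrency in (1, 2, 4):
    print(concurrency, "worker(s) ->", round(concurrency / mean_latency_s, 1), "req/s")
```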
So what exactly is sojourn time?
The open-loop benchmarking script has two clocks.
scheduled_time = start_time                  # advances by a fixed 50ms every loop
while time_not_up:
    sleep_until(scheduled_time)              # wait for our slot
    req_start = time.time()                  # when we actually sent
    response = session.get(url)              # blocking: wait for the response
    req_end = time.time()
    latency_ms = (req_end - req_start) * 1000       # what the server spent: ~20ms or ~200ms
    sojourn_ms = (req_end - scheduled_time) * 1000  # scheduled slot -> complete: user experience
    scheduled_time += interval               # the next slot moves forward regardless
Remember that both of our experiments are single-threaded. We could simulate this with multiple threads as well, but a single thread is easier to explain. And in open loop we wanted to fire a query every 50ms, so scheduled_time is a clock that ticks at that target rate no matter what. Here is what happens after a 200ms outlier at 20 QPS (50ms interval):
t=0ms:   slot 0ms → send req1 → server takes 200ms → done at t=200ms
         latency = 200ms   sojourn = 200ms   queue_wait = 0ms
t=200ms: slot 50ms was 150ms ago → send req2 immediately → server takes 20ms → done at t=220ms
         latency = 20ms    sojourn = 170ms   queue_wait = 150ms
t=220ms: slot 100ms was 120ms ago → send req3 immediately → server takes 20ms → done at t=240ms
         latency = 20ms    sojourn = 140ms   queue_wait = 120ms
Do you see what is happening? The script is still single-threaded, so there is no actual concurrency. If this were a real user whose request was scheduled at 50ms, they would have waited 150ms before their request was even fired. And though the server latency was 20ms, for them the request took 170ms.
Sojourn captures this gap in our script.
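The timeline above can be replayed with a tiny simulation. This is not the benchmark script itself, just a model of its single-threaded schedule:

```python
def open_loop_sim(service_times_ms, interval_ms):
    """Replay service times against a fixed schedule with one client thread."""
    free_at = 0.0                        # when the client can send the next request
    rows = []
    for i, svc in enumerate(service_times_ms):
        slot = i * interval_ms           # when the request *should* have been sent
        send = max(slot, free_at)        # late if the previous request overran
        done = send + svc
        rows.append({"slot": slot, "queue_wait": send - slot,
                     "latency": svc, "sojourn": done - slot})
        free_at = done
    return rows

# The three requests from the timeline above:
rows = open_loop_sim([200, 20, 20], interval_ms=50)
# sojourns come out as 200, 170, 140 and queue waits as 0, 150, 120
```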
OK, let's now find the capacity cliff.
Capacity Cliff
The next experiment makes this more interesting. Here we increase load levels and record both server latencies and sojourn time for each step.
Target Achieved Lat p99 Sojourn p50 Sojourn p99 Q-Wait Status
------------------------------------------------------------------------------
5 5.2 220ms 42ms 220ms 1ms ✓ OK
10 10.2 221ms 33ms 222ms 4ms ✓ OK
20 19.4 217ms 108ms 256ms 61ms ✓ OK
30 25.2 225ms 70ms 1005ms 155ms ✗ SATURATED
40 26.0 214ms 1543ms 2712ms 1467ms ✗ SATURATED
50 23.1 227ms 2679ms 5805ms 2869ms ✗ SATURATED
What we are seeing is that irrespective of our target, the achieved QPS flattens out at about 25. The server latencies stay similar, since nothing changed for the server. But look at sojourn p99: at 20 QPS it is ~256ms; at 30 QPS it jumps to ~1005ms and keeps climbing. That is the cliff, somewhere between 20 and 30 QPS. By 40 QPS even the median user waits over 1.5s for their call to get fired, and the system is saturated: it cannot go faster and just keeps accumulating debt. You will not see this kind of pressure in self-regulating closed-loop testing. It will always report healthy latencies, and your capacity estimate can be off by a large factor.
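You can see the same cliff in a pure simulation of the ideal server (20ms with 5% at 200ms gives a 29ms mean, so the ideal capacity is ~34 QPS; the real server carries extra overhead, which is why its measured cliff sits nearer 25):

```python
import random

def sojourn_p50_ms(target_qps, n=2000, seed=7):
    """Median sojourn for a single-threaded open-loop client at a target rate."""
    rng = random.Random(seed)
    interval = 1000.0 / target_qps             # ms between scheduled slots
    free_at, sojourns = 0.0, []
    for i in range(n):
        svc = 200 if rng.random() < 0.05 else 20   # the server's latency mix
        slot = i * interval
        send = max(slot, free_at)              # queue behind the previous request
        free_at = send + svc
        sojourns.append(free_at - slot)
    return sorted(sojourns)[n // 2]

for qps in (5, 10, 20, 30, 40):
    print(qps, "qps -> sojourn p50 ~", round(sojourn_p50_ms(qps)), "ms")
```

Below the cliff the median barely moves; above it, the backlog drifts upward for the whole run and the median sojourn explodes.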
Finally, let's talk about the
Spike Test
We simulate a system under load with a random spike: the endpoint runs at 20ms normally but takes a 2-second pause a few seconds into the test. Let's simulate it.
Closed-loop result:
Total requests: 278 Actual QPS: 23.1
P50: 37ms
P95: 47ms
P99: 48ms ← spike barely registers
Max: 2030ms
Open-loop result at 20 QPS:
Total requests: 241 Missed deadlines: 81 (33.6%)
Server latency P50: 23ms P99: 48ms Max: 2003ms
Sojourn time P50: 43ms P99: 1950ms Max: 2004ms
Queue wait Mean: 954ms P99: 1954ms
Closed-loop p99: 48ms. Open-loop sojourn p99: 1950ms.
Same server. Same 2-second pause. Same test duration.
This is a domino effect in play. In closed loop, only one request hit the spike (our two-second pause): the benchmark stopped measuring at exactly the moment the server stalled. This is what is known as coordinated omission: the client coordinates with the slow server and skips measuring the requests that would have been in flight. In open loop, our clock keeps ticking: about 40 requests queued up behind the 2-second delay, and their sojourn times show the full damage, with p99 jumping to ~2s for a server that normally runs at 23ms.
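One way people patch up closed-loop numbers is the correction Gil Tene built into HdrHistogram: when a sample exceeds the expected interval, back-fill the measurements the stalled client never got to take. A rough sketch of the idea (not the library's actual code):

```python
def correct_for_omission(latencies_ms, expected_interval_ms):
    """Back-fill the samples a stalled closed-loop client failed to record."""
    corrected = []
    for lat in latencies_ms:
        corrected.append(lat)
        t = lat - expected_interval_ms
        while t > 0:                     # requests that *would* have been in flight
            corrected.append(t)
            t -= expected_interval_ms
    return corrected

# ten fast samples, one 2000ms stall, ten more fast samples, expected every 50ms:
raw = [20] * 10 + [2000] + [20] * 10
fixed = correct_for_omission(raw, 50)
# the stall now contributes extra samples (1950, 1900, ..., 50), so the tail
# percentiles see the outage instead of one lonely max
```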
So which method is right for me?
This is where it is nuanced and simple at the same time, once you understand it well.
If you want to size or cost an operation, closed-loop is better, e.g. deciding which query is faster, A or B. You set concurrency to one, so there are no queuing effects. Great for development experiments.
For "can this system handle my real production load?", open-loop testing is better. It helps us answer questions like
- Will this API hold up under production load?
- Where is the capacity ceiling?
- What is UX during traffic spike or a GC pause?
We just set a target that matches the expected load and let it run, keeping an eye on both latency and sojourn time. If sojourn time starts rising, you know requests are queuing up: the system is saturating even though the server itself looks relatively free.
The difference between a good and a bad benchmark can come down to which clock you are measuring.
Reference - https://www.scylladb.com/2021/04/22/on-coordinated-omission/