Closed-Loop vs Open-Loop Benchmarking
I have been working on benchmarking lately, especially since I started tinkering with AI, and I came back to an interesting topic: closed-loop vs open-loop benchmarking. To make it easier to explain, I built a simple HTTP server to simulate a controlled environment.
This simple HTTP server responds in about 20ms most of the time, but 5% of requests take 200ms, simulating occasional slowness. Real systems have this kind of blip for various reasons: GC pauses, segment merges during indexing, cache misses, the occasional lock contention, and so on.
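I won't reproduce my full server here, but a minimal sketch of it takes only the standard library. The port, handler name, and `serve` helper are my choices for illustration; the 5% slow fraction and the 20ms/200ms values are just the numbers described above:

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

SLOW_FRACTION = 0.05      # 5% of requests hit the slow path
FAST_MS, SLOW_MS = 20, 200

def service_time_ms(rng=random):
    """Draw one request's service time: usually 20ms, occasionally 200ms."""
    return SLOW_MS if rng.random() < SLOW_FRACTION else FAST_MS

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(service_time_ms() / 1000.0)   # simulate the work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

def serve(port=8080):
    """Run the simulated endpoint (blocks forever)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```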
I wrote two simple scripts. Both hit the same endpoint on the same server, but with different approaches. One point to remember: both clients are single-threaded, just to keep it simple.
The Two Approaches
Closed-loop: sequential requests, one at a time. Fire a request and wait for it to complete before firing the next one. Open-loop: fire requests on a fixed schedule. For example, at 20 QPS, one request is sent every 50ms.
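These are not my exact scripts, but a stripped-down sketch of the two loops. `send` stands in for the blocking HTTP call (e.g. `session.get(url)`); everything else is just the bookkeeping described above:

```python
import time

def closed_loop(send, duration_s):
    """Closed loop: fire a request, wait for it, only then fire the next."""
    latencies_ms = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        t0 = time.monotonic()
        send()                                   # blocking call, e.g. session.get(url)
        latencies_ms.append((time.monotonic() - t0) * 1000)
    return latencies_ms

def open_loop(send, duration_s, qps):
    """Open loop: fire on a fixed schedule; also track sojourn (slot -> done)."""
    interval = 1.0 / qps
    start = time.monotonic()
    scheduled = start
    latencies_ms, sojourns_ms = [], []
    while scheduled < start + duration_s:
        now = time.monotonic()
        if now < scheduled:
            time.sleep(scheduled - now)          # wait for our slot
        t0 = time.monotonic()
        send()
        done = time.monotonic()
        latencies_ms.append((done - t0) * 1000)
        sojourns_ms.append((done - scheduled) * 1000)
        scheduled += interval                    # the schedule advances regardless
    return latencies_ms, sojourns_ms
```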
Here is what I observed.
Closed-loop script output
Throughput: ~18 req/s (self-regulated: 1 / 0.055s mean latency)
p50: 44ms
p95: 223ms
p99: 245ms
Mean: 55ms
Open-loop at 20 QPS output
Achieved QPS: 19 (target 20, keeping up)
Client latency
p50: 43ms
p99: 227ms (similar to closed loop)
Sojourn
p50: 1068ms
p99: 1717ms ← users waiting ~1s even at "healthy" load
Queue wait mean: 906ms
A quick glance tells you that the client throughput and latencies are similar. But look at the sojourn times: p50 = 1068ms. That seems high, right?
Let's now understand what is happening in each.
What's happening in closed loop
Closed loop regulates itself. It fires a query only when the previous one has completed, so it never pushes the system; throughput simply drops. Here is a quick formula:
throughput = concurrency / mean_latency
= 1 / 0.055s
= ~18 req/s
This, though, is not a property of the server. It is a constraint of the test method: the client, in a way, is doing the throttling, not the server. We just have one car on the road, and hence no traffic jam.
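A quick sanity check of that arithmetic. The multi-worker lines are an extrapolation of the same formula, assuming the server itself is not yet the bottleneck:

```python
mean_latency_s = 0.055       # measured mean from the closed-loop run above

# A closed-loop client with N workers tops out at N / mean_latency req/s:
for concurrency in (1, 2, 4):
    print(concurrency, "worker(s) ->", round(concurrency / mean_latency_s, 1), "req/s")
```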
So what exactly is sojourn time?
The open-loop benchmarking script has two clocks.
scheduled_time = start_time                  # advances by a fixed 50ms every loop
while time_not_up:
    sleep_until(scheduled_time)              # wait for our slot
    req_start = time.time()                  # when we actually sent
    response = session.get(url)              # blocking: wait for the response
    req_end = time.time()
    latency_ms = (req_end - req_start) * 1000       # what the server spent: ~20ms or ~200ms
    sojourn_ms = (req_end - scheduled_time) * 1000  # scheduled slot -> complete: user experience
    scheduled_time += interval               # the next slot moves forward regardless
Remember that both of our experiments are single-threaded. We could simulate this with multiple threads as well, but a single thread is easier to explain. And in open loop we wanted to fire a query every 50ms, so scheduled_time is a clock that ticks at that target rate no matter what. Here is what happens after a 200ms outlier at 20 QPS (50ms interval):
t=0ms:   slot 0ms → send req1 → server takes 200ms → done at t=200ms
         latency = 200ms   sojourn = 200ms   queue_wait = 0ms
t=200ms: slot 50ms was 150ms ago → send req2 immediately → server takes 20ms → done at t=220ms
         latency = 20ms    sojourn = 170ms   queue_wait = 150ms
t=220ms: slot 100ms was 120ms ago → send req3 immediately → server takes 20ms → done at t=240ms
         latency = 20ms    sojourn = 140ms   queue_wait = 120ms
Do you see what is happening? The script is still single-threaded, so there is no actual concurrency. If this were a real user whose request was scheduled at 50ms, they would have waited 150ms before their request was even fired. And though the server latency was 20ms, for them the request took 170ms.
Sojourn captures this gap in our script.
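The timeline above can be replayed with a tiny simulation. This is not the benchmark script itself, just a model of its single-threaded schedule:

```python
def open_loop_sim(service_times_ms, interval_ms):
    """Replay service times against a fixed schedule with one client thread."""
    free_at = 0.0                        # when the client can send the next request
    rows = []
    for i, svc in enumerate(service_times_ms):
        slot = i * interval_ms           # when the request *should* have been sent
        send = max(slot, free_at)        # late if the previous request overran
        done = send + svc
        rows.append({"slot": slot, "queue_wait": send - slot,
                     "latency": svc, "sojourn": done - slot})
        free_at = done
    return rows

# The three requests from the timeline above:
rows = open_loop_sim([200, 20, 20], interval_ms=50)
# sojourns come out as 200, 170, 140 and queue waits as 0, 150, 120
```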
OK, let's now find the capacity cliff.
Capacity Cliff
The next experiment makes this more interesting. Here we increase load levels and record both server latencies and sojourn time for each step.
Target Achieved Lat p99 Sojourn p50 Sojourn p99 Q-Wait Status
------------------------------------------------------------------------------
5 5.2 220ms 42ms 220ms 1ms ✓ OK
10 10.2 221ms 33ms 222ms 4ms ✓ OK
20 19.4 217ms 108ms 256ms 61ms ✓ OK
30 25.2 225ms 70ms 1005ms 155ms ✗ SATURATED
40 26.0 214ms 1543ms 2712ms 1467ms ✗ SATURATED
50 23.1 227ms 2679ms 5805ms 2869ms ✗ SATURATED
What we are seeing is that irrespective of our target, the achieved QPS flattens out at about 25. The server latencies stay similar, since nothing changed for the server. But look at sojourn p99: at 20 QPS it is ~256ms; at 30 QPS it jumps to ~1005ms and keeps climbing. That is the cliff, somewhere between 20 and 30 QPS. By 40 QPS even the median user waits over 1.5s for their call to get fired, and the system is saturated: it cannot go faster and just keeps accumulating debt. You will not see this kind of pressure in self-regulating closed-loop testing. It will always report healthy latencies, and your capacity estimate can be off by a large factor.
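You can see the same cliff in a pure simulation of the ideal server (20ms with 5% at 200ms gives a 29ms mean, so the ideal capacity is ~34 QPS; the real server carries extra overhead, which is why its measured cliff sits nearer 25):

```python
import random

def sojourn_p50_ms(target_qps, n=2000, seed=7):
    """Median sojourn for a single-threaded open-loop client at a target rate."""
    rng = random.Random(seed)
    interval = 1000.0 / target_qps             # ms between scheduled slots
    free_at, sojourns = 0.0, []
    for i in range(n):
        svc = 200 if rng.random() < 0.05 else 20   # the server's latency mix
        slot = i * interval
        send = max(slot, free_at)              # queue behind the previous request
        free_at = send + svc
        sojourns.append(free_at - slot)
    return sorted(sojourns)[n // 2]

for qps in (5, 10, 20, 30, 40):
    print(qps, "qps -> sojourn p50 ~", round(sojourn_p50_ms(qps)), "ms")
```

Below the cliff the median barely moves; above it, the backlog drifts upward for the whole run and the median sojourn explodes.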
Finally, let's talk about the
Spike Test
We simulate a system under load with a random spike: the endpoint runs at 20ms normally but takes a 2-second pause a few seconds into the test. Let's simulate it.
Closed-loop result:
Total requests: 278 Actual QPS: 23.1
P50: 37ms
P95: 47ms
P99: 48ms ← spike barely registers
Max: 2030ms
Open-loop result at 20 QPS:
Total requests: 241 Missed deadlines: 81 (33.6%)
Server latency P50: 23ms P99: 48ms Max: 2003ms
Sojourn time P50: 43ms P99: 1950ms Max: 2004ms
Queue wait Mean: 954ms P99: 1954ms
Closed-loop p99: 48ms. Open-loop sojourn p99: 1950ms.
Same server. Same 2-second pause. Same test duration.
This is a domino effect in play. In closed loop, only one request hit the spike (our two-second pause): the benchmark stopped measuring at exactly the moment the server stalled. This is what is known as coordinated omission: the client coordinates with the slow server and skips measuring the requests that would have been in flight. In open loop, our clock keeps ticking: about 40 requests queued up behind the 2-second delay, and their sojourn times show the full damage, with p99 jumping to ~2s for a server that normally runs at 23ms.
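One way people patch up closed-loop numbers is the correction Gil Tene built into HdrHistogram: when a sample exceeds the expected interval, back-fill the measurements the stalled client never got to take. A rough sketch of the idea (not the library's actual code):

```python
def correct_for_omission(latencies_ms, expected_interval_ms):
    """Back-fill the samples a stalled closed-loop client failed to record."""
    corrected = []
    for lat in latencies_ms:
        corrected.append(lat)
        t = lat - expected_interval_ms
        while t > 0:                     # requests that *would* have been in flight
            corrected.append(t)
            t -= expected_interval_ms
    return corrected

# ten fast samples, one 2000ms stall, ten more fast samples, expected every 50ms:
raw = [20] * 10 + [2000] + [20] * 10
fixed = correct_for_omission(raw, 50)
# the stall now contributes extra samples (1950, 1900, ..., 50), so the tail
# percentiles see the outage instead of one lonely max
```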
So which method is right for me?
This is where it is nuanced and simple at the same time, once you understand it well.
If you want to size or cost an operation, closed-loop is better, e.g. deciding which query is faster, A or B. You set concurrency to one, so there are no queuing effects. Great for development experiments.
For "can this system handle my real production load?", open-loop testing is better. It helps us answer questions like
- Will this API hold up under production load?
- Where is the capacity ceiling?
- What is UX during traffic spike or a GC pause?
We just set a target that matches the expected load and let it run, keeping an eye on both latency and sojourn time. If sojourn time starts rising, you know requests are queuing up: the system is saturating even though the server itself looks relatively free.
The difference between a good and a bad benchmark can come down to which clock you are measuring.
Reference - https://www.scylladb.com/2021/04/22/on-coordinated-omission/