Gino Kaleb, SYS ADMIN


I stopped chasing perfect TPS and started measuring what really matters: resilience, recovery, and context.

The art of letting go of control: From obsession with perfect TPS to real resilience

A little while ago I shared how HeuristicOptimizer was born and how I structured its architecture to survive Folia’s massive concurrency. Designing the system on paper (and in code) was a beautiful challenge, almost textbook-like. But if backend development teaches you anything, it’s that code compiles beautifully until you confront it with the real world. And in the context of a community Minecraft server, where the budget is limited and hardware shows no mercy, the real world hits hard.

Today I want to talk about the testing phase. That stage where your developer ego deflates and it’s time to be truly pragmatic.

When I started putting the plugin through load testing, my mind was operating under a classic and somewhat naive paradigm: performance must be flawless. I wanted to see the console showing an average of 19.5 TPS (Ticks Per Second) no matter how many players or entities were present.

I built a fairly robust automated environment. I orchestrated PowerShell scripts that started the server locally, compiled the JAR, and then released a horde of Node.js bots (Mineflayer) to simulate stress scenarios. We had everything: from a nominal load of moving players to sustained “lag bombs” for hours in what we called Phase D (Soak Testing).

The initial result in my local environment? The tests failed constantly. TPS dropped, the test aborted, and I got frustrated. I wondered whether my Java code was inefficient, whether Folia’s asynchronous threads were getting blocked, or whether my machine simply couldn’t handle it.

And then came a mindset shift that reframed the entire project: a bare "Test failed because TPS dropped below 19.5" is the technical equivalent of your mechanic telling you "the car makes a weird noise" and walking away. It gives you no context, no tools, and, worse, it hides the true nature of the problem.

The false-negative trap and the value of telemetry

Think about it this way: in my first model, if the server suffered a sudden lag spike at minute 10 of a 4-hour stress test, the orchestrator aborted execution immediately. Boom. Red in the console.

What did I learn from that? Practically nothing. I was blind to what would have happened in the next 3 hours and 50 minutes. Would the server have recovered a minute later? Was it an isolated spike caused by the Garbage Collector, or the beginning of a progressive and lethal memory leak? I would never know because the system killed the test before letting the plugin try to do its job.

I realized that in real systems administration, stress spikes are not a bug; they are a structural feature. Players will build massive farms and trigger absurd redstone mechanisms all at once. Trying to make a resource-limited server absorb that without flinching is a technical utopia. What really matters is not avoiding the hit, but how fast you get back up after taking it.

This led me to redesign the QA strategy and implement a Dual-Gate Model:

The Operational Gate (Blocking): The basics. The server must not crash under nominal loads and must not spit out critical errors. If the server shuts down, the test fails and stops.

The Resilience Gate (Observational): Here’s where the magic is. It runs in shadow mode. If a massive spike hits and TPS collapses, the test does not stop. Instead, it starts taking notes frantically.
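The two gates above can be sketched in a few lines of Java. This is a minimal illustration of the idea, not the actual HeuristicOptimizer harness; the names (DualGate, TickSample, GateResult) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the Dual-Gate Model: a blocking operational gate and
// an observational resilience gate. All names here are illustrative.
public class DualGate {

    // One measurement per tick-sample: when it was taken, the TPS reading,
    // and whether the server was still alive / error-free at that moment.
    public record TickSample(long timestampMs, double tps,
                             boolean serverAlive, boolean criticalError) {}

    public enum GateResult { CONTINUE, ABORT }

    private final List<TickSample> observations = new ArrayList<>();
    private final double tpsFloor; // e.g. 19.5 for CI, 12.0 for the local profile

    public DualGate(double tpsFloor) {
        this.tpsFloor = tpsFloor;
    }

    public GateResult onSample(TickSample s) {
        // Operational gate (blocking): a dead server or a critical error
        // is still a hard failure that stops the test immediately.
        if (!s.serverAlive() || s.criticalError()) {
            return GateResult.ABORT;
        }
        // Resilience gate (observational, "shadow mode"): a TPS collapse
        // no longer aborts the run; we just take notes for later analysis.
        if (s.tps() < tpsFloor) {
            observations.add(s);
        }
        return GateResult.CONTINUE;
    }

    public List<TickSample> degradedSamples() {
        return List.copyOf(observations);
    }
}
```

The key design point is that only the first branch can ever return ABORT: degradation is data, not a verdict.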

Why does this work a thousand times better?

Technically, this observational model gives us pure telemetry instead of a binary verdict. Now we measure three vital things:

Peak Degradation (%): How much did the hit hurt? (e.g., "TPS dropped 44% compared to normal.")

Recovery Latency: The most important stopwatch. How long does HeuristicOptimizer take to realize the disaster, apply its policies (like pausing mob AI in a radius or reducing player synchronization), and bring TPS back to 90% of its capacity? Was it 10 seconds or 5 minutes?

Soak Degraded (%): During a 4-hour test, did we spend 30% of the time suffocating, or were there only intermittent drops?
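Given a series of TPS samples (say, one reading per second), all three metrics reduce to simple arithmetic. Here is a hedged Java sketch under that assumption; method names and the one-sample-per-second convention are mine, not the project's.

```java
import java.util.List;

// Illustrative computation of the three telemetry metrics described above,
// assuming one TPS reading per second of test time (index = seconds elapsed).
public final class ResilienceMetrics {

    // Peak Degradation (%): how far below the baseline the worst sample fell.
    public static double peakDegradationPct(List<Double> tps, double baseline) {
        double min = tps.stream().mapToDouble(Double::doubleValue).min().orElse(baseline);
        return (baseline - min) / baseline * 100.0;
    }

    // Recovery Latency: seconds from the first drop below the floor until
    // TPS is back at 90% of baseline capacity.
    public static int recoveryLatencySec(List<Double> tps, double floor, double baseline) {
        int start = -1;
        for (int i = 0; i < tps.size(); i++) {
            if (start < 0 && tps.get(i) < floor) {
                start = i; // the spike begins
            } else if (start >= 0 && tps.get(i) >= 0.9 * baseline) {
                return i - start; // recovered
            }
        }
        return start < 0 ? 0 : tps.size() - start; // never recovered within the run
    }

    // Soak Degraded (%): share of the whole run spent below the floor.
    public static double soakDegradedPct(List<Double> tps, double floor) {
        long degraded = tps.stream().filter(t -> t < floor).count();
        return 100.0 * degraded / tps.size();
    }
}
```

For a baseline of 20 TPS and samples like 20, 20, 11, 12, 19, 20, ..., the worst sample (11) gives a 45% peak degradation, and recovery is counted from the 11 until the first reading back above 18.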

On top of that, this approach resolved an existential hardware problem. I develop and test on my local machine, which doesn't have the multicore power of our production environment. The old model demanded CI-grade (Continuous Integration) performance at home, which generated "failures" all the time. By taking these thresholds out of the scripts and moving them into a declarative JSON contract, we created SLO (Service Level Objective) Profiles.

Now my environment knows that in the local profile, keeping 12 TPS under a lag bomb and recovering in less than 300 seconds is a huge success. In production (ci), the bar rises to 19.5 TPS. We stopped fighting hardware and started measuring software adaptability.
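A contract like that could look roughly like the following. The only numbers taken from this post are the local profile's 12 TPS floor and 300-second recovery budget and the 19.5 TPS bar for ci; every field name and every other value is an illustrative guess, not the actual schema.

```json
{
  "profiles": {
    "local": {
      "tpsFloor": 12.0,
      "maxRecoverySec": 300,
      "maxSoakDegradedPct": 30
    },
    "ci": {
      "tpsFloor": 19.5,
      "maxRecoverySec": 60,
      "maxSoakDegradedPct": 10
    }
  }
}
```

The point of the declarative form is that the orchestration scripts stay identical everywhere; only the profile they load changes per environment.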

The code no longer tries to be a rigid stone wall that eventually cracks, but a shock absorber that yields, absorbs impact, and recovers autonomously. And honestly, having a CSV table detailing exactly how your code fought and survived a stress spike is far more useful (and satisfying) than seeing a simple green “Test Passed” text.

In the coming weeks I’ll be consolidating evidence from these tests to decide whether we finally give it the green light for production. Wish me luck.