Professional Services Automated Build & Test

A layered testing methodology for a real-time game

Call of Orion is a Python/Arcade space-survival game with thousands of moving parts — physics, combat AI, pathfinding, save/load, and a live render loop. Keeping it stable takes a deliberate test pyramid: a wide base of fast unit tests, a middle tier of headless integration and performance gates that assert real frame-rate thresholds, and a narrow top of multi-minute soak runs — all bug-focused and green before anything merges.

Fast unit tests: 2,767
Integration & soak tests: 466
Total: 3,233, zero failures
Fast-suite runtime: ~2.5 min

Why a pyramid

A game's logic and its render loop fail in different ways, so they need different tests. Pure logic — damage routing, inventory math, A* pathfinding, save round-trips — is covered by a huge, fast unit suite that runs in a couple of minutes and pinpoints regressions precisely. Anything that depends on a real Arcade window — frame timing, GPU rendering, full-scene behaviour — moves up into headless integration and performance tests. And slow-burn problems like memory growth or frame-rate decay only show up under sustained load, so they get their own soak tier.
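As an illustration of what a base-tier test can look like, here is a minimal sketch of a save round-trip check in pytest style. The `Inventory` class, its methods, and the item names are hypothetical stand-ins invented for this example; they are not the game's actual code. The point is only that the test needs no window and touches nothing but plain data.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class Inventory:
    """Hypothetical stand-in for a game inventory (not the real class)."""

    slots: dict[str, int] = field(default_factory=dict)

    def add(self, item: str, count: int) -> None:
        # Reject nonsensical amounts so the inventory math stays honest.
        if count <= 0:
            raise ValueError("count must be positive")
        self.slots[item] = self.slots.get(item, 0) + count

    def to_json(self) -> str:
        # Serialise to a stable JSON string, as a save file might.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "Inventory":
        # Rebuild the inventory from the saved JSON string.
        return cls(**json.loads(payload))


def test_inventory_save_round_trip() -> None:
    inv = Inventory()
    inv.add("iron_ore", 12)
    inv.add("iron_ore", 3)
    inv.add("repair_kit", 1)

    restored = Inventory.from_json(inv.to_json())

    # Saving and loading must not change any stack size.
    assert restored == inv
    assert restored.slots["iron_ore"] == 15
```

Because a test like this exercises pure data, it runs in well under a millisecond, which is what makes a base of thousands of them practical.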

The test pyramid

[Figure: The Call of Orion test pyramid. The wide base is 2,767 fast unit tests (physics, combat, AI, inventory, pathfinding, save/load) that run in about two and a half minutes. The middle tier is integration and performance tests with full-frame FPS thresholds, GPU microbenchmarks, six resolution presets, and real GameView flows. The narrow top is five-minute soak and endurance runs measuring FPS and RSS stability. A bug-focused ruff lint gate (undefined names, mutable defaults, closure bugs) sits underneath. The base is the most numerous and fastest; the top is the broadest and slowest.]
Three tiers plus a lint gate: many fast, focused tests at the base; a few broad, slow ones at the top.

What each tier covers

| Tier | Scope | What it proves |
| --- | --- | --- |
| Fast unit | Isolated logic, no window | Player physics, weapons & melee arcs, asteroids, alien AI, pickups, blueprints, shields, damage routing, buildings, ship modules & AI-pilot behaviour, drones & A* pathing, inventory math, fog of war, and save/restore round-trips all behave exactly as specified. |
| Integration + performance | Real Arcade window (headless) | Full-frame FPS holds above threshold across all three zones, trade and combat scenes, AI-pilot fleets and station shields; GPU rendering microbenchmarks and all six resolution presets stay within budget. |
| Soak / endurance | 5-minute sustained sessions | FPS and resident memory (RSS) stay flat over time, with no leaks and no frame-rate decay, across idle, combat churn, dialogue, station-shield cycles, and Star-Maze pressure. |
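One way to turn "stays flat over time" into an assertion is to fit a straight line through the samples collected during a soak run and bound its slope. The sketch below assumes evenly spaced samples; the function names and the default limits of 1 MB/min and 1 FPS/min are placeholders chosen for illustration, not the project's actual thresholds or method.

```python
from typing import Sequence


def slope_per_minute(samples: Sequence[float], interval_s: float) -> float:
    """Least-squares slope of evenly spaced samples, in units per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = [i * interval_s / 60.0 for i in range(n)]  # sample times in minutes
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def rss_is_flat(rss_mb: Sequence[float], interval_s: float,
                max_growth_mb_per_min: float = 1.0) -> bool:
    # Memory may shrink freely; only sustained growth counts as a leak.
    return slope_per_minute(rss_mb, interval_s) <= max_growth_mb_per_min


def fps_is_flat(fps: Sequence[float], interval_s: float,
                max_decay_fps_per_min: float = 1.0) -> bool:
    # Frame rate may improve freely; only sustained decay counts as a failure.
    return slope_per_minute(fps, interval_s) >= -max_decay_fps_per_min
```

A slope test is less sensitive to a single noisy sample than comparing the first and last readings, which matters when garbage collection or scene changes cause short spikes during a five-minute run.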

Fast by default, slow on demand. The default pytest run executes only the fast unit suite; the window-bound integration and soak tests are opt-in, because a shared Arcade window pollutes other tests' window-size math and each one is comparatively slow. Developers get a tight feedback loop locally; the full multi-hour suite runs as the pre-merge gate.
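An opt-in split like this is commonly built with pytest markers plus command-line flags in `conftest.py`. The following is a minimal sketch of that pattern, not the project's actual configuration: the marker names `integration` and `soak` and the flags `--run-integration` and `--run-soak` are assumptions made for the example.

```python
# conftest.py (sketch)
import pytest

# Hypothetical marker names mapped to the flag that enables each tier.
OPT_IN_TIERS = {
    "integration": "--run-integration",
    "soak": "--run-soak",
}


def pytest_addoption(parser):
    # Register one boolean flag per opt-in tier.
    for tier, flag in OPT_IN_TIERS.items():
        parser.addoption(flag, action="store_true", default=False,
                         help=f"also run tests marked '{tier}'")


def pytest_configure(config):
    # Declare the markers so pytest does not warn about unknown marks.
    for tier in OPT_IN_TIERS:
        config.addinivalue_line(
            "markers", f"{tier}: window-bound test, skipped unless opted in")


def pytest_collection_modifyitems(config, items):
    # Skip every test in a tier whose flag was not passed.
    for tier, flag in OPT_IN_TIERS.items():
        if config.getoption(flag):
            continue
        skip = pytest.mark.skip(reason=f"needs {flag}")
        for item in items:
            if tier in item.keywords:
                item.add_marker(skip)
```

Under these assumptions a plain `pytest` run executes only unmarked tests, while `pytest --run-integration --run-soak` enables everything. Tests opt into a tier with a decorator such as `@pytest.mark.soak`.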

The quality gate

[Figure: The pre-merge quality gate. A code change (pull request) passes through a ruff lint check with bug-focused rules, the fast unit suite (2,767 tests, ~2.5 min), the integration and performance suite (FPS thresholds), and the soak suite (5-minute endurance runs). Only an all-green run is allowed to merge; any red gate blocks the merge. The most recent full cycle ran 3,233 / 3,233 green.]
Lint → fast unit → integration/performance → soak. Each result is recorded; only an all-green run merges.
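The gate sequence can be sketched as a small driver script that runs each stage, records its result, and allows a merge only when every stage passed. This is an illustrative sketch rather than the project's tooling: the pytest marker names `integration` and `soak` are hypothetical, and running every gate even after an earlier one fails is an assumption made so that each result gets recorded.

```python
from __future__ import annotations

import subprocess
import sys
from typing import Callable, Sequence

# Gate name and command, in pipeline order. Marker names are hypothetical.
GATES: list[tuple[str, list[str]]] = [
    ("ruff lint", ["ruff", "check", "."]),
    ("fast unit", [sys.executable, "-m", "pytest"]),
    ("integration + perf", [sys.executable, "-m", "pytest", "-m", "integration"]),
    ("soak", [sys.executable, "-m", "pytest", "-m", "soak"]),
]


def _run_command(cmd: Sequence[str]) -> int:
    # Run one gate as a subprocess and return its exit code (0 means green).
    return subprocess.run(list(cmd)).returncode


def run_gates(gates: Sequence[tuple[str, Sequence[str]]],
              run: Callable[[Sequence[str]], int] = _run_command,
              ) -> list[tuple[str, bool]]:
    # Run every gate in order and record (name, passed) for each one.
    return [(name, run(cmd) == 0) for name, cmd in gates]


def can_merge(results: Sequence[tuple[str, bool]]) -> bool:
    # Any red gate blocks the merge.
    return all(passed for _, passed in results)


if __name__ == "__main__":
    outcome = run_gates(GATES)
    for name, passed in outcome:
        print(f"{name}: {'green' if passed else 'RED'}")
    sys.exit(0 if can_merge(outcome) else 1)
```

Passing the runner in as a parameter keeps the gate logic itself testable without launching ruff or pytest.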

Linting is treated as a bug gate, not as style policing. The rule set is deliberately narrow, targeting the failure classes that have actually caused crashes (undefined names, variables used before assignment, mutable default arguments, and loop-variable closure bugs) without drowning the signal in whitespace nits. Every full cycle is written up with totals, durations, and any anomalies, so the suite's health is auditable over time.
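Two of these failure classes are easy to miss in review because the code looks correct. The sketch below shows each bug next to a fix; the function names are invented for illustration. The source does not list rule codes, but in ruff's catalogue these patterns appear to correspond to B006 (mutable default argument) and B023 (function uses loop variable), with F821 and F823 covering undefined names and variables referenced before assignment.

```python
def add_cargo_buggy(item, hold=[]):
    # Bug: the default list is created once and shared across calls.
    hold.append(item)
    return hold


def add_cargo_fixed(item, hold=None):
    # Fix: create a fresh list per call when none is supplied.
    if hold is None:
        hold = []
    hold.append(item)
    return hold


def make_labels_buggy(names):
    callbacks = []
    for name in names:
        # Bug: every lambda reads `name` at call time, after the loop ended.
        callbacks.append(lambda: f"dock:{name}")
    return callbacks


def make_labels_fixed(names):
    callbacks = []
    for name in names:
        # Fix: bind the current value as a default argument.
        callbacks.append(lambda name=name: f"dock:{name}")
    return callbacks
```

With the buggy versions, a second call to `add_cargo_buggy` returns the first call's item as well, and every callback from `make_labels_buggy` reports the last name in the list. Neither raises an exception on its own, which is why catching them statically is worthwhile.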

Want a test suite that catches regressions before your users do?

Layered, bug-focused testing — from fast unit checks to performance and endurance gates — built for whatever you ship.