# Thread isolation modes — empirical comparison
This page answers the question “do I always need subprocess-isolated solves?” with the actual data.

Short answer: no. In-process is the right default; subprocess becomes the right call when the host is shared / oversubscribed. The autotune gate (`subprocess_isolated="auto"`, threshold 50 000 DOFs) is the recommended setting — micro-solves stay in-process, large solves on a shared host get the isolation win.
## Setup
Same workload as the PBS rotor cyclic-modal sweep at 8 threads: PBS rotor sector, full cyclic-modal harmonic sweep (k = 0..10, N = 20 sectors, 4 modes per harmonic), eigen=PRIMME. All combos target 8 threads via `OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8`.
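A minimal sketch of that thread-cap setup, assuming the caps are applied from Python before the numerical stack is imported (the bench script may equally set them in the shell):

```python
import os

# The caps must be in the environment before numpy / MKL / OpenBLAS load,
# because each threading pool reads its variable once at library init time.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "8"

import numpy as np  # noqa: E402  (imported only after the caps are set)
```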
Three isolation modes per backend:
| mode | affinity / spawn | thread isolation |
|---|---|---|
| `in_process_no_pin` | process can run on all 32 cores | none — pools pile up across the whole machine |
| `in_process_affinity_pin` | process pinned to 8 cores | partial — pools pile up onto 8 cores (context switching) |
| `subprocess` | fresh process spawned per solve | full — BLAS/MUMPS init at right size; ≤ N threads |
Three linear backends: `mumps`, `pardiso`, `mkl_direct`.
Caveat: host load varied across the run (32-core executor shared with browser / Claude Code session). See per-row notes.
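For orientation, the sweep has roughly the following shape; the driver function, record fields, and output filename are placeholders for illustration, not the contents of `perf/bench_isolation_modes.py`:

```python
import itertools
import json
import time

BACKENDS = ("mumps", "pardiso", "mkl_direct")
MODES = ("in_process_no_pin", "in_process_affinity_pin", "subprocess")


def run_cyclic_modal_sweep(linear: str, mode: str) -> None:
    """Placeholder for the PBS cyclic-modal sweep (k = 0..10, 4 modes each)."""


records = []
for linear, mode in itertools.product(BACKENDS, MODES):
    t0 = time.perf_counter()
    run_cyclic_modal_sweep(linear, mode)
    records.append({"linear": linear, "mode": mode,
                    "wall_s": round(time.perf_counter() - t0, 1)})

with open("isolation_matrix.json", "w") as fh:
    json.dump(records, fh, indent=2)
```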
## Results
| linear | mode | wall (s) | peak RSS (MiB) | live thr (p95) | cores in affinity |
|---|---|---|---|---|---|
| `mumps` | `in_process_no_pin` | 632.6 | 33,978 | 23 | 32 |
| `mumps` | `in_process_affinity_pin` | 579.1 | 33,830 | 23 | 8 |
| `mumps` | `subprocess` | 459.8 | 225 (parent-only) | 16 (parent-only) | 32 |
| `pardiso` | `in_process_no_pin` | 439.7 | 24,938 | 23 | 32 |
| `pardiso` | `in_process_affinity_pin` | 576.8 | 24,914 | 23 | 8 |
| `pardiso` | `subprocess` | 440.4 | 224 (parent-only) | 16 (parent-only) | 32 |
| `mkl_direct` | `in_process_no_pin` | 423.2 | 24,634 | 23 | 32 |
| `mkl_direct` | `in_process_affinity_pin` | 563.1 | 24,635 | 23 | 8 |
| `mkl_direct` | `subprocess` | 427.6 | 224 (parent-only) | 16 (parent-only) | 32 |
The “subprocess” rows show RSS ≈ 225 MiB because that’s the parent harness’s resident memory (sampling thread + numpy + threading helper); the child runs in its own ~24–34 GB process, which is not visible to the parent’s `getrusage`.
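Why the parent-only numbers look this way, assuming the harness samples its own `ru_maxrss` (the exact sampling code is not shown on this page):

```python
import resource


def peak_rss_mib() -> float:
    """Peak resident set size of the *current* process (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


# RUSAGE_SELF never includes a spawned child's ~24-34 GB working set, and
# RUSAGE_CHILDREN only covers children that have already been wait()ed for,
# so a live-sampling parent reports only its own ~225 MiB.
print(f"parent peak RSS: {peak_rss_mib():.0f} MiB")
```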
## Patterns
`subprocess` matches or beats `no_pin` in all but one case, and never loses by more than ~1 % even on a quiet host (the ratios are checked numerically in the sketch below):

- `mumps`: subprocess 460 s vs no_pin 633 s — 1.38× faster on a busy host (clean init dominates spawn cost)
- `pardiso`: subprocess 440 s ≈ no_pin 440 s — tied
- `mkl_direct`: subprocess 428 s ≈ no_pin 423 s — within 1 %
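These ratios follow directly from the wall-time column; a quick check, with the seconds copied from the results table above:

```python
# Wall times (s) copied from the results table.
wall = {
    ("mumps", "no_pin"): 632.6,      ("mumps", "subprocess"): 459.8,
    ("pardiso", "no_pin"): 439.7,    ("pardiso", "subprocess"): 440.4,
    ("mkl_direct", "no_pin"): 423.2, ("mkl_direct", "subprocess"): 427.6,
}
for backend in ("mumps", "pardiso", "mkl_direct"):
    ratio = wall[(backend, "no_pin")] / wall[(backend, "subprocess")]
    print(f"{backend:10s}  no_pin / subprocess = {ratio:.2f}x")
# -> 1.38x, 1.00x, 0.99x
```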
`affinity_pin` alone is the slowest mode for `pardiso` and `mkl_direct` (~30 % slower than either `no_pin` or `subprocess`); only the busy-host `mumps` no_pin run is slower still. The pin gives a strict core cap but still lets the multi-pool pile-up happen — 23 threads competing for 8 cores costs context-switch overhead.
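Whether the affinity-pin option uses `os.sched_setaffinity` specifically is an assumption; it is the standard Linux mechanism, and it illustrates why a pin alone cannot shrink the thread pools:

```python
import os
import threading

# Restrict this process (and anything it spawns) to cores 0-7.  The pin caps
# *where* threads may run, not *how many* exist: every BLAS / OpenMP pool the
# process has already created keeps its threads, which now time-slice 8 cores.
os.sched_setaffinity(0, range(8))
print("cores in affinity:", len(os.sched_getaffinity(0)))
print("live threads     :", threading.active_count())
```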
subprocess delivers the most consistent timing (428–460 s range) regardless of host load. no_pin spans 423–633 s on the same workload across the bench window — host noise leaks directly into your wall time.
`mkl_direct` + `no_pin` (423 s) wins on a quiet, dedicated host: it saves the IPC cost (no spawn overhead, no spill). Switching to subprocess there costs ~1 % and buys the safety of consistent timing.
## Recommendations
| Scenario | Best knob |
|---|---|
| Workstation, dedicated host | Smart default (no pin, no isolation) |
| Strict core cap, single solve | Affinity pin |
| Strict thread cap, multi-tenant | Subprocess isolation |
| CI / batch host (anything shared) | Subprocess isolation (`subprocess_isolated="auto"`) |
| Tight loop of micro-solves | In-process (the `"auto"` gate keeps these in-process) |
The autotune gate (`"auto"` with `threshold_n_dofs=50_000`) is the recommended default for any code that mixes small and large solves: micro-solves stay in-process (no IPC overhead), large solves get the isolation win.
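A hedged sketch of the gate logic as documented here; this mirrors the described behaviour, not the library's actual dispatch code, and whether the comparison at exactly the threshold is inclusive is an assumption:

```python
def resolve_isolation(n_dofs: int,
                      subprocess_isolated="auto",
                      threshold_n_dofs: int = 50_000) -> str:
    """Illustrative gate only: "auto" isolates solves at or above the threshold."""
    if subprocess_isolated == "auto":
        use_child = n_dofs >= threshold_n_dofs
    else:
        use_child = bool(subprocess_isolated)
    return "subprocess" if use_child else "in_process"


assert resolve_isolation(2_000) == "in_process"     # micro-solve: no IPC cost
assert resolve_isolation(350_000) == "subprocess"   # large solve: isolation win
assert resolve_isolation(350_000, subprocess_isolated=False) == "in_process"
```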
## Why isn’t subprocess always the default?
Three reasons:

1. IPC cost on small problems — spawn + sparse-spill round-trip is ~0.5–2 s. For a 0.1 s small-modal solve that's up to 20× the solve time (see the spill sketch after this list). The `"auto"` gate makes this a non-issue, but always-on subprocess would hurt notebook / interactive use.
2. No win on dedicated hosts — when there's no host load, the thread pile-up doesn't actually slow you down (every pool gets its own physical cores). Subprocess saves you ~0–1 % — well within noise — at the cost of complexity.
3. API ergonomics — printing / logging from inside a spawned child doesn't reach the parent's stdout the same way. Defaulting to in-process keeps the obvious code path obvious.
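PR #812's actual spill mechanism isn't reproduced here; as a generic illustration of the idea, assuming scipy sparse operands, the matrix is written to a temporary `.npz` and the child receives only the path instead of a pickled payload:

```python
import os
import subprocess
import sys
import tempfile

import scipy.sparse as sp

# Build a stand-in operand and spill it to disk rather than pickling it
# across the process boundary.
K = sp.random(2_000, 2_000, density=1e-3, format="csr")
fd, path = tempfile.mkstemp(suffix=".npz")
os.close(fd)
sp.save_npz(path, K)

# The child gets nothing but a short path string on its command line.
child = ("import sys, scipy.sparse as sp; "
         "K = sp.load_npz(sys.argv[1]); "
         "print('child sees', K.shape, K.nnz, 'nnz')")
subprocess.run([sys.executable, "-c", child, path], check=True)
os.remove(path)
```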
The bench above closes the question definitively: there’s a real
isolation win on shared hosts, no win on dedicated, and the
"auto" gate threads the needle.
## Reference: MAPDL on the same workload
For context, MAPDL 22.2 Block Lanczos on the same PBS sector at strict `taskset -c 0-7`, `-np 8` SMP:

- Wall: 787.5 s
- Peak RSS: 24,506 MiB
- CP time (summed across threads): 5,045 s ⇒ ~6.4× SMP scaling
Every single femorph row in the table above beats MAPDL. The fastest combination (`mkl_direct` + no_pin on a quiet host, 423 s) is 1.86× faster than MAPDL at the same nominal 8-thread cap; the slowest pinned configuration (`mumps` + affinity_pin on a busy host, 579 s) is still 1.36× faster, and even the busy-host `mumps` no_pin outlier (633 s) comes in 1.24× faster.
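The comparison is plain arithmetic on the two wall-time columns:

```python
mapdl_wall = 787.5  # s, MAPDL 22.2 Block Lanczos at -np 8

# femorph wall times (s) from the results table above.
for label, wall in [("mkl_direct + no_pin (quiet host)", 423.2),
                    ("mumps + affinity_pin (busy host)", 579.1),
                    ("mumps + no_pin (busy host)", 632.6)]:
    print(f"{label:34s} {mapdl_wall / wall:.2f}x faster than MAPDL")
# -> 1.86x, 1.36x, 1.24x
```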
## References
- Bench script: `perf/bench_isolation_modes.py` (this PR)
- Full result data: `perf/pbs_isolation_matrix.json`
- PR #810 — affinity-pin (option 1)
- PR #811 — subprocess-isolated dispatch (option 2)
- PR #812 — sparse-arg spill (fixes the >64 KB pickle limit)