Thread isolation modes — empirical comparison#

This page answers the question “do I always need subprocess-isolated solves?” with the actual data.

Short answer: no. In-process is the right default; subprocess becomes the right call when the host is shared / oversubscribed. The autotune gate (subprocess_isolated="auto", threshold 50 000 DOFs) is the recommended setting — micro-solves stay in-process, large solves on a shared host get the isolation win.

Setup#

Same workload as the “PBS rotor — cyclic-modal sweep at 8 threads” page: PBS rotor sector, full cyclic-modal harmonic sweep (k = 0..10, N = 20 sectors, 4 modes per harmonic), eigen=PRIMME. All combinations target 8 threads via OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8.
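As a reminder of how the caps are applied, here is a minimal sketch of the equivalent setup in Python; the variable names are the ones listed above, and the only assumption is that they are exported before numpy / scipy / the solver backends are imported, since the BLAS pools typically size themselves at initialization:

```python
import os

# Cap every threading pool the solver stack can see. These should be set
# before numpy / scipy / the solver backends are imported, otherwise the
# BLAS pools may already be initialized at their default width.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "8"
```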

Three isolation modes per backend:

| mode | affinity / spawn | thread isolation |
|---|---|---|
| in_process_no_pin | process can run on all 32 cores | none — pools pile up across the whole machine |
| in_process_affinity_pin | FEMORPH_SOLVER_AFFINITY=8 → cores 0-7 | partial — pools pile up onto 8 cores (context switching) |
| subprocess | freshly spawned process; child inherits env | full — BLAS/MUMPS init at the right size; ≤ N threads |

Three linear backends: mumps, pardiso, mkl_direct.

Caveat: host load varied across the run (32-core executor shared with browser / Claude Code session). See per-row notes.
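For orientation, the shape of the matrix run is roughly the following. This is a sketch, not the actual contents of perf/bench_isolation_modes.py; run_cyclic_modal_sweep is a placeholder stub standing in for the real sweep driver:

```python
import itertools
import json
import time

BACKENDS = ("mumps", "pardiso", "mkl_direct")
MODES = ("in_process_no_pin", "in_process_affinity_pin", "subprocess")

def run_cyclic_modal_sweep(linear: str, mode: str, env: dict) -> None:
    """Placeholder for the actual sweep driver in perf/bench_isolation_modes.py."""

results = []
for linear, mode in itertools.product(BACKENDS, MODES):
    # Only the affinity-pin mode needs the extra env knob.
    env_extra = {"FEMORPH_SOLVER_AFFINITY": "8"} if mode == "in_process_affinity_pin" else {}
    t0 = time.perf_counter()
    run_cyclic_modal_sweep(linear, mode, env_extra)
    results.append({"linear": linear, "mode": mode, "wall_s": round(time.perf_counter() - t0, 1)})

with open("pbs_isolation_matrix.json", "w") as f:
    json.dump(results, f, indent=2)
```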

Results#

| linear | mode | wall (s) | peak RSS (MiB) | live thr (p95) | cores in affinity |
|---|---|---|---|---|---|
| mumps | in_process_no_pin | 632.6 | 33,978 | 23 | 32 |
| mumps | in_process_affinity_pin | 579.1 | 33,830 | 23 | 8 |
| mumps | subprocess | 459.8 | 225 (parent-only) | 16 (parent-only) | 32 |
| pardiso | in_process_no_pin | 439.7 | 24,938 | 23 | 32 |
| pardiso | in_process_affinity_pin | 576.8 | 24,914 | 23 | 8 |
| pardiso | subprocess | 440.4 | 224 (parent-only) | 16 (parent-only) | 32 |
| mkl_direct | in_process_no_pin | 423.2 | 24,634 | 23 | 32 |
| mkl_direct | in_process_affinity_pin | 563.1 | 24,635 | 23 | 8 |
| mkl_direct | subprocess | 427.6 | 224 (parent-only) | 16 (parent-only) | 32 |

The “subprocess” rows show RSS ≈ 225 MiB because that is the parent harness’s resident memory (sampling thread + numpy + threading helper); the child runs in its own ~24–34 GB process, which is not visible to the parent’s getrusage.
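The parent-only behaviour follows directly from how getrusage scopes its accounting. A minimal illustration (the MiB conversion assumes Linux, where ru_maxrss is reported in KiB):

```python
import resource

# RUSAGE_SELF covers only the calling process, so a solver child spawned for
# an isolated solve never appears in this number.
parent_peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Children are accounted for separately, and only after they have been reaped.
child_peak_mib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024

print(f"parent peak RSS: {parent_peak_mib:.0f} MiB, children: {child_peak_mib:.0f} MiB")
```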

Patterns#

  1. subprocess matches or beats no_pin in every case, never losing by more than ~1 % even on a quiet host:

    • mumps: subprocess 460 s vs no_pin 633 s — 1.38× faster on busy host (clean init dominates spawn cost)

    • pardiso: subprocess 440 s ≈ no_pin 440 s — tied

    • mkl_direct: subprocess 428 s ≈ no_pin 423 s — within 1 %

  2. affinity_pin alone is consistently slow (~26–33 % slower than the best mode for each backend). The pin gives a strict core cap but still lets the multi-pool pile-up happen: 23 threads competing for 8 cores costs context-switch overhead (see the affinity sketch after this list).

  3. subprocess delivers the most consistent timing (428–460 s range) regardless of host load. no_pin spans 423–633 s on the same workload across the bench window — host noise leaks directly into your wall time.

  4. mkl_direct + no_pin (423 s) wins on a quiet, dedicated host: it saves the IPC cost (no spawn overhead, no spill). Switching to subprocess there costs only ~1 %, in exchange for consistent timing.
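To make pattern 2 concrete, here is a minimal Linux-only sketch of why pinning does not shrink the pools. FEMORPH_SOLVER_AFFINITY=8 is the knob from option 1; reading /proc/self/status is just one way to count native threads for illustration, not something the library does:

```python
import os

# Pin the current process to cores 0-7 (roughly what FEMORPH_SOLVER_AFFINITY=8
# amounts to). This caps which cores the scheduler may use, not how many
# threads the already-initialized BLAS/MUMPS pools keep alive.
os.sched_setaffinity(0, range(8))
print(len(os.sched_getaffinity(0)))  # -> 8 cores available to the process

# Count native threads (BLAS pools included) the Linux way:
with open("/proc/self/status") as f:
    print(next(line.strip() for line in f if line.startswith("Threads:")))
    # e.g. "Threads: 23" -- still 23 threads, now competing for 8 cores
```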

Recommendations#

| Scenario | Best knob |
|---|---|
| Workstation, dedicated host | Smart default (no pin, no isolation) |
| Strict core cap, single solve | FEMORPH_SOLVER_AFFINITY=8 (option 1) |
| Strict thread cap, multi-tenant | subprocess_isolated="auto" (option 2) |
| CI / batch host (anything shared) | subprocess_isolated="auto" — best ROI |
| Tight loop of micro-solves | subprocess_isolated="auto" — gate skips spawn for n_dofs < 50_000 |

The autotune gate ("auto" with threshold_n_dofs=50_000) is the recommended default for any code that mixes small and large solves: micro-solves stay in-process (no IPC overhead), large solves get the isolation win.
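A minimal sketch of the decision the gate makes; should_isolate is a hypothetical helper written for illustration only, and just the "auto" setting and the threshold_n_dofs=50_000 default come from this page:

```python
def should_isolate(n_dofs: int, threshold_n_dofs: int = 50_000) -> bool:
    """Sketch of the subprocess_isolated="auto" gate (illustrative, not the API)."""
    # Only pay the spawn + sparse-spill cost when the solve is large enough
    # for clean thread-pool initialization to pay it back.
    return n_dofs >= threshold_n_dofs

assert not should_isolate(12_000)   # micro-solve: stays in-process, no IPC overhead
assert should_isolate(250_000)      # large solve: gets the isolation win
```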

Why isn’t subprocess always default?#

Three reasons:

  1. IPC cost on small problems — spawn + sparse-spill round-trip is ~0.5–2 s. For a 0.1 s small-modal solve that is up to a 20× overhead. The “auto” gate makes this a non-issue, but always-on subprocess would hurt notebook / interactive use.

  2. No win on dedicated hosts — when there’s no host load, the thread pile-up doesn’t actually slow you down (every pool gets its own physical cores). Subprocess saves you ~0–1 % — well within noise — at the cost of complexity.

  3. API ergonomics — printing / logging from inside a spawned child doesn’t reach the parent’s stdout the same way. Default in-process keeps the obvious code path obvious.

The bench above closes the question definitively: there’s a real isolation win on shared hosts, no win on dedicated, and the "auto" gate threads the needle.

Reference: MAPDL on the same workload#

For context, MAPDL 22.2 Block Lanczos on the same PBS sector, run under a strict taskset -c 0-7 with -np 8 (SMP):

  • Wall: 787.5 s

  • Peak RSS: 24,506 MiB

  • CP time (summed across threads): 5,045 s ⇒ ~6.4× SMP scaling

Every single femorph row in the table above beats MAPDL. The fastest combination (mkl_direct + no_pin on a quiet host, 423 s) is 1.86× faster than MAPDL at the same nominal 8-thread cap; the slowest configuration that respects that cap (mumps + affinity_pin on a busy host, 579 s) is still 1.36× faster.

References#

  • Bench script: perf/bench_isolation_modes.py (this PR)

  • Full result data: perf/pbs_isolation_matrix.json

  • PR #810 — affinity-pin (option 1)

  • PR #811 — subprocess-isolated dispatch (option 2)

  • PR #812 — sparse-arg spill (fixes the >64 KB pickle limit)