Thread isolation modes — empirical comparison#

This page answers the question “do I always need subprocess-isolated solves?” with the actual data.

Short answer: no. In-process is the right default; subprocess becomes the right call when the host is shared / oversubscribed. The autotune gate (subprocess_isolated="auto", threshold 50 000 DOFs) is the recommended setting — micro-solves stay in-process, large solves on a shared host get the isolation win.

Setup#

Same workload as the “PBS rotor — cyclic-modal sweep at 8 threads” page: PBS rotor sector, full cyclic-modal harmonic sweep (k = 0..10, N = 20 sectors, 4 modes per harmonic), eigen=PRIMME. All combinations target 8 threads via OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8.
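As a reminder of how the caps are applied, here is a minimal sketch of the equivalent setup in Python; the variable names are the ones listed above, and the only assumption is that they are exported before numpy / scipy / the solver backends are imported, since the BLAS pools typically size themselves at initialization:

```python
import os

# Cap every threading pool the solver stack can see. These should be set
# before numpy / scipy / the solver backends are imported, otherwise the
# BLAS pools may already be initialized at their default width.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "8"
```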

Three isolation modes per backend:

| mode | affinity / spawn | thread isolation |
|---|---|---|
| in_process_no_pin | process can run on all 32 cores | none — pools pile up across the whole machine |
| in_process_affinity_pin | FEMORPH_SOLVER_AFFINITY=8 → cores 0-7 | partial — pools pile up onto 8 cores (context switching) |
| subprocess | freshly spawned process; child inherits env | full — BLAS/MUMPS init at the right size; ≤ N threads |

Three linear backends: mumps, pardiso, mkl_direct.

Caveat: host load varied across the run (32-core executor shared with browser / Claude Code session). See per-row notes.
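For orientation, the shape of the matrix run is roughly the following. This is a sketch, not the actual contents of perf/bench_isolation_modes.py; run_cyclic_modal_sweep is a placeholder stub standing in for the real sweep driver:

```python
import itertools
import json
import time

BACKENDS = ("mumps", "pardiso", "mkl_direct")
MODES = ("in_process_no_pin", "in_process_affinity_pin", "subprocess")

def run_cyclic_modal_sweep(linear: str, mode: str, env: dict) -> None:
    """Placeholder for the actual sweep driver in perf/bench_isolation_modes.py."""

results = []
for linear, mode in itertools.product(BACKENDS, MODES):
    # Only the affinity-pin mode needs the extra env knob.
    env_extra = {"FEMORPH_SOLVER_AFFINITY": "8"} if mode == "in_process_affinity_pin" else {}
    t0 = time.perf_counter()
    run_cyclic_modal_sweep(linear, mode, env_extra)
    results.append({"linear": linear, "mode": mode, "wall_s": round(time.perf_counter() - t0, 1)})

with open("pbs_isolation_matrix.json", "w") as f:
    json.dump(results, f, indent=2)
```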

Results#

| linear | mode | wall (s) | peak RSS (MiB) | live thr (p95) | cores in affinity |
|---|---|---|---|---|---|
| mumps | in_process_no_pin | 632.6 | 33,978 | 23 | 32 |
| mumps | in_process_affinity_pin | 579.1 | 33,830 | 23 | 8 |
| mumps | subprocess | 459.8 | 225 (parent-only) | 16 (parent-only) | 32 |
| pardiso | in_process_no_pin | 439.7 | 24,938 | 23 | 32 |
| pardiso | in_process_affinity_pin | 576.8 | 24,914 | 23 | 8 |
| pardiso | subprocess | 440.4 | 224 (parent-only) | 16 (parent-only) | 32 |
| mkl_direct | in_process_no_pin | 423.2 | 24,634 | 23 | 32 |
| mkl_direct | in_process_affinity_pin | 563.1 | 24,635 | 23 | 8 |
| mkl_direct | subprocess | 427.6 | 224 (parent-only) | 16 (parent-only) | 32 |

The “subprocess” rows show RSS ≈ 225 MiB because that is the parent harness’s resident memory (sampling thread + numpy + threading helper); the child runs in its own ~24–34 GB process, which is not visible to the parent’s getrusage.
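The parent-only behaviour follows directly from how getrusage scopes its accounting. A minimal illustration (the MiB conversion assumes Linux, where ru_maxrss is reported in KiB):

```python
import resource

# RUSAGE_SELF covers only the calling process, so a solver child spawned for
# an isolated solve never appears in this number.
parent_peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Children are accounted for separately, and only after they have been reaped.
child_peak_mib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024

print(f"parent peak RSS: {parent_peak_mib:.0f} MiB, children: {child_peak_mib:.0f} MiB")
```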

Patterns#

  1. subprocess matches or beats no_pin in every case, never losing by more than ~1 % even on a quiet host:

    • mumps: subprocess 460 s vs no_pin 633 s — 1.38× faster on busy host (clean init dominates spawn cost)

    • pardiso: subprocess 440 s ≈ no_pin 440 s — tied

    • mkl_direct: subprocess 428 s ≈ no_pin 423 s — within 1 %

  2. affinity_pin alone is consistently slow (~26–33 % slower than the best mode for each backend). The pin gives a strict core cap but still lets the multi-pool pile-up happen: 23 threads competing for 8 cores costs context-switch overhead (see the affinity sketch after this list).

  3. subprocess delivers the most consistent timing (428–460 s range) regardless of host load. no_pin spans 423–633 s on the same workload across the bench window — host noise leaks directly into your wall time.

  4. mkl_direct + no_pin (423 s) wins on a quiet, dedicated host: it saves the IPC cost (no spawn overhead, no spill). Switching to subprocess there costs only ~1 %, in exchange for consistent timing.
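To make pattern 2 concrete, here is a minimal Linux-only sketch of why pinning does not shrink the pools. FEMORPH_SOLVER_AFFINITY=8 is the knob from option 1; reading /proc/self/status is just one way to count native threads for illustration, not something the library does:

```python
import os

# Pin the current process to cores 0-7 (roughly what FEMORPH_SOLVER_AFFINITY=8
# amounts to). This caps which cores the scheduler may use, not how many
# threads the already-initialized BLAS/MUMPS pools keep alive.
os.sched_setaffinity(0, range(8))
print(len(os.sched_getaffinity(0)))  # -> 8 cores available to the process

# Count native threads (BLAS pools included) the Linux way:
with open("/proc/self/status") as f:
    print(next(line.strip() for line in f if line.startswith("Threads:")))
    # e.g. "Threads: 23" -- still 23 threads, now competing for 8 cores
```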

Recommendations#

| Scenario | Best knob |
|---|---|
| Workstation, dedicated host | Smart default (no pin, no isolation) |
| Strict core cap, single solve | FEMORPH_SOLVER_AFFINITY=8 (option 1) |
| Strict thread cap, multi-tenant | subprocess_isolated="auto" (option 2) |
| CI / batch host (anything shared) | subprocess_isolated="auto" — best ROI |
| Tight loop of micro-solves | subprocess_isolated="auto" — gate skips spawn for n_dofs < 50_000 |

The autotune gate ("auto" with threshold_n_dofs=50_000) is the recommended default for any code that mixes small and large solves: micro-solves stay in-process (no IPC overhead), large solves get the isolation win.
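A minimal sketch of the decision the gate makes; should_isolate is a hypothetical helper written for illustration only, and just the "auto" setting and the threshold_n_dofs=50_000 default come from this page:

```python
def should_isolate(n_dofs: int, threshold_n_dofs: int = 50_000) -> bool:
    """Sketch of the subprocess_isolated="auto" gate (illustrative, not the API)."""
    # Only pay the spawn + sparse-spill cost when the solve is large enough
    # for clean thread-pool initialization to pay it back.
    return n_dofs >= threshold_n_dofs

assert not should_isolate(12_000)   # micro-solve: stays in-process, no IPC overhead
assert should_isolate(250_000)      # large solve: gets the isolation win
```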

Why isn’t subprocess always default?#

Three reasons:

  1. IPC cost on small problems — spawn + sparse-spill round-trip is ~0.5–2 s. For a 0.1 s small-modal solve that is up to a 20× overhead. The “auto” gate makes this a non-issue, but always-on subprocess would hurt notebook / interactive use.

  2. No win on dedicated hosts — when there’s no host load, the thread pile-up doesn’t actually slow you down (every pool gets its own physical cores). Subprocess saves you ~0–1 % — well within noise — at the cost of complexity.

  3. API ergonomics — printing / logging from inside a spawned child doesn’t reach the parent’s stdout the same way. Default in-process keeps the obvious code path obvious.

The bench above closes the question definitively: there’s a real isolation win on shared hosts, no win on dedicated, and the "auto" gate threads the needle.

Reference: MAPDL on the same workload#

For context, MAPDL 22.2 Block Lanczos on the same PBS sector, run under a strict taskset -c 0-7 with -np 8 (SMP):

  • Wall: 787.5 s

  • Peak RSS: 24,506 MiB

  • CP time (summed across threads): 5,045 s ⇒ ~6.4× SMP scaling

Every single femorph row in the table above beats MAPDL. The fastest combination (mkl_direct + no_pin on a quiet host, 423 s) is 1.86× faster than MAPDL at the same nominal 8-thread cap; the slowest configuration that respects that cap (mumps + affinity_pin on a busy host, 579 s) is still 1.36× faster.

References#

  • Bench script: perf/bench_isolation_modes.py (this PR)

  • Full result data: perf/pbs_isolation_matrix.json

  • PR #810 — affinity-pin (option 1)

  • PR #811 — subprocess-isolated dispatch (option 2)

  • PR #812 — sparse-arg spill (fixes the >64 KB pickle limit)