.. _isolation-mode-bench: Thread isolation modes — empirical comparison ============================================== This page answers the question *"do I always need subprocess-isolated solves?"* with the actual data. **Short answer: no.** In-process is the right default; subprocess becomes the right call when the host is shared / oversubscribed. The autotune gate (``subprocess_isolated="auto"``, threshold 50 000 DOFs) is the recommended setting — micro-solves stay in-process, large solves on a shared host get the isolation win. Setup ----- Same workload as :doc:`pbs-rotor-cyclic-comparison`: PBS rotor sector, full cyclic-modal harmonic sweep (k = 0..10, N = 20 sectors, 4 modes per harmonic), eigen=PRIMME. All combos target 8 threads via ``OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8``. Three isolation modes per backend: .. list-table:: :header-rows: 1 :widths: 30 35 35 * - mode - affinity / spawn - thread isolation * - ``in_process_no_pin`` - process can run on all 32 cores - none — pools pile up across the whole machine * - ``in_process_affinity_pin`` - ``FEMORPH_SOLVER_AFFINITY=8`` → cores 0-7 - partial — pools pile up *onto 8 cores* (context switching) * - ``subprocess`` - fresh ``spawn`` process; child inherits env - full — BLAS/MUMPS init at right size; ≤ N threads Three linear backends: ``mumps``, ``pardiso``, ``mkl_direct``. **Caveat: host load varied across the run** (32-core executor shared with browser / Claude Code session). See per-row notes. Results ------- .. list-table:: :header-rows: 1 :widths: 18 22 14 14 14 18 * - linear - mode - wall (s) - peak RSS (MiB) - live thr (p95) - cores in affinity * - mumps - in_process_no_pin - 632.6 - 33,978 - 23 - 32 * - mumps - in_process_affinity_pin - 579.1 - 33,830 - 23 - 8 * - mumps - **subprocess** - **459.8** - 225 (parent-only) - 16 (parent-only) - 32 * - pardiso - in_process_no_pin - 439.7 - 24,938 - 23 - 32 * - pardiso - in_process_affinity_pin - 576.8 - 24,914 - 23 - 8 * - pardiso - **subprocess** - **440.4** - 224 (parent-only) - 16 (parent-only) - 32 * - mkl_direct - in_process_no_pin - **423.2** - 24,634 - 23 - 32 * - mkl_direct - in_process_affinity_pin - 563.1 - 24,635 - 23 - 8 * - mkl_direct - **subprocess** - 427.6 - 224 (parent-only) - 16 (parent-only) - 32 The "subprocess" rows show RSS = ~225 MiB because that's the *parent* harness's resident memory (sampling thread + numpy + threading helper); the child runs in its own ~24-34 GB process not visible to the parent's ``getrusage``. Patterns -------- 1. **subprocess matches or beats no_pin in every case**, never loses by more than ~1 % even on a quiet host: * mumps: subprocess **460 s** vs no_pin 633 s — **1.38× faster** on busy host (clean init dominates spawn cost) * pardiso: subprocess 440 s ≈ no_pin 440 s — tied * mkl_direct: subprocess 428 s ≈ no_pin 423 s — within 1 % 2. **affinity_pin alone is consistently the slowest mode** (always ~30 % slower than no_pin or subprocess). The pin gives a strict *core* cap but lets the multi-pool pile-up still happen — 23 threads competing for 8 cores costs context-switch overhead. 3. **subprocess delivers the most consistent timing** (428–460 s range) regardless of host load. no_pin spans 423–633 s on the same workload across the bench window — host noise leaks directly into your wall time. 4. **mkl_direct + no_pin (423 s)** wins on a quiet, dedicated host. Saves the IPC cost (no spawn overhead, no spill). Lose ~1 % to subprocess for the safety of consistent timing. Recommendations --------------- .. list-table:: :header-rows: 1 :widths: 45 55 * - Scenario - Best knob * - Workstation, dedicated host - Smart default (no pin, no isolation) * - Strict core cap, single solve - ``FEMORPH_SOLVER_AFFINITY=8`` (option 1) * - Strict thread cap, multi-tenant - ``subprocess_isolated="auto"`` (option 2) * - CI / batch host (anything shared) - ``subprocess_isolated="auto"`` — best ROI * - Tight loop of micro-solves - ``subprocess_isolated="auto"`` — gate skips spawn for ``n_dofs < 50_000`` The autotune gate (``"auto"`` with ``threshold_n_dofs=50_000``) is the recommended default for any code that mixes small and large solves: micro-solves stay in-process (no IPC overhead), large solves get the isolation win. Why isn't subprocess always default? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Three reasons: 1. **IPC cost on small problems** — spawn + sparse-spill round-trip is ~0.5–2 s. For a 0.1 s small-modal that's 20×. The "auto" gate makes this a non-issue but always-on subprocess would hurt notebook / interactive use. 2. **No win on dedicated hosts** — when there's no host load, the thread pile-up doesn't actually slow you down (every pool gets its own physical cores). Subprocess saves you ~0–1 % — well within noise — at the cost of complexity. 3. **API ergonomics** — printing / logging from inside a spawned child doesn't reach the parent's stdout the same way. Default in-process keeps the obvious code path obvious. The bench above closes the question definitively: there's a real isolation win on shared hosts, no win on dedicated, and the ``"auto"`` gate threads the needle. Reference: MAPDL on the same workload ------------------------------------- For context, MAPDL 22.2 Block Lanczos on the same PBS sector at strict ``taskset -c 0-7``, ``-np 8`` SMP: * Wall: **787.5 s** * Peak RSS: 24,506 MiB * CP time (summed across threads): 5,045 s ⇒ ~6.4× SMP scaling **Every single femorph row in the table above beats MAPDL.** The fastest combination (mkl_direct + no_pin on a quiet host = 423 s) is **1.86× faster than MAPDL** at the same nominal 8 thread cap; the slowest valid configuration (mumps + affinity_pin on a busy host = 579 s) is still **1.36× faster**. References ---------- * Bench script: :file:`perf/bench_isolation_modes.py` (this PR) * Full result data: :file:`perf/pbs_isolation_matrix.json` * PR #810 — affinity-pin (option 1) * PR #811 — subprocess-isolated dispatch (option 2) * PR #812 — sparse-arg spill (fixes the >64 KB pickle limit)