.. _isolation-mode-bench:

Thread isolation modes — empirical comparison
==============================================

This page answers the question *"do I always need
subprocess-isolated solves?"* with the actual data.

**Short answer: no.**  In-process is the right default; subprocess
becomes the right call when the host is shared / oversubscribed.
The autotune gate (``subprocess_isolated="auto"``, threshold
50 000 DOFs) is the recommended setting — micro-solves stay
in-process, large solves on a shared host get the isolation win.

Setup
-----

Same workload as :doc:`pbs-rotor-cyclic-comparison`: PBS rotor
sector, full cyclic-modal harmonic sweep (k = 0..10, N = 20
sectors, 4 modes per harmonic), eigen=PRIMME.  All combos target
8 threads via ``OMP_NUM_THREADS = MKL_NUM_THREADS =
OPENBLAS_NUM_THREADS = 8``.

Three isolation modes per backend:

.. list-table::
   :header-rows: 1
   :widths: 30 35 35

   * - mode
     - affinity / spawn
     - thread isolation
   * - ``in_process_no_pin``
     - process can run on all 32 cores
     - none — pools pile up across the whole machine
   * - ``in_process_affinity_pin``
     - ``FEMORPH_SOLVER_AFFINITY=8`` → cores 0-7
     - partial — pools pile up *onto 8 cores* (context switching)
   * - ``subprocess``
     - fresh ``spawn`` process; child inherits env
     - full — BLAS/MUMPS init at right size; ≤ N threads

Three linear backends: ``mumps``, ``pardiso``, ``mkl_direct``.

**Caveat: host load varied across the run** (32-core executor
shared with browser / Claude Code session).  See per-row notes.

Results
-------

.. list-table::
   :header-rows: 1
   :widths: 18 22 14 14 14 18

   * - linear
     - mode
     - wall (s)
     - peak RSS (MiB)
     - live thr (p95)
     - cores in affinity
   * - mumps
     - in_process_no_pin
     - 632.6
     - 33,978
     - 23
     - 32
   * - mumps
     - in_process_affinity_pin
     - 579.1
     - 33,830
     - 23
     - 8
   * - mumps
     - **subprocess**
     - **459.8**
     - 225 (parent-only)
     - 16 (parent-only)
     - 32
   * - pardiso
     - in_process_no_pin
     - 439.7
     - 24,938
     - 23
     - 32
   * - pardiso
     - in_process_affinity_pin
     - 576.8
     - 24,914
     - 23
     - 8
   * - pardiso
     - **subprocess**
     - **440.4**
     - 224 (parent-only)
     - 16 (parent-only)
     - 32
   * - mkl_direct
     - in_process_no_pin
     - **423.2**
     - 24,634
     - 23
     - 32
   * - mkl_direct
     - in_process_affinity_pin
     - 563.1
     - 24,635
     - 23
     - 8
   * - mkl_direct
     - **subprocess**
     - 427.6
     - 224 (parent-only)
     - 16 (parent-only)
     - 32

The "subprocess" rows show RSS = ~225 MiB because that's the
*parent* harness's resident memory (sampling thread + numpy +
threading helper); the child runs in its own ~24-34 GB process
not visible to the parent's ``getrusage``.

Patterns
--------

1. **subprocess matches or beats no_pin in every case**, never
   loses by more than ~1 % even on a quiet host:

   * mumps: subprocess **460 s** vs no_pin 633 s — **1.38× faster**
     on busy host (clean init dominates spawn cost)
   * pardiso: subprocess 440 s ≈ no_pin 440 s — tied
   * mkl_direct: subprocess 428 s ≈ no_pin 423 s — within 1 %

2. **affinity_pin alone is consistently the slowest mode** (always
   ~30 % slower than no_pin or subprocess).  The pin gives a strict
   *core* cap but lets the multi-pool pile-up still happen — 23
   threads competing for 8 cores costs context-switch overhead.

3. **subprocess delivers the most consistent timing** (428–460 s
   range) regardless of host load.  no_pin spans 423–633 s on the
   same workload across the bench window — host noise leaks
   directly into your wall time.

4. **mkl_direct + no_pin (423 s)** wins on a quiet, dedicated host.
   Saves the IPC cost (no spawn overhead, no spill).  Lose ~1 % to
   subprocess for the safety of consistent timing.

Recommendations
---------------

.. list-table::
   :header-rows: 1
   :widths: 45 55

   * - Scenario
     - Best knob
   * - Workstation, dedicated host
     - Smart default (no pin, no isolation)
   * - Strict core cap, single solve
     - ``FEMORPH_SOLVER_AFFINITY=8`` (option 1)
   * - Strict thread cap, multi-tenant
     - ``subprocess_isolated="auto"`` (option 2)
   * - CI / batch host (anything shared)
     - ``subprocess_isolated="auto"`` — best ROI
   * - Tight loop of micro-solves
     - ``subprocess_isolated="auto"`` — gate skips spawn for
       ``n_dofs < 50_000``

The autotune gate (``"auto"`` with ``threshold_n_dofs=50_000``) is
the recommended default for any code that mixes small and large
solves: micro-solves stay in-process (no IPC overhead), large
solves get the isolation win.

Why isn't subprocess always default?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Three reasons:

1. **IPC cost on small problems** — spawn + sparse-spill round-trip
   is ~0.5–2 s.  For a 0.1 s small-modal that's 20×.  The "auto"
   gate makes this a non-issue but always-on subprocess would hurt
   notebook / interactive use.
2. **No win on dedicated hosts** — when there's no host load, the
   thread pile-up doesn't actually slow you down (every pool gets
   its own physical cores).  Subprocess saves you ~0–1 % — well
   within noise — at the cost of complexity.
3. **API ergonomics** — printing / logging from inside a spawned
   child doesn't reach the parent's stdout the same way.  Default
   in-process keeps the obvious code path obvious.

The bench above closes the question definitively: there's a real
isolation win on shared hosts, no win on dedicated, and the
``"auto"`` gate threads the needle.

Reference: MAPDL on the same workload
-------------------------------------

For context, MAPDL 22.2 Block Lanczos on the same PBS sector at
strict ``taskset -c 0-7``, ``-np 8`` SMP:

* Wall: **787.5 s**
* Peak RSS: 24,506 MiB
* CP time (summed across threads): 5,045 s ⇒ ~6.4× SMP scaling

**Every single femorph row in the table above beats MAPDL.**  The
fastest combination (mkl_direct + no_pin on a quiet host = 423 s)
is **1.86× faster than MAPDL** at the same nominal 8 thread cap;
the slowest valid configuration (mumps + affinity_pin on a busy
host = 579 s) is still **1.36× faster**.

References
----------

* Bench script: :file:`perf/bench_isolation_modes.py` (this PR)
* Full result data: :file:`perf/pbs_isolation_matrix.json`
* PR #810 — affinity-pin (option 1)
* PR #811 — subprocess-isolated dispatch (option 2)
* PR #812 — sparse-arg spill (fixes the >64 KB pickle limit)