Solver defaults and threading
=============================

This page documents *why* femorph-solver picks the defaults it does and *how*
to change them. Every default here was set after a benchmark sweep on a real
model — citations point at ``tests/perf/`` artifacts and the bench script
that produced the numbers.

.. contents:: On this page
   :local:
   :depth: 2

Smart thread default
--------------------

At import time, ``femorph_solver`` sets the BLAS / OpenMP thread cap to
``min(8, host_cores)``. The cap is applied two ways:

* ``OMP_NUM_THREADS`` / ``MKL_NUM_THREADS`` / ``OPENBLAS_NUM_THREADS`` are
  seeded so BLAS pools come up at the right size from the start.
* ``femorph_solver._threads.get_thread_limit()`` returns the same number,
  used by the ``thread_limit=None`` default in every solver entry point.

User-set env vars are **never overwritten** — if you've exported
``OMP_NUM_THREADS=2``, that wins.

CPU-affinity pin (opt-in)
~~~~~~~~~~~~~~~~~~~~~~~~~

Per-pool caps don't actually limit total threads — MKL Pardiso, OpenBLAS,
libmumps' internal pthread pool, and the Python OpenMP runtime each spawn
their *own* pool, so total live threads on the PBS-rotor cyclic bench reach
23+ even with every cap env var set to 8. The kernel happily schedules all 23
across every core on the host: the cap is "honored" per-pool, but the
*process* still uses the whole machine.

Set ``FEMORPH_SOLVER_AFFINITY`` to opt INTO a strict CPU-affinity pin (the
same mechanism ``taskset -c 0-N`` uses, applied in-process):

===========================  ==========================
``FEMORPH_SOLVER_AFFINITY``  Result
===========================  ==========================
unset / empty / ``"off"``    no pin (default)
``"8"``                      pin to first 8 cores
``"0,2,4"``                  pin to those CPU IDs
``"8-15"``                   pin to cores 8–15 (range)
``"0,4-7,12"``               mix of list + range
===========================  ==========================

Why opt-in: setting ``FEMORPH_SOLVER_NUM_THREADS=4`` is a per-pool *hint*,
not a "pin to 4 cores" demand. CI workflows and batch scripts often set the
cap as a budget signal without expecting pinning — pinning a worker process
to 4 cores while xdist legitimately needs access to more cores would deadlock
the test pool. Decoupling the cap from the pin lets users set both
intentionally.

You can also pin at runtime via :func:`femorph_solver.pin_cpu_affinity`
(``pin_cpu_affinity(8)`` ties the current process to cores 0–7). Linux-only;
macOS / Windows / restricted-syscall containers fall through as a soft no-op.
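For reference, a minimal sketch of the runtime pin. Only
:func:`femorph_solver.pin_cpu_affinity` is femorph-solver API here;
``os.sched_getaffinity`` is the stdlib call for inspecting the result::

    import os

    import femorph_solver

    # Pin this process to the first 8 cores (Linux-only; soft no-op elsewhere).
    femorph_solver.pin_cpu_affinity(8)

    # Inspect the pin through the same interface ``taskset`` uses.
    # On an otherwise unpinned host: [0, 1, 2, 3, 4, 5, 6, 7]
    print(sorted(os.sched_getaffinity(0)))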
Why 8?
~~~~~~

PBS rotor (``pbs-sector-hd.cdb``, 494,766 DOFs, MUMPS factor + ARPACK σ=0
shift-invert, ``n_modes=4``) on a 32-core host:

================  ==============  ==============  ==============
Threads (kwarg)   smart default   kwarg-only      env-only
                  (``OMP=8``)     (no env vars)   (``OMP=N``)
================  ==============  ==============  ==============
1                 35.14 s         54.59 s         50.89 s
2                 37.23 s         47.77 s         40.79 s
4                 34.38 s         42.14 s         35.58 s
**8**             **35.06 s**     41.38 s         **34.03 s**
16                34.33 s         39.81 s         34.61 s
32                36.42 s         43.64 s         36.26 s
================  ==============  ==============  ==============

Three observations:

1. **8 threads is the sweet spot.** 1 → 8 buys 1.5×; 8 → 16 buys nothing;
   32 over-subscribes the 32 cores and slows down. The eigensolve loop
   serialises through ARPACK iterations + scipy ``splu`` (cyclic-augment
   helper); only the MUMPS factor parallelises freely.
2. **kwarg-only without env vars is 5–7 s slower** than the env-var path
   because BLAS / MUMPS pools are sized at pool-init time; ``threadpoolctl``
   can shrink an existing pool but can't grow it past its init-time max. The
   smart default makes this difference disappear by setting the env vars
   before any BLAS-backed library is imported.
3. **The smart default is ``min(8, ncores)``**, so a 4-core laptop gets 4, a
   32-core workstation gets 8, and a 96-core EPYC still gets 8. Above 8 the
   structural-K factor stops scaling and the OS scheduler dominates.

Opt-out and override
~~~~~~~~~~~~~~~~~~~~

Set ``FEMORPH_SOLVER_NUM_THREADS`` *before* importing ``femorph_solver``:

====================================  ==========================
``FEMORPH_SOLVER_NUM_THREADS`` (env)  Result
====================================  ==========================
unset (default)                       ``min(8, ncores)`` — auto
``0``                                 disable cap (host cores)
``4`` (any positive integer)          cap at 4
====================================  ==========================

You can also call :func:`femorph_solver.set_thread_limit` at runtime, but
**only** the ``threadpoolctl``-managed BLAS pools will resize — MUMPS /
nested-OpenMP teams keep their init-time pool size. Always prefer the env var
if you need a fresh cap from the start.
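A minimal sketch of the env-var path, using only the names documented above;
``threadpoolctl.threadpool_info()`` is a convenient way to see which pools a
runtime resize can actually reach::

    import os

    # The cap must be exported before femorph_solver (or anything that loads
    # BLAS) is imported -- pools size themselves at init time.
    os.environ["FEMORPH_SOLVER_NUM_THREADS"] = "4"

    import femorph_solver
    from femorph_solver._threads import get_thread_limit
    from threadpoolctl import threadpool_info

    print(get_thread_limit())  # -> 4

    # Lists the BLAS / OpenMP pools threadpoolctl can manage at runtime and
    # their current sizes; pools can shrink below, but never grow past, the
    # size they were initialised with.
    for pool in threadpool_info():
        print(pool["internal_api"], pool["num_threads"])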
Linear-backend defaults
-----------------------

``Model.solve_static(linear_solver="auto")`` resolves to the fastest
installed direct backend:

========  ===========  ===================================================
Priority  Backend      Picked when…
========  ===========  ===================================================
1         **Pardiso**  ``pip install pypardiso`` succeeded (Intel MKL);
                       the auto-chain picks it when the matrix is SPD.
2         **CHOLMOD**  ``pip install scikit-sparse`` and
                       ``libsuitesparse-dev`` are present.
3         **MUMPS**    ``pip install python-mumps`` and ``libmumps`` are
                       present. Best when out-of-core (``ooc=True``) is
                       useful.
4         **UMFPACK**  ``pip install scikit-umfpack``.
5         **SuperLU**  SciPy default; no extra install.
========  ===========  ===================================================

Why this order, and per-backend gotchas:

Pardiso
~~~~~~~

* **Fastest** on Intel-class CPUs, period. Beats CHOLMOD by 20–40 % on dense
  3-D solid models because the two-pass factorisation amortises the symbolic
  step better.
* Honours ``MKL_NUM_THREADS`` at init time; ``threadpoolctl`` manages it
  correctly.
* Ships via ``pip install pypardiso``, which pulls in MKL — not
  redistributable in commercial builds. License: MKL EULA (proprietary).
* Cluster nodes often have MKL globally; check before installing the wheel.

CHOLMOD
~~~~~~~

* Competitive with Pardiso on **sparse-graph models** (beam-shell meshes);
  pays a constant fill-in overhead on dense connectivity.
* Symmetric-positive-definite only — the auto-chain falls through to MUMPS
  for indefinite K.
* GPL / LGPL (SuiteSparse) — redistributable.
* Install: ``pip install scikit-sparse`` + system ``libsuitesparse-dev``.

MUMPS
~~~~~

* **Best for out-of-core** (``Solver(K, ooc=True)``) — spills the factor to
  ``$MUMPS_TMPDIR``, so RAM stops being the ceiling. Pardiso's OOC mode is
  less battle-tested.
* On a single shared-memory node MUMPS is 1.5–3× slower than Pardiso at the
  same DOF count and uses 10–30 % more factor memory. Pick it for OOC, not
  in-core speed.
* **Ships its own OpenMP runtime** that ``threadpoolctl`` cannot cap.
  femorph-solver sets ``ICNTL(16)`` from the current ``thread_limit`` after
  ``analyze()``, so the kwarg path works without env vars (PR #798).
* **Multi-solve gotcha**: libmumps' ``JOB=-2`` (terminate) segfaults on
  large factors with ``free(): invalid next size``. ``MUMPSSolver.__del__``
  skips that call and lets the OS reclaim allocations at process exit —
  multiple solves in one process are now safe, but each large factor leaks
  until the process ends. See PR #798.
* Install: ``pip install python-mumps`` + system ``libmumps``.

UMFPACK
~~~~~~~

* Stable, mature, but no parallel factor. Slower than Pardiso / CHOLMOD on
  multi-core; useful when you want a deterministic single-threaded factor
  (CI reproducibility testing).
* Installing ``scikit-umfpack`` is fragile on slim CI images (the
  meson + swig + ``Python.h`` chain often breaks). We don't add it to the
  aggregate ``[solvers]`` extra; install manually if needed.

SuperLU
~~~~~~~

* Pure-SciPy fallback. No external dep. ~2–3× slower than the optimised
  backends; on the PBS rotor bench it also OOMs for the 12k-DOF
  cyclic-augmented operator.
* Always available — the safety net.
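For illustration, a sketch of pinning a backend instead of letting the
auto-chain resolve one. The ``Model`` constructor call and the ``model.K``
attribute are assumptions for this example; the kwargs are the ones
documented on this page::

    from femorph_solver import Model, Solver  # import paths assumed

    model = Model("pbs-sector-hd.cdb")  # hypothetical constructor call

    # Default: resolve the fastest installed backend in priority order.
    u = model.solve_static(linear_solver="auto")

    # Pin a backend explicitly, e.g. for benchmarking or CI reproducibility.
    u = model.solve_static(linear_solver="mumps")

    # Out-of-core MUMPS factor for models that exceed RAM; the factor spills
    # to $MUMPS_TMPDIR instead.
    solver = Solver(model.K, ooc=True)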
Eigen-backend defaults
----------------------

``Model.solve_modal(eigen_solver="auto")`` always resolves to **ARPACK** as
of 2026-05. PRIMME stays registered (the cyclic-modal solver uses it through
an OPinv-aware code path), but the simple modal autotune doesn't pick it. See
:doc:`pbs-rotor-comparison` for the rationale and the bench numbers that
motivated the change.

Quick version: ARPACK + a sparse direct linear backend (``mkl_direct`` /
``pardiso`` / ``mumps``) beats MAPDL Block Lanczos by 25 % on the 500k-DOF
PBS rotor at 8 threads. The ~10–20 % marginal speedup PRIMME might offer on
a narrow size band (10k–100k DOFs, when its no-OPinv Davidson path converges)
isn't worth exposing the matvec-error footgun PRIMME hits on larger /
ill-conditioned K. Users who want PRIMME explicitly can still pass
``eigen_solver="primme"``.

ARPACK
~~~~~~

* SciPy default (``scipy.sparse.linalg.eigsh``). Lanczos iteration with
  implicit restart (Sorensen 1992).
* Best for **shift-invert** (``sigma != 0``) — PRIMME's Python binding
  doesn't expose the sigma-shift mode.
* Always available; no extra dep.
* The autotune's only choice for ``solve_modal`` post-2026-05.

PRIMME
~~~~~~

* Block-Davidson with GD+k restart (Stathopoulos & McCombs 2010) — beats
  ARPACK on the small-modal sweet spot (4–10 modes) on real-world structural
  matrices, *when an OPinv is supplied*.
* Honours ``threadpoolctl`` thread limits — PRIMME calls scipy/numpy BLAS
  internally and doesn't ship its own OpenMP runtime.
* Optional dep: not in the core ``femorph-solver`` install because the
  upstream PyPI package ships **sdist only** — every ``pip install primme``
  triggers a 60–90 s source build that needs gcc + gfortran + BLAS / LAPACK
  headers. On NixOS / minimal Ubuntu images the link step fails with
  ``ld: cannot find -llapack``.

Installing PRIMME via ``femorph/pprime``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`femorph/pprime <https://github.com/femorph/pprime>`_ is the internal repo
that builds proper **manylinux_2_28** PRIMME wheels (cp310–cp314) so every
customer host can ``pip install`` without a fortran toolchain.

In-cluster (femorph kubernetes pods)::

    pip install --index-url \
        http://devpi.femorph-devpi.svc.cluster.local/root/femorph/+simple/ \
        primme

External hosts (workstations, ``executor``) — install from the GitHub
release attached to the latest tag::

    gh release download v3.2.3 --repo femorph/pprime \
        --pattern "*cp$(python -c 'import sys; print(f\"{sys.version_info.major}{sys.version_info.minor}\")')*" \
        -D /tmp/
    pip install /tmp/primme-3.2.3-cp*.whl

Dev workstation note: if you're on ``executor`` or a similar NixOS box,
install via the GitHub-release path above — the manylinux wheel bundles
libgfortran / libblas / libgomp via ``auditwheel repair``, so it Just Works
without a system fortran. The fallback (no wheel for your interpreter) is the
upstream PyPI source build::

    sudo apt install gfortran libblas-dev liblapack-dev
    pip install primme

After install, ``solve_modal`` still resolves to ARPACK (see above) — pass
``eigen_solver="primme"`` explicitly to use it. Its sweet spot is
``n_modes >= 4`` with ``sigma == 0``.

Verifying PRIMME respects thread limits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This took two tries to nail down, and the answer is more subtle than "the
cap works". 5,000-DOF banded SPD problem, ``primme.eigsh`` with
``threadpoolctl.threadpool_limits(N)`` applied at runtime, **no env vars set
at process start**:

=============  ==========  =========  ==================
Threads (cap)  wall time   CPU time   effective threads
=============  ==========  =========  ==================
1              0.28 s      1.10 s     **3.9** (!)
2              0.29 s      1.23 s     4.3
4              0.29 s      1.38 s     4.7
8              0.29 s      1.18 s     4.1
16             0.28 s      1.17 s     4.2
32             0.28 s      1.35 s     4.8
=============  ==========  =========  ==================

``threadpoolctl`` reports the BLAS pool capped at the requested size
(``pool=[1, 1]`` for cap=1) but **CPU time is 4–5× wall time regardless** —
the runtime cap isn't actually limiting parallel work.

Same problem with the same ``threadpool_limits(1)`` context, but with
``OMP_NUM_THREADS=1`` set **before** Python starts:

=====================  ==========  =========  ==================
Threads (cap)          wall time   CPU time   effective threads
=====================  ==========  =========  ==================
``OMP=1`` env + cap=1  0.29 s      0.29 s     **1.0** (clean)
=====================  ==========  =========  ==================

What's happening:

* The ``pprime`` wheel bundles ``libblas-fe34f726.so.3.8.0`` — reference
  netlib BLAS, no OpenMP / pthreads symbols (``nm -D`` confirms). PRIMME's
  dense subspace algebra is single-threaded through this bundled BLAS.
* The 4–5 effective threads come from numpy + scipy's ``libscipy_openblas``
  pool — 32 worker threads pre-spawned at process start.
  ``threadpool_limits(N)`` adjusts the *active worker count* per call, but
  cannot un-spawn the pool's pre-existing threads. Their idle scheduling and
  memory accesses still burn CPU even when the active count is supposed to
  be 1.
* Setting ``OMP_NUM_THREADS=N`` **before** ``import numpy`` initialises the
  pool with exactly N threads → no over-spawn, no phantom CPU burn. This is
  what femorph-solver's smart default does.

**Bottom line**: PRIMME does respect ``threadpoolctl``, but the realised
parallel work depends on (1) whether PRIMME's inner BLAS calls go through
scipy's OpenBLAS or the bundled netlib BLAS — pprime's wheel uses the latter
— and (2) whether OpenBLAS was initialised with the right thread cap. The
smart default + ``OMP_NUM_THREADS`` env-var seeding handles both correctly.
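To reproduce the effective-threads measurement on your own host, a minimal
sketch; the tridiagonal Laplacian is an illustrative stand-in for the bench's
banded SPD matrix, not the actual bench problem::

    import os
    import time

    # Seed the pool size before numpy / scipy load -- what the smart default
    # does for you at import time.
    os.environ.setdefault("OMP_NUM_THREADS", "1")

    import primme
    import scipy.sparse as sp

    # Banded SPD stand-in (1-D Laplacian), 5,000 DOFs.
    n = 5000
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

    wall0, cpu0 = time.perf_counter(), time.process_time()
    vals, vecs = primme.eigsh(A, k=4, which="SA")
    wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0

    # ~1.0 when the cap sticks; 4-5 with an over-spawned OpenBLAS pool.
    print(f"effective threads ~= {cpu / wall:.1f}")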
Caveat for the maintainer: if you want PRIMME itself to parallelise (not just
the scipy BLAS calls it makes), the ``pprime`` wheel would need to be rebuilt
against multi-threaded OpenBLAS or MKL. That's tracked separately; for the
typical small-modal use case (4–10 modes, sparse K) PRIMME's serial path is
already faster than ARPACK.

Iterative backends
------------------

For models that don't fit in the direct-factor RAM ceiling (roughly > 1 M
DOFs on a 32-GB workstation):

CG + AMG preconditioner
~~~~~~~~~~~~~~~~~~~~~~~

* ``Model.solve_static(linear_solver="cg+pyamg")`` — conjugate gradient with
  a classical algebraic-multigrid preconditioner via ``pyamg``.
* SPD only. Memory scales as :math:`O(N)` instead of the direct factor's
  :math:`O(N^{1.5})`.
* Costs ~3–5× the direct factor's wall time at the crossover (~1 M DOFs),
  but still solvable when the direct factor would OOM.
* Install: ``pip install pyamg``.

GMRES
~~~~~

* For non-SPD K (large rotation, contact, mixed-formulation shells).
  ``Model.solve_static(linear_solver="gmres")``.
* No preconditioner by default — converges slowly on ill-conditioned K. Use
  sparingly.

Known issues and workarounds
----------------------------

MUMPS double-free on teardown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pre-existing libmumps bug: large factors hit ``free(): invalid next size`` /
``double free or corruption (!prev)`` during the C-level cleanup.
``MUMPSSolver.__del__`` skips the libmumps ``JOB=-2`` call as of PR #798 —
multiple solves in one process are now stable, at the cost of a per-solve
memory leak (reclaimed at process exit). If you observe this in a
long-running service that never exits, switch to a subprocess pool: each
solve runs in a fresh subprocess and the OS reclaims everything when the
subprocess ends.

Pardiso threading on AMD
~~~~~~~~~~~~~~~~~~~~~~~~

MKL Pardiso on AMD CPUs can fall back to a serial path unless
``MKL_DEBUG_CPU_TYPE=5`` is exported (forces the ``avx2`` codepath). Intel
hosts are unaffected.

Eigenvalue regularisation (``ε I``) artefacts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``M_ff`` is rank-deficient (deck pins a load DOF, BEAM with no DENS, …)
the dense ``la.eigh`` fails and we add ``ε = 1e-14 · trace(K) / N`` to
``M``. Phantom modes go to :math:`\omega^2 \approx \text{trace}(K) /
\varepsilon` — very high frequency, so they don't pollute the lowest
``n_req``. But if you ask for many modes the phantoms can appear at the high
end; check ``ModalResult.frequency`` for an unphysical jump.

References
----------

* Bench data and the script that produced it:
  ``tests/perf/bench_modal_threads.py`` (PR #798).
* PBS rotor model: ``femorph/pbs-demo``.
* PRIMME wheels: ``femorph/pprime`` (in-cluster devpi or GitHub Releases).