Solver defaults and threading
=============================

This page documents *why* femorph-solver picks the defaults it does and *how*
to change them. Every default here was set after a benchmark sweep on a real
model — citations point at ``tests/perf/`` artifacts and the bench script
that produced the numbers.

.. contents:: On this page
   :local:
   :depth: 2

Smart thread default
--------------------

At import time, ``femorph_solver`` sets the BLAS / OpenMP thread cap to
``min(8, host_cores)``. The cap is applied two ways:

* ``OMP_NUM_THREADS`` / ``MKL_NUM_THREADS`` / ``OPENBLAS_NUM_THREADS`` are
  seeded so BLAS pools come up at the right size from the start.
* ``femorph_solver._threads.get_thread_limit()`` returns the same number,
  used by the ``thread_limit=None`` default in every solver entry point.

User-set env vars are **never overwritten** — if you've exported
``OMP_NUM_THREADS=2``, that wins.

CPU-affinity pin (opt-in)
~~~~~~~~~~~~~~~~~~~~~~~~~

Per-pool caps don't actually limit total threads — MKL Pardiso, OpenBLAS,
libmumps' internal pthread pool, and the Python OpenMP runtime each spawn
their *own* pool, so total live threads on the PBS-rotor cyclic bench reach
23+ even with every cap env var set to 8. The kernel happily schedules all 23
across every core on the host: the cap is "honored" per-pool, but the
*process* still uses the whole machine.

Set ``FEMORPH_SOLVER_AFFINITY`` to opt INTO a strict CPU-affinity pin (the
same mechanism ``taskset -c 0-N`` uses, applied in-process):

===========================  ==========================
``FEMORPH_SOLVER_AFFINITY``  Result
===========================  ==========================
unset / empty / ``"off"``    no pin (default)
``"8"``                      pin to first 8 cores
``"0,2,4"``                  pin to those CPU IDs
``"8-15"``                   pin to cores 8–15 (range)
``"0,4-7,12"``               mix of list + range
===========================  ==========================

Why opt-in: setting ``FEMORPH_SOLVER_NUM_THREADS=4`` is a per-pool *hint*,
not a "pin to 4 cores" demand. CI workflows and batch scripts often set the
cap as a budget signal without expecting pinning — pinning a worker process
to 4 cores while xdist legitimately needs access to more cores would deadlock
the test pool. Decoupling the cap from the pin lets users set both
intentionally.

You can also pin at runtime via :func:`femorph_solver.pin_cpu_affinity`
(``pin_cpu_affinity(8)`` ties the current process to cores 0–7). Linux-only;
macOS / Windows / restricted-syscall containers fall through as a soft no-op.
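For reference, a minimal sketch of the runtime pin. Only
:func:`femorph_solver.pin_cpu_affinity` is femorph-solver API here;
``os.sched_getaffinity`` is the stdlib call for inspecting the result::

    import os

    import femorph_solver

    # Pin this process to the first 8 cores (Linux-only; soft no-op elsewhere).
    femorph_solver.pin_cpu_affinity(8)

    # Inspect the pin through the same interface ``taskset`` uses.
    # On an otherwise unpinned host: [0, 1, 2, 3, 4, 5, 6, 7]
    print(sorted(os.sched_getaffinity(0)))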
Why 8?
~~~~~~

PBS rotor (``pbs-sector-hd.cdb``, 494,766 DOFs, MUMPS factor + ARPACK σ=0
shift-invert, ``n_modes=4``) on a 32-core host:

================  ==============  ==============  ==============
Threads (kwarg)   smart default   kwarg-only      env-only
                  (``OMP=8``)     (no env vars)   (``OMP=N``)
================  ==============  ==============  ==============
1                 35.14 s         54.59 s         50.89 s
2                 37.23 s         47.77 s         40.79 s
4                 34.38 s         42.14 s         35.58 s
**8**             **35.06 s**     41.38 s         **34.03 s**
16                34.33 s         39.81 s         34.61 s
32                36.42 s         43.64 s         36.26 s
================  ==============  ==============  ==============

Three observations:

1. **8 threads is the sweet spot.** 1 → 8 buys 1.5×; 8 → 16 buys nothing;
   32 over-subscribes the 32 cores and slows down. The eigensolve loop
   serialises through ARPACK iterations + scipy ``splu`` (cyclic-augment
   helper); only the MUMPS factor parallelises freely.
2. **kwarg-only without env vars is 5–7 s slower** than the env-var path
   because BLAS / MUMPS pools are sized at pool-init time; ``threadpoolctl``
   can shrink an existing pool but can't grow it past its init-time max. The
   smart default makes this difference disappear by setting the env vars
   before any BLAS-backed library is imported.
3. **The smart default is ``min(8, ncores)``**, so a 4-core laptop gets 4, a
   32-core workstation gets 8, and a 96-core EPYC still gets 8. Above 8 the
   structural-K factor stops scaling and the OS scheduler dominates.

Opt-out and override
~~~~~~~~~~~~~~~~~~~~

Set ``FEMORPH_SOLVER_NUM_THREADS`` *before* importing ``femorph_solver``:

====================================  ==========================
``FEMORPH_SOLVER_NUM_THREADS`` (env)  Result
====================================  ==========================
unset (default)                       ``min(8, ncores)`` — auto
``0``                                 disable cap (host cores)
``4`` (any positive integer)          cap at 4
====================================  ==========================

You can also call :func:`femorph_solver.set_thread_limit` at runtime, but
**only** the ``threadpoolctl``-managed BLAS pools will resize — MUMPS /
nested-OpenMP teams keep their init-time pool size. Always prefer the env var
if you need a fresh cap from the start.
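A minimal sketch of the env-var path, using only the names documented above;
``threadpoolctl.threadpool_info()`` is a convenient way to see which pools a
runtime resize can actually reach::

    import os

    # The cap must be exported before femorph_solver (or anything that loads
    # BLAS) is imported -- pools size themselves at init time.
    os.environ["FEMORPH_SOLVER_NUM_THREADS"] = "4"

    import femorph_solver
    from femorph_solver._threads import get_thread_limit
    from threadpoolctl import threadpool_info

    print(get_thread_limit())  # -> 4

    # Lists the BLAS / OpenMP pools threadpoolctl can manage at runtime and
    # their current sizes; pools can shrink below, but never grow past, the
    # size they were initialised with.
    for pool in threadpool_info():
        print(pool["internal_api"], pool["num_threads"])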
Linear-backend defaults
-----------------------

``Model.solve_static(linear_solver="auto")`` resolves to the fastest
installed direct backend:

========  ===========  ===================================================
Priority  Backend      Picked when…
========  ===========  ===================================================
1         **Pardiso**  ``pip install pypardiso`` succeeded (Intel MKL);
                       the auto-chain picks it when the matrix is SPD.
2         **CHOLMOD**  ``pip install scikit-sparse`` and
                       ``libsuitesparse-dev`` are present.
3         **MUMPS**    ``pip install python-mumps`` and ``libmumps`` are
                       present. Best when out-of-core (``ooc=True``) is
                       useful.
4         **UMFPACK**  ``pip install scikit-umfpack``.
5         **SuperLU**  SciPy default; no extra install.
========  ===========  ===================================================

Why this order, and per-backend gotchas:

Pardiso
~~~~~~~

* **Fastest** on Intel-class CPUs, period. Beats CHOLMOD by 20–40 % on dense
  3-D solid models because the two-pass factorisation amortises the symbolic
  step better.
* Honours ``MKL_NUM_THREADS`` at init time; ``threadpoolctl`` manages it
  correctly.
* Ships via ``pip install pypardiso``, which pulls in MKL — not
  redistributable in commercial builds. License: MKL EULA (proprietary).
* Cluster nodes often have MKL globally; check before installing the wheel.

CHOLMOD
~~~~~~~

* Competitive with Pardiso on **sparse-graph models** (beam-shell meshes);
  pays a constant fill-in overhead on dense connectivity.
* Symmetric-positive-definite only — the auto-chain falls through to MUMPS
  for indefinite K.
* GPL / LGPL (SuiteSparse) — redistributable.
* Install: ``pip install scikit-sparse`` + system ``libsuitesparse-dev``.

MUMPS
~~~~~

* **Best for out-of-core** (``Solver(K, ooc=True)``) — spills the factor to
  ``$MUMPS_TMPDIR``, so RAM stops being the ceiling. Pardiso's OOC mode is
  less battle-tested.
* On a single shared-memory node MUMPS is 1.5–3× slower than Pardiso at the
  same DOF count and uses 10–30 % more factor memory. Pick it for OOC, not
  in-core speed.
* **Ships its own OpenMP runtime** that ``threadpoolctl`` cannot cap.
  femorph-solver sets ``ICNTL(16)`` from the current ``thread_limit`` after
  ``analyze()``, so the kwarg path works without env vars (PR #798).
* **Multi-solve gotcha**: libmumps' ``JOB=-2`` (terminate) segfaults on
  large factors with ``free(): invalid next size``. ``MUMPSSolver.__del__``
  skips that call and lets the OS reclaim allocations at process exit —
  multiple solves in one process are now safe, but each large factor leaks
  until the process ends. See PR #798.
* Install: ``pip install python-mumps`` + system ``libmumps``.

UMFPACK
~~~~~~~

* Stable, mature, but no parallel factor. Slower than Pardiso / CHOLMOD on
  multi-core; useful when you want a deterministic single-threaded factor
  (CI reproducibility testing).
* Installing ``scikit-umfpack`` is fragile on slim CI images (the
  meson + swig + ``Python.h`` chain often breaks). We don't add it to the
  aggregate ``[solvers]`` extra; install manually if needed.

SuperLU
~~~~~~~

* Pure-SciPy fallback. No external dep. ~2–3× slower than the optimised
  backends; on the PBS rotor bench it also OOMs for the 12k-DOF
  cyclic-augmented operator.
* Always available — the safety net.
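For illustration, a sketch of pinning a backend instead of letting the
auto-chain resolve one. The ``Model`` constructor call and the ``model.K``
attribute are assumptions for this example; the kwargs are the ones
documented on this page::

    from femorph_solver import Model, Solver  # import paths assumed

    model = Model("pbs-sector-hd.cdb")  # hypothetical constructor call

    # Default: resolve the fastest installed backend in priority order.
    u = model.solve_static(linear_solver="auto")

    # Pin a backend explicitly, e.g. for benchmarking or CI reproducibility.
    u = model.solve_static(linear_solver="mumps")

    # Out-of-core MUMPS factor for models that exceed RAM; the factor spills
    # to $MUMPS_TMPDIR instead.
    solver = Solver(model.K, ooc=True)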
Eigen-backend defaults
----------------------

``Model.solve_modal(eigen_solver="auto")`` always resolves to **ARPACK** as
of 2026-05. PRIMME stays registered (the cyclic-modal solver uses it through
an OPinv-aware code path), but the simple modal autotune doesn't pick it. See
:doc:`pbs-rotor-comparison` for the rationale and the bench numbers that
motivated the change.

Quick version: ARPACK + a sparse direct linear backend (``mkl_direct`` /
``pardiso`` / ``mumps``) beats MAPDL Block Lanczos by 25 % on the 500k-DOF
PBS rotor at 8 threads. The ~10–20 % marginal speedup PRIMME might offer on
a narrow size band (10k–100k DOFs, when its no-OPinv Davidson path converges)
isn't worth exposing the matvec-error footgun PRIMME hits on larger /
ill-conditioned K. Users who want PRIMME explicitly can still pass
``eigen_solver="primme"``.

ARPACK
~~~~~~

* SciPy default (``scipy.sparse.linalg.eigsh``). Lanczos iteration with
  implicit restart (Sorensen 1992).
* Best for **shift-invert** (``sigma != 0``) — PRIMME's Python binding
  doesn't expose the sigma-shift mode.
* Always available; no extra dep.
* The autotune's only choice for ``solve_modal`` post-2026-05.

PRIMME
~~~~~~

* Block-Davidson with GD+k restart (Stathopoulos & McCombs 2010) — beats
  ARPACK on the small-modal sweet spot (4–10 modes) on real-world structural
  matrices, *when an OPinv is supplied*.
* Honours ``threadpoolctl`` thread limits — PRIMME calls scipy/numpy BLAS
  internally and doesn't ship its own OpenMP runtime.
* Optional dep: not in the core ``femorph-solver`` install because the
  upstream PyPI package ships **sdist only** — every ``pip install primme``
  triggers a 60–90 s source build that needs gcc + gfortran + BLAS / LAPACK
  headers. On NixOS / minimal Ubuntu images the link step fails with
  ``ld: cannot find -llapack``.

Installing PRIMME via ``femorph/pprime``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

`femorph/pprime <https://github.com/femorph/pprime>`_ is the internal repo
that builds proper **manylinux_2_28** PRIMME wheels (cp310–cp314) so every
customer host can ``pip install`` without a fortran toolchain.

In-cluster (femorph kubernetes pods)::

    pip install --index-url \
        http://devpi.femorph-devpi.svc.cluster.local/root/femorph/+simple/ \
        primme

External hosts (workstations, ``executor``) — install from the GitHub
release attached to the latest tag::

    gh release download v3.2.3 --repo femorph/pprime \
        --pattern "*cp$(python -c 'import sys; print(f\"{sys.version_info.major}{sys.version_info.minor}\")')*" \
        -D /tmp/
    pip install /tmp/primme-3.2.3-cp*.whl

Dev workstation note: if you're on ``executor`` or a similar NixOS box,
install via the GitHub-release path above — the manylinux wheel bundles
libgfortran / libblas / libgomp via ``auditwheel repair``, so it Just Works
without a system fortran. The fallback (no wheel for your interpreter) is the
upstream PyPI source build::

    sudo apt install gfortran libblas-dev liblapack-dev
    pip install primme

After install, ``solve_modal`` still resolves to ARPACK (see above) — pass
``eigen_solver="primme"`` explicitly to use it. Its sweet spot is
``n_modes >= 4`` with ``sigma == 0``.

Verifying PRIMME respects thread limits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This took two tries to nail down, and the answer is more subtle than "the
cap works". 5,000-DOF banded SPD problem, ``primme.eigsh`` with
``threadpoolctl.threadpool_limits(N)`` applied at runtime, **no env vars set
at process start**:

=============  ==========  =========  ==================
Threads (cap)  wall time   CPU time   effective threads
=============  ==========  =========  ==================
1              0.28 s      1.10 s     **3.9** (!)
2              0.29 s      1.23 s     4.3
4              0.29 s      1.38 s     4.7
8              0.29 s      1.18 s     4.1
16             0.28 s      1.17 s     4.2
32             0.28 s      1.35 s     4.8
=============  ==========  =========  ==================

``threadpoolctl`` reports the BLAS pool capped at the requested size
(``pool=[1, 1]`` for cap=1) but **CPU time is 4–5× wall time regardless** —
the runtime cap isn't actually limiting parallel work.

Same problem with the same ``threadpool_limits(1)`` context, but with
``OMP_NUM_THREADS=1`` set **before** Python starts:

=====================  ==========  =========  ==================
Threads (cap)          wall time   CPU time   effective threads
=====================  ==========  =========  ==================
``OMP=1`` env + cap=1  0.29 s      0.29 s     **1.0** (clean)
=====================  ==========  =========  ==================

What's happening:

* The ``pprime`` wheel bundles ``libblas-fe34f726.so.3.8.0`` — reference
  netlib BLAS, no OpenMP / pthreads symbols (``nm -D`` confirms). PRIMME's
  dense subspace algebra is single-threaded through this bundled BLAS.
* The 4–5 effective threads come from numpy + scipy's ``libscipy_openblas``
  pool — 32 worker threads pre-spawned at process start.
  ``threadpool_limits(N)`` adjusts the *active worker count* per call, but
  cannot un-spawn the pool's pre-existing threads. Their idle scheduling and
  memory accesses still burn CPU even when the active count is supposed to
  be 1.
* Setting ``OMP_NUM_THREADS=N`` **before** ``import numpy`` initialises the
  pool with exactly N threads → no over-spawn, no phantom CPU burn. This is
  what femorph-solver's smart default does.

**Bottom line**: PRIMME does respect ``threadpoolctl``, but the realised
parallel work depends on (1) whether PRIMME's inner BLAS calls go through
scipy's OpenBLAS or the bundled netlib BLAS — pprime's wheel uses the latter
— and (2) whether OpenBLAS was initialised with the right thread cap. The
smart default + ``OMP_NUM_THREADS`` env-var seeding handles both correctly.
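To reproduce the effective-threads measurement on your own host, a minimal
sketch; the tridiagonal Laplacian is an illustrative stand-in for the bench's
banded SPD matrix, not the actual bench problem::

    import os
    import time

    # Seed the pool size before numpy / scipy load -- what the smart default
    # does for you at import time.
    os.environ.setdefault("OMP_NUM_THREADS", "1")

    import primme
    import scipy.sparse as sp

    # Banded SPD stand-in (1-D Laplacian), 5,000 DOFs.
    n = 5000
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")

    wall0, cpu0 = time.perf_counter(), time.process_time()
    vals, vecs = primme.eigsh(A, k=4, which="SA")
    wall, cpu = time.perf_counter() - wall0, time.process_time() - cpu0

    # ~1.0 when the cap sticks; 4-5 with an over-spawned OpenBLAS pool.
    print(f"effective threads ~= {cpu / wall:.1f}")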
Caveat for the maintainer: if you want PRIMME itself to parallelise (not just
the scipy BLAS calls it makes), the ``pprime`` wheel would need to be rebuilt
against multi-threaded OpenBLAS or MKL. That's tracked separately; for the
typical small-modal use case (4–10 modes, sparse K) PRIMME's serial path is
already faster than ARPACK.

Iterative backends
------------------

For models that don't fit in the direct-factor RAM ceiling (roughly > 1 M
DOFs on a 32-GB workstation):

CG + AMG preconditioner
~~~~~~~~~~~~~~~~~~~~~~~

* ``Model.solve_static(linear_solver="cg+pyamg")`` — conjugate gradient with
  a classical algebraic-multigrid preconditioner via ``pyamg``.
* SPD only. Memory scales as :math:`O(N)` instead of the direct factor's
  :math:`O(N^{1.5})`.
* Costs ~3–5× the direct factor's wall time at the crossover (~1 M DOFs),
  but still solvable when the direct factor would OOM.
* Install: ``pip install pyamg``.

GMRES
~~~~~

* For non-SPD K (large rotation, contact, mixed-formulation shells).
  ``Model.solve_static(linear_solver="gmres")``.
* No preconditioner by default — converges slowly on ill-conditioned K. Use
  sparingly.

Known issues and workarounds
----------------------------

MUMPS double-free on teardown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Pre-existing libmumps bug: large factors hit ``free(): invalid next size`` /
``double free or corruption (!prev)`` during the C-level cleanup.
``MUMPSSolver.__del__`` skips the libmumps ``JOB=-2`` call as of PR #798 —
multiple solves in one process are now stable, at the cost of a per-solve
memory leak (reclaimed at process exit). If you observe this in a
long-running service that never exits, switch to a subprocess pool: each
solve runs in a fresh subprocess and the OS reclaims everything when the
subprocess ends.

Pardiso threading on AMD
~~~~~~~~~~~~~~~~~~~~~~~~

MKL Pardiso on AMD CPUs can fall back to a serial path unless
``MKL_DEBUG_CPU_TYPE=5`` is exported (forces the ``avx2`` codepath). Intel
hosts are unaffected.

Eigenvalue regularisation (``ε I``) artefacts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When ``M_ff`` is rank-deficient (deck pins a load DOF, BEAM with no DENS, …)
the dense ``la.eigh`` fails and we add ``ε = 1e-14 · trace(K) / N`` to
``M``. Phantom modes go to :math:`\omega^2 \approx \text{trace}(K) /
\varepsilon` — very high frequency, so they don't pollute the lowest
``n_req``. But if you ask for many modes the phantoms can appear at the high
end; check ``ModalResult.frequency`` for an unphysical jump.

References
----------

* Bench data and the script that produced it:
  ``tests/perf/bench_modal_threads.py`` (PR #798).
* PBS rotor model: ``femorph/pbs-demo``.
* PRIMME wheels: ``femorph/pprime`` (in-cluster devpi or GitHub Releases).