Solver defaults and threading#

This page documents why femorph-solver picks the defaults it does and how to change them. Every default here was set after a benchmark sweep on a real model — citations point at tests/perf/ artifacts and the bench script that produced the numbers.

Smart thread default#

At import time, femorph_solver sets the BLAS / OpenMP thread cap to min(8, host_cores). The cap is applied two ways:

  • OMP_NUM_THREADS / MKL_NUM_THREADS / OPENBLAS_NUM_THREADS are seeded so BLAS pools come up at the right size from the start.

  • femorph_solver._threads.get_thread_limit() returns the same number, used by the thread_limit=None default in every solver entry point.

User-set env vars are never overwritten — if you’ve exported OMP_NUM_THREADS=2, that wins.
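The seeding is equivalent to this minimal sketch (the helper name is hypothetical; the point is that os.environ.setdefault never overwrites a variable the user already exported, which is how "your env var wins" falls out for free):

import os

def _seed_blas_env(cap=None):
    # Hypothetical sketch of femorph_solver's import-time seeding.
    # setdefault() leaves any user-exported variable untouched.
    if cap is None:
        cap = min(8, os.cpu_count() or 1)
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ.setdefault(var, str(cap))
    return cap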

CPU-affinity pin (opt-in)#

Per-pool caps don’t actually limit total threads — MKL Pardiso, OpenBLAS, libmumps’ internal pthread pool, and the OpenMP runtime loaded into the Python process each spawn their own pool, so total live threads on the PBS-rotor cyclic bench reach 23+ even with every cap env var set to 8. The kernel happily schedules all 23 across every core on the host: the cap is honoured per-pool, but the process still uses the whole machine.

Set FEMORPH_SOLVER_AFFINITY to opt in to a strict CPU-affinity pin (the same mechanism taskset -c 0-N uses, applied in-process):

| FEMORPH_SOLVER_AFFINITY | Result |
| --- | --- |
| unset / empty / "off" | no pin (default) |
| "8" | pin to first 8 cores |
| "0,2,4" | pin to those CPU IDs |
| "8-15" | pin to cores 8–15 (range) |
| "0,4-7,12" | mix of list + range |

Why opt-in: setting FEMORPH_SOLVER_NUM_THREADS=4 is a per-pool hint, not a “pin to 4 cores” demand. CI workflows and batch scripts often set the cap as a budget signal without expecting pinning — pinning a worker process to 4 cores while xdist legitimately needs access to more cores would deadlock the test pool. Decoupling the cap from the pin lets users set both intentionally.

You can also pin at runtime via femorph_solver.pin_cpu_affinity() (pin_cpu_affinity(8) ties the current process to cores 0–7). Linux-only; on macOS, Windows, and restricted-syscall containers the call falls through as a soft no-op.
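On Linux the pin boils down to os.sched_setaffinity; a minimal sketch of that behaviour (env-var parsing omitted, function name hypothetical):

import os

def _pin_sketch(cpu_ids):
    # Soft no-op where the syscall doesn't exist (macOS, Windows) or is
    # blocked (restricted-syscall containers) — mirrors the documented behaviour.
    if not hasattr(os, "sched_setaffinity"):
        return
    try:
        os.sched_setaffinity(0, set(cpu_ids))  # 0 = the current process
    except OSError:
        pass

_pin_sketch(range(8))  # pin_cpu_affinity(8): first 8 cores, i.e. CPUs 0-7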

Why 8?#

PBS rotor (pbs-sector-hd.cdb, 494,766 DOFs, MUMPS factor + ARPACK σ=0 shift-invert, n_modes=4) on a 32-core host:

| Threads (kwarg) | smart default (OMP=8) | kwarg-only (no env vars) | env-only (OMP=N) |
| --- | --- | --- | --- |
| 1 | 35.14 s | 54.59 s | 50.89 s |
| 2 | 37.23 s | 47.77 s | 40.79 s |
| 4 | 34.38 s | 42.14 s | 35.58 s |
| 8 | 35.06 s | 41.38 s | 34.03 s |
| 16 | 34.33 s | 39.81 s | 34.61 s |
| 32 | 36.42 s | 43.64 s | 36.26 s |

Three observations:

  1. 8 threads is the sweet spot. 1 → 8 buys 1.5×; 8 → 16 buys nothing; 32 over-subscribes the 32 cores and slows down. The eigensolve loop serialises through ARPACK iterations + scipy splu (cyclic-augment helper); only the MUMPS factor parallelises freely.

  2. kwarg-only without env vars is 5–7 s slower than the env-var path because BLAS / MUMPS pools are sized at pool-init time; threadpoolctl can shrink an existing pool but can’t grow it past its init-time max (see the snippet after this list). The smart default makes this difference disappear by setting the env vars before any BLAS-loaded library imports.

  3. The smart default is min(8, ncores), so a 4-core laptop gets 4, a 32-core workstation gets 8, and a 96-core EPYC still gets 8. Above 8 the structural-K factor stops scaling and the OS scheduler dominates.
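Observation 2 is easy to watch with threadpoolctl directly — threadpool_info() reports each pool’s current size, and a limits context can only lower it, never raise it past the init-time max:

import numpy as np  # importing numpy loads the BLAS pool
from threadpoolctl import threadpool_info, threadpool_limits

def pool_sizes():
    return [(p["internal_api"], p["num_threads"]) for p in threadpool_info()]

print(pool_sizes())          # init-time sizes (whatever the env vars allowed)
with threadpool_limits(limits=2):
    print(pool_sizes())      # shrunk to 2 for the duration of the context
print(pool_sizes())          # restored afterwards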

Opt-out and override#

Set FEMORPH_SOLVER_NUM_THREADS before importing femorph_solver:

| FEMORPH_SOLVER_NUM_THREADS (env) | Result |
| --- | --- |
| unset (default) | min(8, ncores) — auto |
| 0 | disable cap (use all host cores) |
| 4 (any positive integer) | cap at 4 |

You can also call femorph_solver.set_thread_limit() at runtime, but only the threadpoolctl-managed BLAS pools will resize — MUMPS / nested-OpenMP teams keep their init-time pool size. Always prefer the env var if you need a fresh cap from the start.
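If you can’t set the variable in the shell, set it in Python before the import — a minimal sketch:

import os

# Must run before femorph_solver (or anything that loads BLAS) is imported,
# because pools are sized at init time.
os.environ["FEMORPH_SOLVER_NUM_THREADS"] = "4"

import femorph_solver  # picks up the cap at import time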

Linear-backend defaults#

Model.solve_static(linear_solver="auto") resolves to the fastest installed direct backend:

| Priority | Backend | Picked when… |
| --- | --- | --- |
| 1 | Pardiso | pip install pypardiso succeeded (Intel MKL) and the matrix is SPD. |
| 2 | CHOLMOD | pip install scikit-sparse and libsuitesparse-dev are present. |
| 3 | MUMPS | pip install python-mumps and libmumps are present. Best when out-of-core (ooc=True) is useful. |
| 4 | UMFPACK | pip install scikit-umfpack. |
| 5 | SuperLU | SciPy default; no extra install. |
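For illustration, selecting a backend from Python might look like this — the Model import path, the from_cdb loader, and the lowercase backend strings are assumptions (only linear_solver="auto" is spelled out above):

from femorph_solver import Model  # import path assumed

model = Model.from_cdb("pbs-sector-hd.cdb")  # loader name assumed

res = model.solve_static(linear_solver="auto")   # walks the priority chain above
res = model.solve_static(linear_solver="mumps")  # skip the chain, force a backend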

Why this order, and per-backend gotchas:

Pardiso#

  • Fastest on Intel-class CPUs, period. Beats CHOLMOD by 20–40 % on dense 3-D solid models because the two-pass factorisation amortises the symbolic step better.

  • Honours MKL_NUM_THREADS at init time; threadpoolctl manages it correctly.

  • Ships via pip install pypardiso which pulls in MKL — not redistributable in commercial builds. License: MKL EULA (proprietary).

  • Cluster nodes often have MKL globally; check before installing the wheel.

CHOLMOD#

  • Competitive with Pardiso on sparse-graph models (beam-shell meshes); pays a constant fill-in overhead on dense connectivity.

  • Symmetric-positive-definite only — auto-chain falls through to MUMPS for indefinite K.

  • GPL / LGPL (SuiteSparse) — redistributable.

  • Install: pip install scikit-sparse + system libsuitesparse-dev.

MUMPS#

  • Best for out-of-core (Solver(K, ooc=True)) — spills the factor to $MUMPS_TMPDIR, RAM stops being the ceiling. Pardiso’s OOC mode is less battle-tested.

  • On a single shared-memory node MUMPS is 1.5–3× slower than Pardiso at the same DOF count and uses 10–30 % more factor memory. Pick it for OOC, not in-core speed.

  • Ships its own OpenMP runtime that threadpoolctl cannot cap. femorph-solver sets ICNTL(16) from the current thread_limit after analyze() so the kwarg path works without env vars (PR #798).

  • Multi-solve gotcha: libmumps’s JOB=-2 (terminate) segfaults on large factors with free(): invalid next size. MUMPSSolver.__del__ skips that call and lets the OS reclaim allocations at process exit — multiple solves in one process are now safe but each large factor leaks until the process ends. See PR #798.

  • Install: pip install python-mumps + system libmumps.
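A hedged sketch of an out-of-core solve along those lines — Solver’s import path, the solve() method name, and the toy K / b stand-ins are assumptions:

import os
import numpy as np
import scipy.sparse as sp
from femorph_solver import Solver  # import path assumed

os.environ["MUMPS_TMPDIR"] = "/scratch/mumps"  # spill directory for the factor

# Toy SPD stand-in for an assembled stiffness matrix.
n = 100_000
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

solver = Solver(K, ooc=True)  # factor spills to $MUMPS_TMPDIR instead of RAM
x = solver.solve(b)           # solve() method name assumed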

UMFPACK#

  • Stable, mature, but no parallel factor. Slower than Pardiso / CHOLMOD on multi-core; useful when you want a deterministic single-threaded factor (CI reproducibility testing).

  • Installing scikit-umfpack is fragile on slim CI images (the meson + swig + Python.h chain often breaks). We don’t add it to the aggregate [solvers] extra; install manually if needed.

SuperLU#

  • Pure-SciPy fallback. No external dep. ~2–3× slower than the optimised backends; on the PBS rotor in our bench it also OOMs for the 12k-DOF cyclic-augmented operator.

  • Always available — the safety net.

Eigen-backend defaults#

Model.solve_modal(eigen_solver="auto") always resolves to ARPACK as of 2026-05. PRIMME stays registered (the cyclic-modal solver uses it through an OPinv-aware code path), but the simple modal autotune doesn’t pick it. See PBS rotor — solver head-to-head at 8 threads for the rationale and the bench numbers that motivated the change.

Quick version: ARPACK + a sparse direct linear backend (mkl_direct / pardiso / mumps) beats MAPDL Block Lanczos by 25 % on the 500k-DOF PBS rotor at 8 threads. The ~10–20 % marginal speedup PRIMME might offer on a narrow size band (10k–100k DOFs, when its no-OPinv Davidson path converges) isn’t worth exposing the matvec-error footgun PRIMME hits on larger / ill-conditioned K. Users who want PRIMME explicitly can still pass eigen_solver="primme".
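Requesting each backend explicitly, continuing with the model object from the linear-backend example above (solve_modal kwargs follow the text; the exact call shape is an assumption):

# "auto" resolves to ARPACK as of 2026-05; PRIMME must be named explicitly.
modal = model.solve_modal(n_modes=4, eigen_solver="auto")    # → ARPACK
modal = model.solve_modal(n_modes=4, eigen_solver="primme")  # needs the primme wheel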

ARPACK#

  • SciPy default (scipy.sparse.linalg.eigsh). Lanczos iteration with implicit restart (Sorensen 1992).

  • Best for shift-invert (sigma != 0) — PRIMME’s Python binding doesn’t expose the sigma-shift mode.

  • Always available; no extra dep.

  • The autotune’s only choice for solve_modal post-2026-05.

PRIMME#

  • Block-Davidson with GD+k restart (Stathopoulos & McCombs 2010) — beats ARPACK on the small-modal sweet spot (4–10 modes) on real-world structural matrices, when an OPinv is supplied.

  • Honours threadpoolctl thread limits — PRIMME calls scipy/numpy BLAS internally and doesn’t ship its own OpenMP runtime.

  • Optional dep: not in the core femorph-solver install because the upstream PyPI package ships sdist only — every pip install primme triggers a 60–90 s source build that needs gcc + gfortran + BLAS / LAPACK headers. On NixOS / minimal Ubuntu images the link step fails with ld: cannot find -llapack.

Installing PRIMME via femorph/pprime#

femorph/pprime is the internal repo that builds proper manylinux_2_28 PRIMME wheels (cp310–cp314) so every customer host can pip install without a fortran toolchain.

In-cluster (femorph kubernetes pods):

pip install --index-url \
    http://devpi.femorph-devpi.svc.cluster.local/root/femorph/+simple/ \
    primme

External hosts (workstations, executor) — install from the GitHub release attached to the latest tag:

gh release download v3.2.3 --repo femorph/pprime \
    --pattern "*cp$(python -c 'import sys; print(f"{sys.version_info.major}{sys.version_info.minor}")')*" \
    -D /tmp/
pip install /tmp/primme-3.2.3-cp*.whl

Dev workstation note: if you’re on executor or a similar NixOS box, install via the GitHub-release path above — the manylinux wheel bundles libgfortran / libblas / libgomp via auditwheel repair so it Just Works without a system fortran.

The fallback (no wheel for your interpreter) is the upstream PyPI source build:

sudo apt install gfortran libblas-dev liblapack-dev
pip install primme

After install, select it explicitly with eigen_solver="primme" — as of 2026-05 the solve_modal autotune resolves to ARPACK regardless (see Eigen-backend defaults above); only the cyclic-modal solver reaches PRIMME automatically, through its OPinv-aware path.

Verifying PRIMME respects thread limits#

This took two tries to nail down, and the answer is more subtle than “the cap works”.

5,000-DOF banded SPD problem, primme.eigsh with a threadpoolctl.threadpool_limits(N) context applied at runtime, no env vars set at process start:

| Threads (cap) | wall time | CPU time | effective threads |
| --- | --- | --- | --- |
| 1 | 0.28 s | 1.10 s | 3.9 (!) |
| 2 | 0.29 s | 1.23 s | 4.3 |
| 4 | 0.29 s | 1.38 s | 4.7 |
| 8 | 0.29 s | 1.18 s | 4.1 |
| 16 | 0.28 s | 1.17 s | 4.2 |
| 32 | 0.28 s | 1.35 s | 4.8 |

Threadpoolctl reports the BLAS pool capped at the requested size (pool=[1, 1] for cap=1), but CPU time is 4–5× wall time regardless — the runtime cap isn’t actually limiting parallel work. Same problem with the same threadpool_limits(1) context, but with OMP_NUM_THREADS=1 set before Python starts:

| Threads (cap) | wall time | CPU time | effective threads |
| --- | --- | --- | --- |
| OMP=1 env + cap=1 | 0.29 s | 0.29 s | 1.0 (clean) |

What’s happening:

  • The pprime wheel bundles libblas-fe34f726.so.3.8.0 — reference netlib BLAS, no OpenMP / pthreads symbols (nm -D confirms). PRIMME’s dense subspace algebra is single-threaded through this bundled BLAS.

  • The 4–5 effective threads come from numpy + scipy’s libscipy_openblas pool — 32 worker threads pre-spawned at process start. threadpool_limits(N) adjusts the active worker count per call, but cannot un-spawn the pool’s pre-existing threads. Their idle spin-waiting and memory accesses still burn CPU even when the active count is supposed to be 1.

  • Setting OMP_NUM_THREADS=N before import numpy initialises the pool with exactly N threads → no over-spawn, no phantom CPU burn. This is what femorph-solver’s smart default does.

Bottom line: PRIMME does respect threadpoolctl, but the realised parallel work depends on (1) whether PRIMME’s inner BLAS calls go through scipy’s OpenBLAS or the bundled netlib BLAS — pprime’s wheel uses the latter — and (2) whether OpenBLAS was initialised with the right thread cap. The smart default + OMP_NUM_THREADS env-var seeding handles both correctly.

Caveat for the maintainer: if you want PRIMME itself to parallelise (not just the scipy BLAS calls it makes), the pprime wheel would need to be rebuilt against multi-threaded OpenBLAS or MKL. That’s tracked separately; for the typical small-modal use case (4-10 modes, sparse K) PRIMME’s serial path is already faster than ARPACK.
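To reproduce the effective-thread measurement on your own host, something like this sketch works (the banded SPD matrix is a stand-in, not the bench’s exact one; effective threads = CPU time / wall time):

import time
import scipy.sparse as sp
import primme
from threadpoolctl import threadpool_limits

# ~5,000-DOF banded SPD stand-in (discrete biharmonic stencil).
n = 5_000
K = sp.diags([1.0, -4.0, 6.0, -4.0, 1.0], [-2, -1, 0, 1, 2],
             shape=(n, n), format="csr")

with threadpool_limits(limits=1):  # the runtime cap under test
    w0, c0 = time.perf_counter(), time.process_time()
    vals, vecs = primme.eigsh(K, k=4, which="SA")
    wall, cpu = time.perf_counter() - w0, time.process_time() - c0

print(f"wall={wall:.2f}s  cpu={cpu:.2f}s  effective threads={cpu / wall:.1f}")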

Iterative backends#

For models that don’t fit under the direct-factor RAM ceiling (roughly > 1 M DOFs on a 32-GB workstation):

CG + AMG preconditioner#

  • Model.solve_static(linear_solver="cg+pyamg") — conjugate gradient with classical algebraic multigrid preconditioner via pyamg.

  • SPD only. Memory scales as \(O(N)\) instead of the direct factor’s \(O(N^{1.5})\).

  • Costs ~3–5× the direct factor’s wall time at the crossover (~1 M DOF), but it still solves when the direct factor would OOM.

  • Install: pip install pyamg.
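A sketch of the wiring the backend name implies — a classical (Ruge–Stüben) AMG hierarchy on K handed to CG as the preconditioner; whether the real backend is wired exactly this way is an assumption, and the Poisson gallery matrix stands in for a stiffness matrix:

import numpy as np
import scipy.sparse.linalg as spla
import pyamg

K = pyamg.gallery.poisson((200, 200), format="csr")  # SPD stand-in
b = np.ones(K.shape[0])

ml = pyamg.ruge_stuben_solver(K)      # classical AMG hierarchy
M = ml.aspreconditioner(cycle="V")    # expose it as a LinearOperator for CG
x, info = spla.cg(K, b, M=M)
assert info == 0  # 0 = converged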

GMRES#

  • For non-SPD K (large rotation, contact, mixed-formulation shells). Model.solve_static(linear_solver="gmres").

  • No preconditioner by default — converges slowly on ill-conditioned K. Use sparingly.

Known issues and workarounds#

MUMPS double-free on teardown#

Pre-existing libmumps bug: large factors hit free(): invalid next size / double free or corruption (!prev) during the C-level cleanup. MUMPSSolver.__del__ skips the libmumps JOB=-2 call in PR #798 — multiple solves in one process are now stable, at the cost of per-solve memory leak (reclaimed at process exit).

If you observe this in a long-running service that never exits, switch to a subprocess pool: each solve runs in a fresh subprocess and the OS reclaims everything when the subprocess ends.
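A minimal subprocess-pool sketch (the Model import path, from_cdb loader, and job list are assumptions; results must be picklable to cross the process boundary):

from concurrent.futures import ProcessPoolExecutor

def solve_one(cdb_path):
    # Fresh interpreter per task, so the leaked MUMPS factor dies with it.
    from femorph_solver import Model  # import path assumed
    model = Model.from_cdb(cdb_path)  # loader name assumed
    return model.solve_static(linear_solver="mumps")

jobs = ["pbs-sector-hd.cdb"]  # one entry per solve

# max_tasks_per_child=1 (Python 3.11+) recycles each worker after a single solve.
with ProcessPoolExecutor(max_workers=4, max_tasks_per_child=1) as pool:
    results = list(pool.map(solve_one, jobs))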

Pardiso threading on AMD#

MKL Pardiso on AMD CPUs can fall back to a serial path unless MKL_DEBUG_CPU_TYPE=5 is exported (forces the AVX2 code path). Doesn’t affect Intel. Note that MKL 2020 update 1 removed the variable, so the workaround only applies to older MKL builds.

Eigenvalue regularisation (ε I) artefacts#

When M_ff is rank-deficient (deck pins a load DOF, BEAM with no DENS, …) the dense la.eigh fails, so we add ε = 1e-14 · trace(K) / N to M. Phantom modes land at \(\omega^2 \approx \operatorname{trace}(K) / \varepsilon\) — very high frequency, so they don’t pollute the lowest n_req. But if you ask for many modes the phantoms can appear at the high end; check ModalResult.frequency for an unphysical jump, e.g. with the sketch below.
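A quick check for that jump, assuming ModalResult.frequency is an ascending array in Hz (the result object and threshold are illustrative):

import numpy as np

f = np.asarray(result.frequency)  # result: a ModalResult from solve_modal
# Phantom modes sit near omega^2 ~ trace(K)/eps, orders of magnitude above the
# physical band — flag any step of 1000x or more between consecutive modes.
ratio = f[1:] / np.maximum(f[:-1], 1e-30)
if (ratio > 1e3).any():
    k = int(np.argmax(ratio > 1e3)) + 1
    print(f"possible phantom modes from index {k}: f[{k}] = {f[k]:.3e} Hz")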

References#

  • Bench data and the script that produced it: tests/perf/bench_modal_threads.py (PR #798).

  • PBS rotor model: femorph/pbs-demo.

  • PRIMME wheels: femorph/pprime (in-cluster devpi or GitHub Releases).