Solver defaults and threading#
This page documents why femorph-solver picks the defaults it does
and how to change them. Every default here was set after a
benchmark sweep on a real model — citations point at
tests/perf/ artifacts and the bench script that produced the
numbers.
Smart thread default#
At import time, `femorph_solver` sets the BLAS / OpenMP thread
cap to `min(8, host_cores)`. The cap is applied two ways:

- `OMP_NUM_THREADS` / `MKL_NUM_THREADS` / `OPENBLAS_NUM_THREADS` are seeded so BLAS pools come up at the right size from the start.
- `femorph_solver._threads.get_thread_limit()` returns the same number, used by the `thread_limit=None` default in every solver entry point.
User-set env vars are never overwritten — if you’ve exported
OMP_NUM_THREADS=2, that wins.
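A minimal sketch of that seeding behaviour (illustrative only, not the actual `femorph_solver._threads` source):

```python
import os

def _seed_thread_env() -> int:
    """Cap BLAS / OpenMP pools at min(8, host cores) without clobbering user settings."""
    limit = min(8, os.cpu_count() or 1)
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ.setdefault(var, str(limit))  # user-exported values win
    return limit
```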
CPU-affinity pin (opt-in)#
Per-pool caps don’t actually limit total threads — MKL Pardiso, OpenBLAS, libmumps’s internal pthread pool, and the Python OpenMP runtime each spawn their own pool, so total live threads on the PBS-rotor cyclic bench reach 23+ even with every cap env var set to 8. The kernel happily schedules all 23 across every core on the host: the cap is “honoured” per-pool, but the process still uses the whole machine.
Set `FEMORPH_SOLVER_AFFINITY` to opt into a strict CPU-affinity
pin (the same mechanism `taskset -c 0-N` uses, applied
in-process):

| `FEMORPH_SOLVER_AFFINITY` | Result |
|---|---|
| unset / empty | no pin (default) |
| a core count | pin to first 8 cores |
| a comma-separated list of CPU IDs | pin to those CPU IDs |
| a range (`8-15`) | pin to cores 8–15 (range) |
| a list combined with a range | mix of list + range |
Why opt-in: setting FEMORPH_SOLVER_NUM_THREADS=4 is a per-pool
hint, not a “pin to 4 cores” demand. CI workflows and batch
scripts often set the cap as a budget signal without expecting
pinning — pinning a worker process to 4 cores while xdist legitimately
needs access to more cores would deadlock the test pool. Decoupling
the cap from the pin lets users set both intentionally.
You can also pin at runtime via `femorph_solver.pin_cpu_affinity()`
(`pin_cpu_affinity(8)` ties the current process to cores 0–7).
Linux-only; macOS / Windows / restricted-syscall containers fall
through as a soft no-op.
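For reference, a minimal sketch of the underlying Linux mechanism (`os.sched_setaffinity` on PID 0, i.e. the current process); the library's `pin_cpu_affinity()` is the supported entry point:

```python
import os

def pin_first_n_cores(n: int) -> None:
    """Pin the current process to cores 0 .. n-1 (Linux only)."""
    try:
        os.sched_setaffinity(0, set(range(n)))
    except (AttributeError, OSError):
        # macOS / Windows / restricted-syscall containers: soft no-op,
        # mirroring the behaviour described above.
        pass
```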
Why 8?#
PBS rotor (`pbs-sector-hd.cdb`, 494,766 DOFs, MUMPS factor +
ARPACK σ=0 shift-invert, `n_modes=4`) on a 32-core host:

| Threads (kwarg) | smart default | kwarg-only (no env vars) | env-only |
|---|---|---|---|
| 1 | 35.14 s | 54.59 s | 50.89 s |
| 2 | 37.23 s | 47.77 s | 40.79 s |
| 4 | 34.38 s | 42.14 s | 35.58 s |
| 8 | 35.06 s | 41.38 s | 34.03 s |
| 16 | 34.33 s | 39.81 s | 34.61 s |
| 32 | 36.42 s | 43.64 s | 36.26 s |
Three observations:
- 8 threads is the sweet spot. 1 → 8 buys 1.5×; 8 → 16 buys nothing; 32 over-subscribes the 32 cores and slows down. The eigensolve loop serialises through ARPACK iterations + scipy `splu` (cyclic-augment helper); only the MUMPS factor parallelises freely.
- kwarg-only without env vars is 5–7 s slower than the env-var path because BLAS / MUMPS pools are sized at pool-init time; `threadpoolctl` can shrink an existing pool but can’t grow it past its init-time max. The smart default makes this difference disappear by setting the env vars before any BLAS-loaded library imports (you can inspect the live pools with the snippet after this list).
- Smart default is `min(8, ncores)`, so a 4-core laptop gets 4, a 32-core workstation gets 8, a 96-core EPYC gets 8. Above 8 the structural-K factor stops scaling and the OS scheduler dominates.
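To see which pool sizes the BLAS runtimes actually came up with in your own process, threadpoolctl's inspection API is enough:

```python
import threadpoolctl

# Each entry is one native thread pool (OpenBLAS, MKL, OpenMP, ...) reported
# with the size it is currently running at.
for pool in threadpoolctl.threadpool_info():
    print(pool["internal_api"], pool["num_threads"])
```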
Opt-out and override#
Set `FEMORPH_SOLVER_NUM_THREADS` before importing
`femorph_solver`:

| `FEMORPH_SOLVER_NUM_THREADS` | Result |
|---|---|
| unset (default) | smart cap: `min(8, host_cores)` |
| `0` | disable cap (host cores) |
| `4` | cap at 4 |
You can also call `femorph_solver.set_thread_limit()` at
runtime, but only the threadpoolctl-managed BLAS pools
will resize — MUMPS / nested-OpenMP teams keep their
init-time pool size. Always prefer the env var if you need a
fresh cap from the start.
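For example, to cap everything at 4 threads from the very first import, using the env var and accessor documented above (the env var must be set before `femorph_solver` or any BLAS-backed library is imported):

```python
import os

os.environ["FEMORPH_SOLVER_NUM_THREADS"] = "4"  # must happen before the import below

import femorph_solver

print(femorph_solver._threads.get_thread_limit())  # 4
```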
Linear-backend defaults#
`Model.solve_static(linear_solver="auto")` resolves to the
fastest installed direct backend:

| Priority | Backend | Picked when… |
|---|---|---|
| 1 | Pardiso | `pypardiso` is importable. |
| 2 | CHOLMOD | `scikit-sparse` is importable and `K` is SPD. |
| 3 | MUMPS | `python-mumps` is importable. |
| 4 | UMFPACK | `scikit-umfpack` is importable. |
| 5 | SuperLU | SciPy default; no extra install. |
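A hedged sketch of how such a priority chain can be resolved by import probing (module names taken from the install notes below; this is not the actual `femorph_solver` resolution code):

```python
import importlib.util

# (backend name, module whose presence selects it), in priority order
_CHAIN = [
    ("pardiso", "pypardiso"),
    ("cholmod", "sksparse"),          # scikit-sparse
    ("mumps", "mumps"),               # python-mumps
    ("umfpack", "scikits.umfpack"),   # scikit-umfpack
    ("superlu", "scipy"),             # always present
]

def _installed(module: str) -> bool:
    try:
        return importlib.util.find_spec(module) is not None
    except ModuleNotFoundError:       # parent package missing
        return False

def resolve_auto_backend() -> str:
    for backend, module in _CHAIN:
        if _installed(module):
            return backend
    return "superlu"
```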
Why this order, and per-backend gotchas:
Pardiso#
- Fastest on Intel-class CPUs, period. Beats CHOLMOD by 20–40 % on dense 3-D solid models because the two-pass factorisation amortises the symbolic step better.
- Honours `MKL_NUM_THREADS` at init time; threadpoolctl manages it correctly.
- Ships via `pip install pypardiso`, which pulls in MKL — not redistributable in commercial builds. License: MKL EULA (proprietary).
- Cluster nodes often have MKL globally; check before installing the wheel.
CHOLMOD#
- Competitive with Pardiso on sparse-graph models (beam-shell meshes); pays a constant fill-in overhead on dense connectivity.
- Symmetric-positive-definite only — the auto-chain falls through to MUMPS for indefinite K.
- CeCILL / GPL-compatible — redistributable.
- Install: `pip install scikit-sparse` + system `libsuitesparse-dev`.
MUMPS#
- Best for out-of-core (`Solver(K, ooc=True)`) — spills the factor to `$MUMPS_TMPDIR`, so RAM stops being the ceiling. Pardiso’s OOC mode is less battle-tested.
- On a single shared-memory node MUMPS is 1.5–3× slower than Pardiso at the same DOF count and uses 10–30 % more factor memory. Pick it for OOC, not in-core speed.
- Ships its own OpenMP runtime that `threadpoolctl` cannot cap. femorph-solver sets `ICNTL(16)` from the current `thread_limit` after `analyze()` so the kwarg path works without env vars (PR #798).
- Multi-solve gotcha: libmumps’s `JOB=-2` (terminate) segfaults on large factors with `free(): invalid next size`. `MUMPSSolver.__del__` skips that call and lets the OS reclaim allocations at process exit — multiple solves in one process are now safe, but each large factor leaks until the process ends. See PR #798.
- Install: `pip install python-mumps` + system `libmumps`.
UMFPACK#
- Stable, mature, but no parallel factor. Slower than Pardiso / CHOLMOD on multi-core; useful when you want a deterministic single-threaded factor (CI reproducibility testing).
- Installing `scikit-umfpack` is fragile on slim CI images (the meson + swig + `Python.h` chain often breaks). We don’t add it to the aggregate `[solvers]` extra; install manually if needed.
SuperLU#
- Pure-SciPy fallback. No external dep. ~2–3× slower than the optimised backends; on the PBS rotor in our bench it also OOMs for the 12k-DOF cyclic-augmented operator.
- Always available — the safety net.
Eigen-backend defaults#
Model.solve_modal(eigen_solver="auto") always resolves to
ARPACK as of 2026-05. PRIMME stays registered (the
cyclic-modal solver uses it through an OPinv-aware code path),
but the simple modal autotune doesn’t pick it. See
PBS rotor — solver head-to-head at 8 threads for the rationale and the bench
numbers that motivated the change.
Quick version: ARPACK + a sparse direct linear backend
(`mkl_direct` / `pardiso` / `mumps`) beats MAPDL Block
Lanczos by 25 % on the 500k-DOF PBS rotor at 8 threads. The
~10–20 % marginal speedup PRIMME might offer on a narrow size
band (10k–100k DOFs, when its no-OPinv Davidson path converges)
isn’t worth exposing the matvec-error footgun PRIMME hits on
larger / ill-conditioned K. Users who want PRIMME explicitly
can still pass `eigen_solver="primme"`.
ARPACK#
- SciPy default (`scipy.sparse.linalg.eigsh`): Lanczos iteration with implicit restart (Sorensen 1992).
- Best for shift-invert (`sigma != 0`) — PRIMME’s Python binding doesn’t expose the sigma-shift mode. See the snippet after this list.
- Always available; no extra dep.
- The autotune’s only choice for `solve_modal` post-2026-05.
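The σ=0 shift-invert call maps directly onto SciPy's `eigsh`; a self-contained stand-in problem (a small SPD matrix, not the PBS rotor) showing the pattern:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

n = 2000
K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")  # SPD stand-in stiffness
M = sp.identity(n, format="csc")                                         # unit-mass stand-in

# Shift-invert about sigma=0: ARPACK iterates on (K - sigma*M)^-1, so the
# lowest modes converge in a handful of iterations.
w2, phi = eigsh(K, k=4, M=M, sigma=0, which="LM")
freq_hz = np.sqrt(np.abs(w2)) / (2.0 * np.pi)
```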
PRIMME#
- Block-Davidson with GD+k restart (Stathopoulos & McCombs 2010) — beats ARPACK in the small-modal sweet spot (4–10 modes) on real-world structural matrices, when an OPinv is supplied. See the snippet after this list.
- Honours `threadpoolctl` thread limits — PRIMME calls scipy/numpy BLAS internally and doesn’t ship its own OpenMP runtime.
- Optional dep: not in the core `femorph-solver` install because the upstream PyPI package ships an sdist only — every `pip install primme` triggers a 60–90 s source build that needs gcc + gfortran + BLAS / LAPACK headers. On NixOS / minimal Ubuntu images the link step fails with `ld: cannot find -llapack`.
Installing PRIMME via femorph/pprime#
`femorph/pprime` is the internal repo that builds proper
`manylinux_2_28` PRIMME wheels (cp310–cp314) so every customer
host can `pip install` without a Fortran toolchain.
In-cluster (femorph kubernetes pods):

```bash
pip install --index-url \
    http://devpi.femorph-devpi.svc.cluster.local/root/femorph/+simple/ \
    primme
```
External hosts (workstations, executor) — install from the GitHub release attached to the latest tag:

```bash
gh release download v3.2.3 --repo femorph/pprime \
    --pattern "*cp$(python -c 'import sys; print(f"{sys.version_info.major}{sys.version_info.minor}")')*" \
    -D /tmp/
pip install /tmp/primme-3.2.3-cp*.whl
```
Dev workstation note: if you’re on executor or a similar
NixOS box, install via the GitHub-release path above — the
manylinux wheel bundles libgfortran / libblas / libgomp via
`auditwheel repair`, so it Just Works without a system
Fortran.
The fallback (no wheel for your interpreter) is the upstream PyPI source build:

```bash
sudo apt install gfortran libblas-dev liblapack-dev
pip install primme
```
After install, pass `eigen_solver="primme"` explicitly — it is most
competitive when `n_modes >= 4` and `sigma == 0`; the plain
`solve_modal` autotune still resolves to ARPACK (see Eigen-backend
defaults above).
Verifying PRIMME respects thread limits#
This took two tries to nail down, and the answer is more subtle than “the cap works”.
5,000-DOF banded SPD problem, `primme.eigsh` with
`threadpoolctl.threadpool_limits(N)` applied at runtime, no env vars
set at process start:
| Threads (cap) | wall time | CPU time | effective threads |
|---|---|---|---|
| 1 | 0.28 s | 1.10 s | 3.9 (!) |
| 2 | 0.29 s | 1.23 s | 4.3 |
| 4 | 0.29 s | 1.38 s | 4.7 |
| 8 | 0.29 s | 1.18 s | 4.1 |
| 16 | 0.28 s | 1.17 s | 4.2 |
| 32 | 0.28 s | 1.35 s | 4.8 |
Threadpoolctl reports the BLAS pool capped at the requested
size (`pool=[1, 1]` for cap=1), but CPU time is 4–5× wall
time regardless — the runtime cap isn’t actually limiting
parallel work. Same problem, same `threadpool_limits(1)`
context, but with `OMP_NUM_THREADS=1` set before Python
starts:
| Threads (cap) | wall time | CPU time | effective threads |
|---|---|---|---|
| 1 | 0.29 s | 0.29 s | 1.0 (clean) |
What’s happening:
- The `pprime` wheel bundles `libblas-fe34f726.so.3.8.0` — reference netlib BLAS, no OpenMP / pthreads symbols (`nm -D` confirms). PRIMME’s dense subspace algebra is single-threaded through this bundled BLAS.
- The 4–5 effective threads come from numpy + scipy’s `libscipy_openblas` pool — 32 worker threads pre-spawned at process start. `threadpool_limits(N)` adjusts the active worker count per call, but cannot un-spawn the pool’s pre-existing threads. Their idle scheduling and memory accesses still burn CPU even when the active count is supposed to be 1.
- Setting `OMP_NUM_THREADS=N` before `import numpy` inits the pool with exactly N threads → no over-spawn, no phantom CPU burn. This is what femorph-solver’s smart default does.
Bottom line: PRIMME does respect threadpoolctl, but the
realised parallel work depends on (1) whether PRIMME’s
inner BLAS calls go through scipy’s OpenBLAS or the bundled
netlib BLAS — pprime’s wheel uses the latter — and (2)
whether OpenBLAS was init’d with the right thread cap. The
smart default + OMP_NUM_THREADS env-var seeding handles
both correctly.
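To reproduce the wall-time / CPU-time measurement above on your own host, here is a sketch on a stand-in banded SPD matrix; `threadpool_limits` is threadpoolctl's context manager for runtime caps:

```python
import time
import scipy.sparse as sp
from threadpoolctl import threadpool_limits
import primme

n = 5000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # banded SPD stand-in

with threadpool_limits(limits=1):
    wall0, cpu0 = time.perf_counter(), time.process_time()
    primme.eigsh(A, k=4, which="SA")
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0

print(f"wall {wall:.2f} s, cpu {cpu:.2f} s, effective threads {cpu / wall:.1f}")
```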
Caveat for the maintainer: if you want PRIMME itself to
parallelise (not just the scipy BLAS calls it makes), the
pprime wheel would need to be rebuilt against
multi-threaded OpenBLAS or MKL. That’s tracked separately;
for the typical small-modal use case (4-10 modes, sparse K)
PRIMME’s serial path is already faster than ARPACK.
Iterative backends#
For models that don’t fit in the direct-factor RAM ceiling (roughly > 1 M DOFs in a 32-GB workstation):
CG + AMG preconditioner#
- `Model.solve_static(linear_solver="cg+pyamg")` — conjugate gradient with a classical algebraic-multigrid preconditioner via `pyamg` (sketch after this list).
- SPD only. Memory scales as \(O(N)\) instead of the direct factor’s \(O(N^{1.5})\).
- Cost is ~3–5× the direct factor’s wall time at the crossover (~1 M DOF), but still solvable when the direct factor would OOM.
- Install: `pip install pyamg`.
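A self-contained sketch of the CG + classical-AMG combination on a stand-in Poisson matrix; this is the pairing `linear_solver="cg+pyamg"` wires up for you, per this page:

```python
import numpy as np
import pyamg
from pyamg.gallery import poisson
from scipy.sparse.linalg import cg

A = poisson((200, 200), format="csr")                     # SPD stand-in, 40k DOFs
b = np.random.default_rng(0).standard_normal(A.shape[0])

ml = pyamg.ruge_stuben_solver(A)                          # classical (Ruge-Stuben) AMG hierarchy
M = ml.aspreconditioner()                                 # expose it as a CG preconditioner
x, info = cg(A, b, M=M)                                   # info == 0 on convergence
```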
GMRES#
- For non-SPD `K` (large rotation, contact, mixed-formulation shells): `Model.solve_static(linear_solver="gmres")`.
- No preconditioner by default — converges slowly on ill-conditioned `K`. Use sparingly.
Known issues and workarounds#
MUMPS double-free on teardown#
Pre-existing libmumps bug: large factors hit `free():
invalid next size` / `double free or corruption (!prev)`
during the C-level cleanup. `MUMPSSolver.__del__` skips
the libmumps `JOB=-2` call as of PR #798 — multiple solves
in one process are now stable, at the cost of a per-solve
memory leak (reclaimed at process exit).
If you observe this in a long-running service that never exits, switch to a subprocess pool: each solve runs in a fresh subprocess and the OS reclaims everything when the subprocess ends.
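A minimal sketch of that subprocess-pool pattern; `solve_one` is a hypothetical worker that loads the model and runs a single solve, and `max_tasks_per_child` needs Python 3.11+:

```python
from concurrent.futures import ProcessPoolExecutor

def solve_one(case_id: int) -> float:
    # Hypothetical worker: load the model, factor, solve, return a scalar.
    # Everything MUMPS leaked is reclaimed when this worker process exits.
    ...
    return 0.0

if __name__ == "__main__":
    # max_tasks_per_child=1 recycles the worker after every solve, so leaked
    # factor memory never accumulates in a long-running service.
    with ProcessPoolExecutor(max_workers=2, max_tasks_per_child=1) as pool:
        results = list(pool.map(solve_one, range(8)))
```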
Pardiso threading on AMD#
MKL Pardiso on AMD CPUs can fall back to a serial path
unless `MKL_DEBUG_CPU_TYPE=5` is exported (forces the
AVX2 code path). Doesn’t affect Intel.
Eigenvalue regularisation (ε I) artefacts#
When `M_ff` is rank-deficient (deck pins a load DOF, BEAM
with no DENS, …) the dense `la.eigh` fails and we add
ε = 1e-14 · trace(K) / N to M. Phantom modes go to
\(\omega^2 \approx \text{trace}(K) / \varepsilon\) —
very high frequency, doesn’t pollute the lowest `n_req`.
But if you ask for many modes the phantoms can appear at
the high end; check `ModalResult.frequency` for an
unphysical jump.
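A sketch of the regularisation described above, with the symbols and the 1e-14 factor taken from this section (dense path, matching the `la.eigh` context):

```python
import numpy as np

def regularise_mass(K: np.ndarray, M: np.ndarray) -> np.ndarray:
    n = K.shape[0]
    eps = 1e-14 * np.trace(K) / n   # eps = 1e-14 * trace(K) / N
    return M + eps * np.eye(n)      # phantom modes land near trace(K) / eps
```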
References#
- Bench data and the script that produced it: `tests/perf/bench_modal_threads.py` (PR #798).
- PBS rotor model: `femorph/pbs-demo`.
- PRIMME wheels: `femorph/pprime` (in-cluster devpi or GitHub Releases).