# Thread control
femorph-solver integrates with threadpoolctl to cap BLAS / OpenMP
thread pools inside every solver. Without a cap, NumPy / SciPy
inherit the BLAS backend’s default — often the host physical core
count — which oversubscribes the CPU on shared runners and makes
dense LAPACK paths 5-10x slower from scheduler contention.
Three levers, in resolution order:
1. **Per-call kwarg.** Every solver accepts `thread_limit: int | None = None`:

   ```python
   solve_modal(K, M, prescribed={}, n_modes=10, thread_limit=4)
   ```

   An explicit integer caps threads for that call only.
2. **Process-wide default.** Set once and every subsequent solve picks it up:

   ```python
   import femorph_solver

   femorph_solver.set_thread_limit(4)
   ```
3. **Environment variable.** `FEMORPH_SOLVER_NUM_THREADS=<int>` seeds the process-wide default at import time. Convenient for CI / container entrypoints.
Passing thread_limit=None (the default) defers to the
process-wide default; if that’s also unset, no cap is applied and the
backend decides.
See `femorph_solver.set_thread_limit()` and
`femorph_solver.get_thread_limit()` in the API reference.
## What the thread cap actually covers
The cap flows to three independent thread pools that all otherwise default to the host core count:
- **BLAS / LAPACK (OpenBLAS, MKL)** — via threadpoolctl. Covers every `numpy` / `scipy.linalg` routine, including dense eigenpairs, factor solves, and the `cholmod` / `pardiso` internal multithreading.
- **Pardiso MKL** — inherits the BLAS cap above plus the `iparm[2]` / `iparm[24]` / `iparm[25]` controls (parallel factorisation, forward, and back-substitution). The wrapper sets `iparm[25]=1` by default for parallel triangular solves; the thread cap scales that pool.
- **OpenMP inside `_core`** — the C++ extension's row-parallel assembly (`build_csr_pattern`, `scatter_into_csr`, `csr_submatrix_free`, `csr_matvec`, `csr_sym_matvec_triu`) reads the same `OMP_NUM_THREADS` environment variable that `threadpoolctl` honours.
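Because `OMP_NUM_THREADS` is read when the OpenMP runtime initialises, the most robust way to make all three pools agree in CI or container entrypoints is to export the caps before the solver process starts. A stdlib-only sketch (the launched child command is purely illustrative):

```python
import os
import subprocess
import sys

# Export both variables before launching the worker: OMP_NUM_THREADS is
# read by _core's OpenMP runtime when it initialises, and
# FEMORPH_SOLVER_NUM_THREADS seeds the process-wide default at import time.
env = dict(os.environ)
env["FEMORPH_SOLVER_NUM_THREADS"] = "4"
env["OMP_NUM_THREADS"] = "4"

# Illustrative child process; a real entrypoint would run the solver script.
proc = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"],
    env=env, capture_output=True, text=True,
)
print(proc.stdout.strip())  # -> 4
```

Setting the variables inside an already-running process works only if no OpenMP-backed extension has been imported yet, which is why the entrypoint approach is the safer default.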
## Picking a good cap
- **CI runners** — the default GitHub-hosted runner advertises 4 CPUs regardless of actual host topology; cap at 4 explicitly.
- **Workstation, dense LAPACK-bound** — cap at physical cores (not counting SMT / hyperthreads); `lscpu | awk '/^Core\(s\)/'` is a good diagnostic. Over-subscribing SMT pairs in dense BLAS typically slows things down 10-20 % on modern x86.
- **Workstation, sparse-factor-bound** — same rule. Pardiso on Intel hardware stops scaling past ~8 threads on most FE stiffness patterns; beyond that, cross-socket NUMA traffic backfires.
- **Small problems (`n_free < 50 k`)** — the thread spawn / barrier overhead of `_core`'s OpenMP pragmas starts to dominate. `_PARALLEL_MATVEC_NNZ_THRESHOLD` (default 1 M nnz) in `solvers/eigen/_arpack.py` falls back to SciPy's single-threaded matvec below that point; the assembly path has its own implicit break-even around 2000 cells.
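A small heuristic that mirrors the guidance above can be handy in driver scripts. This is entirely illustrative: `pick_thread_cap` is not femorph-solver API, and the 50 k break-even is the rule of thumb quoted above, not a measured constant.

```python
import os

def pick_thread_cap(n_free, default_cap=4):
    """Pick a conservative thread cap from problem size.

    Below the small-problem break-even, thread spawn / barrier overhead
    dominates, so run single-threaded; otherwise use a conservative
    default, never exceeding the CPUs the OS reports.
    """
    if n_free < 50_000:
        return 1
    return min(default_cap, os.cpu_count() or 1)
```

On workstations, the physical core count would be a better ceiling than `os.cpu_count()` (which includes SMT threads), but there is no portable stdlib way to query it; `lscpu` output or a package like psutil fills that gap.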
Nightly benchmark runs cap at 4 threads by default to match the
`FEMORPH_SOLVER_NUM_THREADS=4` environment in T5-P17 perf-trend
CI; treat 4 as the conservative default unless you're on a 16+ core
workstation.