Thread control#

femorph-solver integrates with threadpoolctl to cap BLAS / OpenMP thread pools inside every solver. Without a cap, NumPy / SciPy inherit the BLAS backend’s default — often the host physical core count — which oversubscribes the CPU on shared runners and makes dense LAPACK paths 5-10x slower from scheduler contention.

Three levers, in resolution order:

  1. Per-call kwarg. Every solver accepts thread_limit: int | None = None:

    solve_modal(K, M, prescribed={}, n_modes=10, thread_limit=4)
    

    An explicit integer caps threads for that call only.

  2. Process-wide default. Set once and every subsequent solve picks it up:

    import femorph_solver
    femorph_solver.set_thread_limit(4)
    
  3. Environment variable. FEMORPH_SOLVER_NUM_THREADS=<int> seeds the process-wide default at import time. Convenient for CI / container entrypoints.

Passing thread_limit=None (the default) defers to the process-wide default; if that’s also unset, no cap is applied and the backend decides.

See femorph_solver.set_thread_limit() and femorph_solver.get_thread_limit() in the API reference.

What the thread cap actually covers#

The cap flows to three independent thread pools that all otherwise default to the host core count:

  • BLAS / LAPACK (OpenBLAS, MKL) — via threadpoolctl. Covers every NumPy / scipy.linalg routine (dense eigenpairs, factor solves) as well as the internal multithreading of the cholmod and pardiso backends.

  • Pardiso MKL — inherits the BLAS cap above plus iparm[2] / iparm[24] / iparm[25] controls (parallel factor, forward, back-sub). The wrapper sets iparm[25]=1 by default for parallel triangular solves; the thread cap scales that pool.

  • OpenMP inside _core — the C++ extension’s row-parallel assembly (build_csr_pattern, scatter_into_csr, csr_submatrix_free, csr_matvec, csr_sym_matvec_triu) reads the same OMP_NUM_THREADS environment variable that threadpoolctl honours.
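The first bullet is the standard threadpoolctl pattern, which can be demonstrated directly. This is a generic sketch of that library's public API, not femorph-solver's wrapper code:

```python
import numpy as np
from threadpoolctl import threadpool_limits, threadpool_info

# Cap every discovered BLAS / OpenMP pool for the duration of the context.
with threadpool_limits(limits=4):
    np.linalg.eigh(np.eye(100))     # dense LAPACK call, runs on <= 4 threads
    for pool in threadpool_info():  # one dict per discovered thread pool
        # inside the context, each pool reports at most 4 threads
        assert pool["num_threads"] <= 4
```

On exit from the `with` block, threadpoolctl restores each pool's previous thread count, which is what makes the per-call `thread_limit` kwarg safe to mix with an uncapped process.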

Picking a good cap#

  • CI runners — the default GitHub-hosted runner advertises 4 CPUs regardless of actual host topology; cap at 4 explicitly.

  • Workstation, dense LAPACK-bound — cap at physical cores (not including SMT/hyperthreads). lscpu | awk '/^Core\(s\)/' is a good diagnostic. Over-subscribing SMT pairs in dense BLAS typically slows things down 10-20% on modern x86.

  • Workstation, sparse-factor-bound — same rule. Pardiso on Intel hardware stops scaling past ~8 threads on most FE stiffness patterns; beyond that, cross-socket NUMA traffic costs more than the extra threads gain.

  • Small problems (n_free < 50 k) — the thread spawn / barrier overhead on _core’s OpenMP pragmas starts to dominate. _PARALLEL_MATVEC_NNZ_THRESHOLD (default 1 M nnz) in solvers/eigen/_arpack.py falls back to scipy’s single-threaded matvec below that point; the assembly path has its own implicit break-even around 2000 cells.
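Those rules can be folded into a small heuristic. A sketch assuming psutil for the physical-core count (with a logical-count fallback) and the ~8-thread Pardiso ceiling noted above; the function name is hypothetical:

```python
import os

def pick_thread_cap(ceiling=8):
    """Conservative cap: physical cores, never SMT siblings, bounded above."""
    try:
        import psutil                            # assumption: psutil available
        cores = psutil.cpu_count(logical=False)  # physical cores only
    except ImportError:
        cores = None
    if cores is None:                            # psutil missing or undetermined
        cores = os.cpu_count() or 1              # logical count as a fallback
    return max(1, min(cores, ceiling))
```

The logical-count fallback overshoots on SMT machines, so a machine-specific cap (or the lscpu check above) is still preferable when you control the host.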

Nightly benchmark runs cap at 4 threads by default to match the FEMORPH_SOLVER_NUM_THREADS=4 environment in T5-P17 perf-trend CI; treat 4 as the conservative default unless you’re on a 16+ core workstation.