Thread control
==============

femorph-solver integrates with :mod:`threadpoolctl` to cap the BLAS / OpenMP
thread pools inside every solver. Without a cap, NumPy / SciPy inherit the
BLAS backend's default, which is often the host's physical core count; on
shared runners this oversubscribes the CPU and makes dense LAPACK paths
5-10x slower through scheduler contention.

There are three levers, in resolution order:

1. **Per-call kwarg.** Every solver accepts
   ``thread_limit: int | None = None``::

       solve_modal(K, M, prescribed={}, n_modes=10, thread_limit=4)

   An explicit integer caps threads for that call only.

2. **Process-wide default.** Set it once and every subsequent solve picks
   it up::

       import femorph_solver

       femorph_solver.set_thread_limit(4)

3. **Environment variable.** ``FEMORPH_SOLVER_NUM_THREADS`` seeds the
   process-wide default at import time, which is convenient for CI and
   container entrypoints.

Passing ``thread_limit=None`` (the default) defers to the process-wide
default; if that is also unset, no cap is applied and the backend decides.
See :func:`femorph_solver.set_thread_limit` and
:func:`femorph_solver.get_thread_limit` in the API reference.

What the thread cap actually covers
-----------------------------------

The cap flows to three independent thread pools that would otherwise all
default to the host core count:

- **BLAS / LAPACK** (OpenBLAS, MKL), via ``threadpoolctl``. This covers
  every ``numpy`` / ``scipy.linalg`` routine, including dense eigenpairs,
  factor solves, and the internal multithreading of ``cholmod`` /
  ``pardiso``.
- **Pardiso MKL**, which inherits the BLAS cap above plus the
  ``iparm[2]`` / ``iparm[24]`` / ``iparm[25]`` controls (parallel
  factorization, forward substitution, and back substitution). The wrapper
  sets ``iparm[25]=1`` by default for parallel triangular solves; the
  thread cap scales that pool.
- **OpenMP inside** ``_core``, the C++ extension's row-parallel assembly
  (``build_csr_pattern``, ``scatter_into_csr``, ``csr_submatrix_free``,
  ``csr_matvec``, ``csr_sym_matvec_triu``), which reads the same
  ``OMP_NUM_THREADS`` environment variable that ``threadpoolctl`` honours.

Picking a good cap
------------------

- **CI runners.** The default GitHub-hosted runner advertises 4 CPUs
  regardless of the actual host topology; cap at 4 explicitly.
- **Workstation, dense LAPACK-bound.** Cap at the physical core count,
  excluding SMT / hyperthreads; ``lscpu | awk '/^Core\(s\)/'`` is a quick
  diagnostic. Oversubscribing SMT pairs in dense BLAS typically costs
  10-20 % on modern x86.
- **Workstation, sparse-factor-bound.** Same rule. Pardiso on Intel
  hardware stops scaling past roughly 8 threads on most FE stiffness
  patterns; beyond that, cross-socket NUMA traffic makes it slower.
- **Small problems** (``n_free < 50k``). Thread spawn and barrier overhead
  in ``_core``'s OpenMP pragmas starts to dominate.
  ``_PARALLEL_MATVEC_NNZ_THRESHOLD`` (default 1 M nnz) in
  ``solvers/eigen/_arpack.py`` falls back to SciPy's single-threaded
  matvec below that point; the assembly path has its own implicit
  break-even around 2000 cells.

Nightly benchmark runs cap at 4 threads by default to match the
``FEMORPH_SOLVER_NUM_THREADS=4`` environment in the ``T5-P17`` perf-trend
CI; treat 4 as the conservative default unless you're on a 16+ core
workstation.
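The three-lever resolution order can be sketched in a few lines of plain
Python. This is a hypothetical illustration, not femorph-solver's actual
implementation: ``_process_default`` and ``_resolve_thread_limit`` are
invented names, and the real wrapper goes on to apply the resolved cap via
:mod:`threadpoolctl` rather than just returning it::

    import os

    # Process-wide default, seeded once from the environment at import time.
    _env = os.environ.get("FEMORPH_SOLVER_NUM_THREADS")
    _process_default = int(_env) if _env is not None else None

    def set_thread_limit(limit):
        """Install (or clear, with ``None``) the process-wide default."""
        global _process_default
        _process_default = limit

    def get_thread_limit():
        """Return the current process-wide default, or ``None``."""
        return _process_default

    def _resolve_thread_limit(per_call):
        """Per-call kwarg wins, then the process-wide default, then no cap."""
        return per_call if per_call is not None else _process_default

A solver would call ``_resolve_thread_limit(thread_limit)`` on entry and,
when the result is an integer, run its linear algebra inside a
``threadpoolctl`` limits context.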
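The sizing advice above can be condensed into a small helper. This is a
sketch, not part of femorph-solver's API: ``pick_thread_cap`` is an invented
name, and note that ``os.cpu_count()`` reports *logical* CPUs, so on SMT
hardware it over-counts the physical cores the guidance targets::

    import os

    def pick_thread_cap(max_cap=4):
        """Conservative cap: min(logical CPUs, max_cap), never below 1.

        ``os.cpu_count()`` counts SMT siblings, so this over-counts
        physical cores on hyperthreaded hosts; the default ``max_cap=4``
        mirrors the conservative default used by the nightly benchmarks.
        """
        logical = os.cpu_count() or 1
        return max(1, min(logical, max_cap))

On a 16+ core workstation you would raise ``max_cap`` toward the physical
core count, subject to the ~8-thread Pardiso scaling plateau noted above.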