Solver exploration (extras)
===========================

This page documents an in-depth investigation into sparse SPD direct solvers
as alternatives to MKL Pardiso for femorph-solver's static and modal
pipelines. It's an "extras" page in the sense that none of it changes the
recommended default — that remains Pardiso — but it closes a pile of real
questions about why, and points at what to try if your constraints differ
from the common case.

.. contents::
   :local:
   :depth: 2

Why we looked
-------------

After landing the low-hanging memory-reduction PRs (grid-as-sole-truth, triu
K/M assembly, shared sparsity pattern, pypardiso's ``A.copy()`` elision,
factor-release on GC), peak RSS on a ``200×200×2`` SOLID185 modal solve sat
at ~2130 MB — about 2× MAPDL's in-core. The remaining gap is the Pardiso
Cholesky factor itself (iparm(17) = 1.14 GB peak during factorisation,
0.39 GB permanent). That's a hard floor for a direct solver on this mesh
topology.

Two plausible ways to move past that floor:

1. A solver that produces a smaller factor (MUMPS, CHOLMOD, STRUMPACK with
   HSS compression, SPRAL).
2. A solver with working out-of-core (OOC) that caps peak RSS regardless of
   problem size.

MKL Pardiso appeared to ignore ``iparm(60)`` OOC on pypardiso's bundled MKL
build — ``set_iparm(60, 2)`` warns "no input iparm" and the readback is 0.
The investigation below started as a systematic sweep of option 1 on that
assumption.

**Plot twist**: it turned out pypardiso's Python layer was filtering the
iparm dict — not MKL itself. A direct ctypes wrapper (agent-3's
:class:`DirectMklPardisoSolver`, PR #126) honours ``iparm(60)=2``, and the
pip-wheel MKL's OOC works. With a small fix to disable incompatible
fast-parallel flags during OOC, **we land −767 MB peak (−32%) at 200×200×2**
— the biggest memory win of the investigation. See the "Direct MKL Pardiso
with working OOC" section below for the details.
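The observed failure mode (a write to ``iparm(60)`` that warns and reads back 0) is consistent with a Python-side whitelist. A minimal sketch of that pattern, purely illustrative — the class name, warning text, and whitelist below are hypothetical, not pypardiso's actual code:

```python
# Hypothetical sketch of a Python-side iparm whitelist. pypardiso's real
# implementation differs; this only reproduces the symptom described above:
# the write is dropped with a warning and the readback stays 0.
_INPUT_IPARMS = {1, 7, 9, 10, 12, 20, 33}  # illustrative whitelist only


class IparmLayer:
    def __init__(self):
        self._iparm = [0] * 64

    def set_iparm(self, i, value):
        if i not in _INPUT_IPARMS:
            print(f"{i} is no input iparm")  # warning; write silently dropped
            return
        self._iparm[i] = value

    def get_iparm(self, i):
        return self._iparm[i]
```

A filter like this explains why calling the underlying ``pardiso_()`` entry point directly, with a raw 64-slot iparm array, behaves differently from going through the wrapper.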
The option-1 sweep (candidate direct solvers) still produced useful data —
it's how we ruled in MUMPS as a speed alternative and ruled out STRUMPACK
HSS / scalar simplicial Cholesky.

Candidate solvers considered
----------------------------

All are sparse direct SPD Cholesky-class solvers that could plausibly match
or beat MKL Pardiso:

.. list-table::
   :widths: 18 12 50 20
   :header-rows: 1

   * - Solver
     - Origin
     - Selling point
     - Final verdict
   * - **MKL Pardiso**
     - Intel
     - Supernodal, multithreaded. Production default.
     - Memory leader on realistic pipelines
   * - **MUMPS**
     - CERFACS / IRIT
     - Multifrontal, real OOC (if compiled in)
     - Speed alternative for modal; +313 MB memory at 200×200×2
   * - **CHOLMOD supernodal**
     - SuiteSparse
     - Supernodal, uses BLAS-3 heavily
     - −33 MB at K_ff stage, 3× slower; +8 MB on full pipeline
   * - **CHOLMOD simplicial**
     - SuiteSparse
     - Scalar Cholesky; very low memory per the docs
     - Pathologically slow at scale (19× factor) and denser factor
   * - **STRUMPACK exact**
     - LBNL
     - Multifrontal with optional HSS compression
     - +1 GB vs Pardiso; defeated before HSS even kicks in
   * - **STRUMPACK HSS**
     - LBNL
     - 2–4× factor compression on elliptic problems (theoretical)
     - HSS ratio on 3D hex elasticity is negative — more memory AND lower
       accuracy
   * - **Eigen SimplicialLDLT**
     - Guennebaud et al.
     - Header-only, no external deps
     - Slowest; disqualified on both axes
   * - **SPRAL SSIDS**
     - STFC UK
     - Modern multithreaded, low memory footprint per the literature
     - Runs once ``OMP_CANCELLATION=TRUE`` is set (initially returned
       ``flag = -53`` without it); fastest on modal but +1556 MB at
       200×200×2
   * - **PaStiX**
     - INRIA
     - Static scheduling, Scotch ordering
     - Not on NixOS; not attempted.
   * - **HSL MA87**
     - Harwell / STFC
     - Designed for minimum memory on SPD Cholesky
     - Academic license; not attempted.
The NixOS OpenBLAS 0.3.30 bug
-----------------------------

Any non-Pardiso solver that calls supernodal-style dense LAPACK kernels from
inside its own elimination tree immediately breaks when linked against
NixOS's ``openblas-0.3.30``. The symptom is a cascade of ``xerbla``-printed
errors followed by either a stack-smash detection or a segfault:

.. code-block:: text

   ** On entry to DGETRF parameter number 4 had an illegal value
   ** On entry to DPOTRF parameter number 4 had an illegal value
   ** On entry to DTRSM parameter number 11 had an illegal value
   ** On entry to DGEMM parameter number 13 had an illegal value
   *** stack smashing detected ***: terminated

This was reproduced on **three separate solvers** (STRUMPACK 7.2 and 8.0,
CHOLMOD supernodal, MUMPS) — all failing with the same pattern. ``ulimit -s
unlimited``, ``OMP_NUM_THREADS=1``, and a GCC-only toolchain (no
Clang/libomp mixing) did not help.

**Workaround**: ``LD_PRELOAD=libmkl_rt.so.2`` before importing the affected
solver. MKL's LAPACK handles the call patterns that NixOS's OpenBLAS 0.3.30
chokes on. Once preloaded, every tested solver runs cleanly with residuals
in the 1e-10 to 1e-15 band.

Root-cause chasing with AddressSanitizer localised the actual read fault in
STRUMPACK to :func:`dlaswp_plus` inside OpenBLAS — the row swap during the
solve phase dereferences a pivot-index array that :func:`dgetrf` returned
with undefined contents on degenerate fronts. Patches to STRUMPACK's
``BLASLAPACKWrapper.cpp::getrf`` and ``DenseMatrix::laswp`` guard that
specific corruption, but the remaining illegal-argument messages from
``dgemm``, ``dpotrf``, and ``dtrsm`` indicate the bug lives at several
OpenBLAS call sites, not one. Fixing this for the whole ecosystem means
either a new OpenBLAS release or a NixOS repackage — filing upstream is the
right move.
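Because ``LD_PRELOAD`` is consumed by the dynamic linker at process startup, the cleanest way to apply the workaround from Python is to relaunch the affected work in a child process with the variable already in its environment. A minimal sketch; the MKL path shown is illustrative, substitute the real location on your system:

```python
import os
import subprocess
import sys


def run_with_mkl_preload(argv, libmkl="/usr/lib/libmkl_rt.so.2"):
    # LD_PRELOAD must be present before the child's dynamic linker maps the
    # process, so it goes into the child's env dict. Setting it in
    # os.environ of an already-running interpreter has no effect on that
    # interpreter's own BLAS.
    env = dict(os.environ, LD_PRELOAD=libmkl)
    return subprocess.run(argv, env=env, capture_output=True, text=True)
```

If the path is wrong, glibc's loader prints a warning on stderr and continues with the default BLAS, so it is worth confirming (e.g. via ``/proc/<pid>/maps``) that MKL actually loaded.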
Benchmark methodology
---------------------

Each solver was benchmarked in three regimes on the SOLID185 flat plate that
:mod:`femorph_solver.perf.bench_memory` uses (the "perf" bench fixture):

1. **K_ff-only microbench** — assemble ``K_ff`` and ``M_ff`` via
   :meth:`Model._bc_reduce_and_release`, factor ``K_ff`` once, solve once,
   capture peak RSS. Measures raw solver overhead without pipeline noise.
2. **Static pipeline** — full ``model.solve(linear_solver=...)`` with a
   surface load, from fresh ``Model.from_grid`` through displacement return.
   Measures the solver's integration with assembly + BC reduce + solve.
3. **Modal pipeline** — full ``model.modal_solve(n_modes=10,
   eigen_solver='arpack', linear_solver=...)`` at σ=0 shift-invert. ARPACK
   drives ~30–50 inner solves per outer iteration, so this is where the
   solver's solve-phase overhead accumulates.

Each row runs in a **fresh subprocess** so ``ru_maxrss`` is clean. Thread
cap: 4 (MAPDL parity). Environment preloads MKL:

.. code-block:: bash

   OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 \
   OMP_WAIT_POLICY=PASSIVE KMP_BLOCKTIME=0 \
   LD_PRELOAD=/libmkl_rt.so.2 \
   python perf/bench_solvers.py …

Sizes: ``40×40×2``, ``80×80×2``, ``120×120×2``, ``160×160×2``,
``200×200×2``.

Results — K_ff-only microbench
------------------------------

Factor wall time (seconds):

.. list-table::
   :widths: 15 12 12 12 14 14 14
   :header-rows: 1

   * - size
     - Pardiso
     - MUMPS
     - CHOLMOD-sup
     - STRUMPACK
     - CHOLMOD-simp
     - Eigen LDLT
   * - 40×40×2
     - 0.32
     - **0.17**
     - 0.09
     - 0.38
     - 0.69
     - 0.31
   * - 80×80×2
     - **0.63**
     - 0.82
     - 0.59
     - 1.09
     - 3.12
     - 5.01
   * - 120×120×2
     - **1.31**
     - 2.69
     - 1.73
     - 4.43
     - 8.75
     - 32.6
   * - 160×160×2
     - **2.23**
     - 3.51
     - 7.22
     - 8.04
     - 36.6
     - 49.3
   * - 200×200×2
     - **3.48**
     - 4.38
     - 10.7
     - 16.8
     - 48.4
     - 127.6

Peak RSS (MB) and Δ vs Pardiso:

.. list-table::
   :widths: 15 10 12 14 14 14 12
   :header-rows: 1

   * - size
     - Pardiso
     - MUMPS Δ
     - CHOLMOD-sup Δ
     - STRUMPACK Δ
     - CHOLMOD-simp Δ
     - Eigen Δ
   * - 40×40×2
     - 261
     - −8
     - −9
     - +18
     - −10
     - −18
   * - 80×80×2
     - 504
     - −24
     - −14
     - +117
     - +14
     - +58
   * - 120×120×2
     - 930
     - −31
     - −27
     - +291
     - +49
     - +211
   * - 160×160×2
     - 1538
     - −46
     - −30
     - +585
     - +98
     - +531
   * - 200×200×2
     - 2379
     - **−75**
     - −33
     - +1021
     - +223
     - +937

At this measurement scale MUMPS saves 75 MB over Pardiso at 200×200×2.
**This turned out to be misleading** — see the next section.

Results — full static pipeline
------------------------------

.. list-table::
   :widths: 12 12 12 12 12 12 12
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - MUMPS wall
     - MUMPS peak
     - Δ wall
     - Δ peak
   * - 80×80×2
     - 0.69 s
     - 580 MB
     - **0.62 s**
     - 594 MB
     - −10%
     - +14 MB
   * - 120×120×2
     - 1.47 s
     - 1091 MB
     - 1.71 s
     - 1121 MB
     - +16%
     - +30 MB
   * - 160×160×2
     - 2.66 s
     - 1816 MB
     - 3.19 s
     - 1852 MB
     - +20%
     - +36 MB
   * - 200×200×2
     - **4.28 s**
     - **2772 MB**
     - 5.46 s
     - 2853 MB
     - +28%
     - +81 MB

Displacement fields match bit-for-bit. Pardiso wins on both axes on the full
static pipeline.

Results — full modal pipeline
-----------------------------

10 modes via ARPACK shift-invert at σ=0:

.. list-table::
   :widths: 12 12 12 12 12 12 12
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - MUMPS wall
     - MUMPS peak
     - Δ wall
     - Δ peak
   * - 80×80×2
     - 1.98 s
     - 476 MB
     - **1.54 s**
     - 516 MB
     - **−22%**
     - +40 MB
   * - 120×120×2
     - 5.25 s
     - 849 MB
     - **5.03 s**
     - 959 MB
     - −4%
     - +110 MB
   * - 160×160×2
     - 9.37 s
     - 1402 MB
     - **7.91 s**
     - 1592 MB
     - **−16%**
     - +190 MB
   * - 200×200×2
     - 15.19 s
     - **2138 MB**
     - **12.46 s**
     - 2451 MB
     - **−18%**
     - **+313 MB**

MUMPS is consistently 15–20 % faster on modal at the cost of +200–300 MB
peak RSS. The K_ff-only microbench's −75 MB reading does not generalise —
MUMPS's solve-phase workspace accumulates across ARPACK's ~30 inner solves,
more than erasing the single-solve win.
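The fresh-subprocess measurement the tables above rely on can be sketched as follows. This is a sketch, not the actual ``bench_solvers.py`` harness; note that Linux reports ``ru_maxrss`` in KiB:

```python
import resource
import subprocess
import sys


def child_peak_rss_mb(code: str) -> float:
    # Run the bench body in a fresh interpreter so its peak RSS is not
    # polluted by allocations already made in this process.
    subprocess.run([sys.executable, "-c", code], check=True)
    # RUSAGE_CHILDREN reports the max over all waited-for children; with
    # one child per measurement process, that is exactly the bench's peak.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024.0
```

Running each row in its own measurement process is what makes ``ru_maxrss`` meaningful, since a process's high-water mark never decreases.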
Drop-in parity: modal frequencies from ``linear_solver="mumps"`` and
``linear_solver="pardiso"`` agree to **max relative drift 2.5 × 10⁻¹⁰**
across all 10 modes on a 40×40×2 SOLID185 cantilever — well below the 1e-7
T1-P10 parity target.

STRUMPACK HSS — why the headline feature didn't land
----------------------------------------------------

STRUMPACK's HSS (Hierarchically Semi-Separable) compression is why we spent
the most effort there. The literature claims 2–4× factor compression for 3D
elliptic problems — if that held for SOLID185, STRUMPACK would undercut
Pardiso by several hundred MB.

Measured on the same mesh class, at ``compression_tol`` 1e-6 and 1e-8:

.. list-table::
   :widths: 15 15 15 15 15 15
   :header-rows: 1

   * - size
     - HSS 1e-6 peak (MB)
     - HSS 1e-6 residual
     - HSS 1e-8 peak (MB)
     - HSS 1e-8 residual
     - Pardiso peak (MB)
   * - 40×40×2
     - 279
     - 3.9e-10
     - 278
     - 3.9e-10
     - 261
   * - 80×80×2
     - 621
     - 3.8e-10
     - 620
     - 3.8e-10
     - 504
   * - 120×120×2
     - 1182
     - **4.0e-05**
     - 1185
     - 3.0e-09
     - 930
   * - 160×160×2
     - 2054
     - **6.3e-08**
     - 2056
     - **1.4e-06**
     - 1538
   * - 200×200×2
     - 3244
     - **5.1e-06**
     - 3251
     - **2.5e-06**
     - 2380

Two problems stack:

1. **Peak RSS exceeds Pardiso at every size.** STRUMPACK's own working set
   plus the compressed factor adds up to more than Pardiso's supernodal
   factor.
2. **Accuracy degrades to 10⁻⁶** at large sizes regardless of the
   ``compression_tol`` setting. For modal ARPACK shift-invert (which needs
   ~1e-10 inner-solve accuracy at ``tol=1e-12``) this would bleed into the
   outer eigenvalue accuracy.

SOLID185 is a first-order hex element. Its factor fronts have high
off-diagonal rank — the elasticity Green's function doesn't produce the
low-rank block structure HSS compression exploits. HSS's sweet spot is
elliptic integral operators and smooth-kernel discretisations, not
structural FEM on uniform hex meshes.

This isn't a build or configuration issue. It's STRUMPACK telling us
honestly that HSS has nothing to compress on our mesh class.
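For reference, the residual columns in the tables here are relative residuals; assuming the usual definition, the metric is:

```python
import numpy as np


def rel_residual(K, x, b):
    # ||K x - b|| / ||b||, the backward-error metric quoted in the tables.
    # K can be any dense or scipy-sparse operator supporting the @ product.
    r = K @ x - b
    return float(np.linalg.norm(r) / np.linalg.norm(b))
```

A direct solver lands this in the 1e-10 to 1e-15 band on these problems; values drifting up toward 1e-6 are what disqualify a factorisation for shift-invert work.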
CHOLMOD supernodal — memory-competitive, always slower
------------------------------------------------------

CHOLMOD supernodal (via ``scikit-sparse`` or a direct nanobind binding) runs
once the NixOS OpenBLAS bug is sidestepped with the MKL preload. On
K_ff-only it saves ~30 MB vs Pardiso at 200×200×2 but takes 3× longer to
factor. On the full modal pipeline the advantage disappears entirely.

Useful as a fallback for build environments where MKL isn't linkable but a
working OpenBLAS is — CHOLMOD is less picky about its BLAS than the
multifrontal solvers (STRUMPACK, MUMPS).

Eigen SimplicialLDLT — the "no external deps" option
----------------------------------------------------

Eigen's ``SimplicialLDLT<SparseMatrix<double>, Lower>`` is header-only and
runs cleanly on any BLAS. At 40×40×2 it's actually the smallest peak RSS of
everything tested (230 MB vs Pardiso's 261). Above that it loses on both
axes — scalar simplicial Cholesky with AMD ordering produces a denser
factor than supernodal with METIS, and it runs single-threaded. At
200×200×2 it's 176 seconds and 3316 MB — 50× slower than Pardiso and using
937 MB more memory.

Useful only as a last-resort fallback for environments where nothing else
links. Good to know it's correct; not something to use in anger.

Honest recommendation
---------------------

**Stay on Pardiso.** The night's investigation didn't find a solver that's
strictly better than Pardiso for femorph-solver's workload on this class of
mesh. The full-pipeline numbers are the ones that matter — not K_ff-only
microbenches — and on those Pardiso is the memory leader by 80–300 MB at
200×200×2.

**Use MUMPS if you need 15–20 % faster modal and have RAM to spare.** It's
a drop-in via a 60-line ``MumpsSolver`` adapter and reproduces Pardiso's
frequencies to 2.5 × 10⁻¹⁰.

**Don't bother with STRUMPACK HSS** on hex elasticity meshes — the
compression ratio is negative and the error is high.
**Skip CHOLMOD simplicial and Eigen LDLT** for any mesh beyond "trivial" —
they're either slow, memory-hungry, or both.

Build environment notes (for NixOS / nix-like systems)
------------------------------------------------------

* ``gfortran`` is often not on PATH when GCC is — export
  ``/nix/store/.../gfortran-wrapper-*/bin`` before any build that needs
  Fortran.
* NixOS GCC's default ``-fstack-protector-strong`` + ``-D_FORTIFY_SOURCE=2``
  mask VLA bugs in STRUMPACK's multifrontal path. Strip them with
  ``-U_FORTIFY_SOURCE -fno-stack-protector`` to surface the real failure at
  its source (not at the canary check).
* Mixing Clang's ``libomp`` and GCC's ``libgomp`` races over thread state.
  Pick one runtime end-to-end; on NixOS that means building C++ with
  ``CC=gcc CXX=g++``, since gfortran's ``libgomp`` is the default and
  OpenBLAS links against it.
* NixOS OpenBLAS 0.3.30 has a real bug affecting sparse-solver supernodal
  kernels. Workaround: ``LD_PRELOAD=libmkl_rt.so.2``.

SPRAL SSIDS — fast but memory-heavy on hex elasticity
-----------------------------------------------------

SPRAL was the single remaining candidate that might beat Pardiso on memory.
Its published benchmarks show it matching HSL MA87 and beating MUMPS on
several metrics while maintaining lower peak RSS than CHOLMOD on sparse SPD
problems.

**Getting SPRAL to run**: the SSIDS CPU factor requires the OpenMP
cancellation feature, which GCC's libgomp keeps off by default. The solver
returns ``inform%flag = -53`` — undocumented numerically, but at
``print_level=10`` it prints ``SSIDS CPU code requires OMP cancellation to
be enabled``. Fix: export ``OMP_CANCELLATION=TRUE`` **before libgomp
initialises** (i.e. before numpy / scipy import). Setting it from Python
via ``os.environ`` post-import is a no-op because libgomp has already
cached the runtime state. Subprocess launchers need it in the env dict
passed to ``Popen``.

**Measured results on the SOLID185 modal pipeline** (4 threads, MKL
preload, posdef=true):

.. list-table::
   :widths: 15 15 15 15 15 15
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - SPRAL wall
     - SPRAL peak
     - Δ peak vs Pardiso
   * - 80×80×2
     - 1.98 s
     - 476 MB
     - **1.88 s**
     - 752 MB
     - +276 MB
   * - 120×120×2
     - 5.25 s
     - 849 MB
     - **4.64 s**
     - 1450 MB
     - +601 MB
   * - 160×160×2
     - 9.37 s
     - 1402 MB
     - **8.46 s**
     - 2475 MB
     - +1073 MB
   * - 200×200×2
     - 15.19 s
     - **2138 MB**
     - **13.66 s**
     - 3694 MB
     - **+1556 MB**

**SPRAL beats Pardiso on wall time at every mesh size**, trading the speed
lead with MUMPS depending on size. But the peak-RSS cost is severe:
+276 MB at 80×80×2 widens to +1556 MB (+73%) at 200×200×2. SPRAL's
multifrontal factorisation allocates a much larger working set than
Pardiso's supernodal packing; the published memory advantages don't hold on
hex elasticity stiffness matrices.

So despite SPRAL being the last plausible memory candidate, it doesn't
deliver memory savings here either. It joins MUMPS as a **speed
alternative** — if the user values modal wall time and has RAM headroom,
SPRAL is 8–10% faster than Pardiso across sizes. If memory is binding, stay
on Pardiso.

Three-way speed comparison at 200×200×2:

.. list-table::
   :widths: 20 20 20 40
   :header-rows: 1

   * - solver
     - wall time
     - peak RSS
     - summary
   * - Pardiso
     - 15.19 s
     - 2138 MB
     - baseline, memory leader
   * - MUMPS
     - 12.46 s (−18%)
     - 2451 MB (+313 MB)
     - balanced speed / memory tradeoff
   * - SPRAL
     - 13.66 s (−10%)
     - 3694 MB (+1556 MB)
     - fastest at small-to-medium sizes, worst memory

Final recommendation unchanged: **Pardiso** stays the default. The two
viable alternatives are both speed wins at memory cost:

- **MUMPS** for a moderate speed boost (−15–20%) at moderate memory cost
  (+200–300 MB) — the best general-purpose alternative.
- **SPRAL** for maximum speed at small and medium sizes (−8–10%) at much
  higher memory cost (+300–1500 MB) — only for memory-abundant setups.

SPRAL integration adapter lives at
``mapdl-re/pyspral/src/pyspral/femorph_integration.py``.
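The environment-ordering constraint can be demonstrated directly: the variable has to be in the env dict handed to the child, because libgomp snapshots its environment at initialisation. A minimal sketch of the launcher side:

```python
import os
import subprocess
import sys

# OMP_CANCELLATION is read once, when libgomp initialises. Putting it in the
# child's environment before any numpy/scipy (and hence libgomp) import is
# the only reliable way to enable cancellation for SSIDS; mutating
# os.environ after import does nothing for the current process.
env = dict(os.environ, OMP_CANCELLATION="TRUE")
proc = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_CANCELLATION'])"],
    env=env, capture_output=True, text=True,
)
```

The same pattern applies to the benchmark runner: every subprocess it spawns inherits the flag through the env dict, so SSIDS never sees an uncancellable runtime.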
Drop-in via::

   _REGISTRY["spral"] = SpralSolver
   model.modal_solve(linear_solver="spral")

Requires ``OMP_CANCELLATION=TRUE`` in the environment and
``LD_PRELOAD=libmkl_rt.so.2`` to sidestep the NixOS OpenBLAS 0.3.30
cascade.

Direct MKL Pardiso with working OOC — the biggest memory win
------------------------------------------------------------

pypardiso silently swallows ``iparm(60)`` (the OOC mode selector). Call
``set_iparm(60, 2)`` through pypardiso and it warns ``2 is no input
iparm``; the readback after factorisation shows 0. The assumption that this
was an MKL limitation of the pip-wheel build turned out to be wrong — it's
pypardiso's Python layer filtering the iparm dict.

Agent-3's ``DirectMklPardisoSolver`` (in-tree at
:mod:`femorph_solver.solvers.linear._mkl_pardiso`, PR #126) calls
``pardiso_()`` via ctypes directly. Bypassing pypardiso unlocks the full
iparm array — including ``iparm(60)=2`` for an out-of-core factor. The
pip-wheel MKL's OOC implementation works as long as
``MKL_PARDISO_OOC_PATH`` points at a writable directory.

Design highlights of the in-tree solver:

* Reuses pypardiso's already-loaded MKL handle via
  ``pypardiso.PyPardisoSolver(mtype=11).libmkl`` — no duplicate ``dlopen``,
  same resolution heuristic.
* ``iparm(34)=1`` — 0-based CSR indices eliminate the ``+1`` index-rewrite
  copies pypardiso does on every solve (~65 MB saved per solve at
  150×150×2).
* ``iparm(23)=1`` + ``iparm(24)=2`` — "improved 2-level parallel"
  factorisation plus 2-level solve, roughly halving solve wall time vs the
  classical path.
* Phase 12 in one call (analyse + factorise).
* Phase -1 on ``__del__`` — the same MKL factor-leak fix as PR #107.

**OOC bug uncovered during this investigation**: on meshes at or above
160×160×2 (n_dof ≥ 231 840), ``ooc=True`` returned MKL error ``-2``
(insufficient memory) at phase=12.
Root cause: the ``iparm(23)=1`` + ``iparm(24)=2`` fast-parallel allocations
are incompatible with ``iparm(60)=2`` OOC on this MKL build — the
per-thread scratch sized for the whole frontal matrix is exactly what OOC
refuses to spill.

**Fix** (local branch ``fix/mkl-pardiso-ooc-disable-fast-parallel``): drop
iparm(23)/(24) back to the classical parallel path when OOC is requested.
Costs ~2× solve wall time in exchange for OOC actually working at scale.

With the patch applied:

.. list-table:: Agent-3 ``DirectMklPardisoSolver`` — in-core vs OOC
   :widths: 15 18 18 20 12 15
   :header-rows: 1

   * - size
     - in-core peak
     - OOC peak
     - Δ peak
     - in-core factor
     - OOC factor
   * - 80×80×2
     - 505 MB
     - 511 MB
     - +6 MB
     - 0.38 s
     - 0.67 s
   * - 120×120×2
     - 920 MB
     - 935 MB
     - +15 MB
     - 0.82 s
     - 1.26 s
   * - 160×160×2
     - 1526 MB
     - **1265 MB**
     - **−261 MB (−17%)**
     - 1.73 s
     - 2.49 s
   * - 200×200×2
     - **2371 MB**
     - **1604 MB**
     - **−767 MB (−32%)**
     - 2.50 s
     - 4.12 s

**This is the single biggest SPD-solver memory reduction measured across
the entire investigation.** The gap widens with problem size — at small
meshes OOC equals in-core because the factor fits in ``MAX_CORE_SIZE``; at
large meshes OOC streams the overflow to disk, capping peak RSS near
``MAX_CORE_SIZE`` + the K / M / eigensolve workspace. The residual is
identical (4.3e-11 on both at 200×200×2); factorisation is ~65% slower and
solves are 2–3× slower because disk IO sits in the critical path.

**Enablement**::

   # Required env vars — must be set before the first MKL call.
   export MKL_PARDISO_OOC_PATH=/tmp/mkl_pardiso_ooc
   export MKL_PARDISO_OOC_MAX_CORE_SIZE=500

   # In Python
   from femorph_solver.solvers.linear._mkl_pardiso import DirectMklPardisoSolver
   lu = DirectMklPardisoSolver(K_ff, assume_spd=True, ooc=True)

Or via the registry when the fix lands::

   model.modal_solve(linear_solver="mkl_direct_ooc")  # future registration

**Tradeoff context at 200×200×2**:

- Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor
- Pardiso (DirectMklPardiso, in-core): 2371 MB peak, 2.50 s factor
- **Pardiso (DirectMklPardiso, OOC)**: **1604 MB peak**, 4.12 s factor
- MUMPS: 2304 MB peak, 4.38 s factor
- SPRAL: 3694 MB peak, 13.66 s factor
- Eigen CG+IC0: 1091 MB peak at 160×160×2, 44× slower

Direct-MKL OOC delivers the largest peak-RSS reduction of any direct solver
tested while remaining competitive on wall time (~2× slower than in-core
Pardiso, still faster than MUMPS). This is the recommended
memory-constrained path for femorph-solver.

Iterative fallback — Eigen CG with IncompleteCholesky
-----------------------------------------------------

After exhausting direct solvers, the final angle was **preconditioned
conjugate gradient with incomplete Cholesky** (IC(0)) — pure iterative, no
factor to store. Memory is just the matrix, the IC preconditioner (roughly
the size of triu K), and a few CG basis vectors. If it converges in
reasonable time, this is the single lowest-memory option for SPD systems.

Measured on the K_ff-only pipeline (tol = 1e-6, max 100 000 iterations):

.. list-table::
   :widths: 15 12 14 14 15 14
   :header-rows: 1

   * - size
     - n_dof
     - CG wall
     - CG iters
     - CG peak
     - Pardiso peak
   * - 40×40×2
     - 14 760
     - 1.35 s
     - 1 468
     - **211 MB**
     - 261 MB (−50)
   * - 80×80×2
     - 58 320
     - 17.8 s
     - 3 368
     - **387 MB**
     - 504 MB (−117)
   * - 120×120×2
     - 130 680
     - 43.6 s
     - 3 325
     - **683 MB**
     - 930 MB (−247)
   * - 160×160×2
     - 231 840
     - 97.9 s
     - 4 242
     - **1091 MB**
     - 1538 MB (**−447**)

At 160×160×2 CG+IC(0) uses **29% less memory** than Pardiso.
The gap widens with problem size — this is the only candidate whose peak
actually shrinks relative to Pardiso in absolute terms as the mesh grows.

**The fatal catches**:

1. **44× slower** than Pardiso at 160×160×2 (97.9 s vs 2.23 s). The
   iteration count scales with :math:`\sqrt{\mathrm{cond}(K)}`, which for
   structured hex elasticity grows like :math:`h^{-1}`. Not scalable.
2. **Residual floor at 1e-6** — the IC(0) preconditioner isn't strong
   enough to drive CG below 1e-6 in any reasonable iteration budget on
   elasticity stiffness. ARPACK shift-invert at tolerance 1e-12 needs the
   inner solver at ~1e-10, so modal is out of reach; a 1e-6 residual is
   fine for static solves where the user accepts single-precision-ish
   accuracy.

**Use cases**:

- **Static solve on a severely memory-constrained machine**, where shaving
  29% off peak RSS matters more than a 40× speed hit and the user accepts
  1e-6 accuracy. Not the default — a niche override.
- **Not viable for modal** — the residual ceiling would bleed into ARPACK's
  eigenvalue accuracy.

This is genuinely the only memory-saving candidate solver we found (the
direct-MKL OOC path aside), and the tradeoff is steep. Exposed via
``pyeigensparse.CGICSolver``; bindings live at ``mapdl-re/pyeigensparse/``.

Direct MKL Pardiso wrapper — unlocking OOC
------------------------------------------

pypardiso silently swallows ``iparm(60)`` (the OOC mode selector) — calling
``set_iparm(60, 2)`` warns ``2 is no input iparm`` and the readback after
factorisation shows 0. We suspected an MKL limitation in the pip-wheel
build. It wasn't; it was pypardiso's Python layer. A 300-line wrapper
calling ``pardiso_()`` directly (no pypardiso) honours every iparm we set —
including ``iparm(60)=2`` (full out-of-core). The pip-wheel MKL's OOC works
as long as ``MKL_PARDISO_OOC_PATH`` points at a writable directory.

Package: ``mapdl-re/pymklpardiso/`` — direct ``pardiso_()`` wrapper via
nanobind.

**Measured on the SOLID185 K_ff-only bench** (4 threads,
MAX_CORE_SIZE=500 MB):

.. list-table::
   :widths: 15 18 18 18 15 15
   :header-rows: 1

   * - size
     - in-core peak
     - OOC peak
     - Δ peak
     - in-core factor
     - OOC factor
   * - 80×80×2
     - 487 MB
     - 487 MB
     - 0 (fits in core)
     - 0.30 s
     - 0.58 s
   * - 120×120×2
     - 906 MB
     - 911 MB
     - +5 MB
     - 0.74 s
     - 1.28 s
   * - 160×160×2
     - 1497 MB
     - **1214 MB**
     - **−283 MB (−19%)**
     - 1.41 s
     - 2.62 s
   * - 200×200×2
     - **2338 MB**
     - **1540 MB**
     - **−798 MB (−34%)**
     - 2.43 s
     - 4.24 s

**This is the biggest memory win of the investigation.** The gap grows with
problem size:

- Small meshes (<500 MB factor): OOC equals in-core — the core budget is
  big enough to hold everything, so MKL never spills.
- Large meshes (>500 MB factor): OOC streams the overflow to disk, capping
  peak RSS near ``MAX_CORE_SIZE`` + the K / M / eigensolve workspace.

Residual is identical (4.3e-11 on both at 200×200×2) — no accuracy cost.
Factor time is ~70% slower because disk IO is in the critical path; solve
time is ~6× slower for the same reason (each back-solve touches the on-disk
factor blocks).

**Tradeoff calculus at 200×200×2**:

- Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor.
- Pardiso (our direct wrapper, OOC): **1540 MB peak**, 4.24 s factor.
- MUMPS: 2304 MB peak, 4.38 s factor.
- SPRAL: 3694 MB peak, 13.66 s factor.

The direct-MKL OOC path beats every other solver tested on peak RSS while
remaining competitive on wall time.

**Integration**: ``pymklpardiso/src/pymklpardiso/femorph_integration.py``
exposes two adapter classes:

- ``MklPardisoOOCSolver`` — iparm(60)=2, registry name ``mklp_ooc``.
  Available only when ``MKL_PARDISO_OOC_PATH`` is set.
- ``MklPardisoInCoreSolver`` — iparm(60)=0, registry name ``mklp_direct``.
  Useful for A/B benches against pypardiso without OOC noise.
Users opt in via::

   from pymklpardiso.femorph_integration import MklPardisoOOCSolver

   _REGISTRY["mklp_ooc"] = MklPardisoOOCSolver
   # Set MKL_PARDISO_OOC_PATH + MKL_PARDISO_OOC_MAX_CORE_SIZE in env
   model.modal_solve(linear_solver="mklp_ooc")

The two required env variables mean ``available()`` returns False by
default — users explicitly opt in by exporting them.

**Note on the pypardiso vs direct-MKL behavioural difference**: we don't
yet know *exactly* what pypardiso does to iparm(60), only that the readback
differs. Worth filing an upstream issue with pypardiso — the current
behaviour silently downgrades OOC without telling the caller. Our wrapper
is the workaround.

Full candidate-search summary
-----------------------------

To close the loop on what was investigated:

.. list-table::
   :widths: 20 15 12 53
   :header-rows: 1

   * - Solver
     - Built?
     - Works?
     - Final position vs Pardiso on SOLID185 modal
   * - **MKL Pardiso**
     - N/A (shipped)
     - ✅
     - **memory leader**, reference
   * - **MUMPS**
     - ✅
     - ✅
     - speed alternative (−18%) at moderate memory cost (+313 MB)
   * - **SPRAL SSIDS**
     - ✅
     - ✅
     - fastest (−8 to −10%) but massive memory cost (+1556 MB at 200×200×2)
   * - **CHOLMOD supernodal**
     - ✅
     - ✅ (w/ MKL preload)
     - −33 MB on K_ff-only, 3× slower; flat on full pipeline
   * - **STRUMPACK (exact)**
     - ✅
     - ✅ (w/ MKL preload + getrf patch)
     - loses on both axes
   * - **STRUMPACK (HSS)**
     - ✅
     - ✅
     - worse memory + degraded accuracy on hex elasticity
   * - **Eigen SimplicialLDLT**
     - ✅
     - ✅
     - disqualified; scalar simplicial loses everywhere
   * - **Eigen CG + IC(0)**
     - ✅
     - ✅
     - **lowest memory (−447 MB at 160×160×2)**; 44× slower, 1e-6 residual
       ceiling
   * - CHOLMOD simplicial
     - ✅
     - ✅
     - slow and dense; no use case
   * - SuperLU 7.0.1 (NixOS)
     - ❌
     - —
     - LU (not SPD-exploiting); skip — won't beat Cholesky solvers
   * - PaStiX
     - ❌
     - —
     - not on NixOS; build-from-source deferred
   * - HSL MA87 / MA97
     - ❌
     - —
     - academic license not obtained
   * - Schenk academic PARDISO 7
     - ❌
     - —
     - fork distinct from Intel's; download + build deferred
   * - Ginkgo / cuDSS / Amesos2
     - ❌
     - —
     - not on NixOS; heavy build cost; deferred

The candidates left in "not attempted" are all multi-hour builds for a
speculative gain. HSL MA87 is the most promising (specifically designed for
low-memory SPD Cholesky and used alongside SPRAL by STFC) but requires
registering for the academic license. PaStiX is a plausible second option,
but its INRIA build system + Scotch + BLAS threading needs serious
packaging work on NixOS.

**Bottom line** after the full sweep:

- **``DirectMklPardisoSolver(ooc=True)`` is the memory winner** — −767 MB
  (−32%) at 200×200×2 with the iparm(23)/(24)-disable-when-OOC patch
  (``fix/mkl-pardiso-ooc-disable-fast-parallel`` branch). Identical
  accuracy, ~2× longer solve. This is the recommended path when peak RSS
  is binding.
- **Keep pypardiso's ``Pardiso`` as the in-core default** — best all-around
  speed/memory for users who don't enable OOC.
- **Expose MUMPS as ``linear_solver="mumps"``** — a clean 15–20% modal
  speedup at +300 MB peak RSS cost. Good for speed-first, RAM-rich users.
- **Expose SPRAL as ``linear_solver="spral"``** — fastest direct solver on
  small/medium sizes, at +1.5 GB memory cost at 200×200×2.
- **Expose Eigen CG+IC0 as ``linear_solver="cg_ic0"``** — an alternate
  memory-reduction path for static only; 1e-6 residual ceiling, 44× slower
  than Pardiso. Worth having for memory-constrained static workflows that
  accept single-precision-ish accuracy.
- **STRUMPACK / CHOLMOD simplicial / Eigen LDLT** — don't land; they lose
  on both axes of the full-pipeline bench.
- **HSL MA87 + PaStiX** — follow-up candidates if we hit the MKL OOC floor
  and need to push further. Both require external source access (HSL has an
  academic-license gate; PaStiX's INRIA GitLab host wasn't in the sandbox's
  trusted-SCM list and would need explicit permission to clone).
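For readers who want to prototype the ``cg_ic0``-style fallback without the Eigen bindings, the same structure can be sketched with scipy, using ``spilu`` as a stand-in incomplete factorisation since scipy ships no IC(0). Illustrative only; this is not the ``pyeigensparse.CGICSolver`` implementation:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def cg_with_incomplete_factor(K, b):
    # Memory profile of the iterative fallback: the matrix, one incomplete
    # factorisation (here ILU as an IC(0) stand-in), and a few CG vectors.
    ilu = spla.spilu(K.tocsc())
    M = spla.LinearOperator(K.shape, matvec=ilu.solve)
    x, info = spla.cg(K, b, M=M)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return x
```

On a well-conditioned test matrix this converges in a handful of iterations; on a large elasticity ``K_ff`` the iteration count grows with the condition number, which is exactly the scaling limit described in the CG section above.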
Supporting artefacts
--------------------

The raw bench data + scripts are under ``mapdl-re/`` (local-only):

- ``pycholmod/bench_vs_pardiso.py`` — cross-solver K_ff-only bench
- ``pycholmod/bench_strumpack_mumps.py`` — STRUMPACK + MUMPS additions
- ``pycholmod/bench_strumpack_hss.py`` — HSS tolerance sweep
- ``pymumps/bench_integrated_static.py`` — full static pipeline
- ``pymumps/bench_integrated_modal.py`` — full modal pipeline
- ``pymumps/test_integrated_modal.py`` — drop-in parity verification
- ``pymumps/src/pymumps/femorph_integration.py`` — the :class:`MumpsSolver`
  adapter itself
- ``SOLVER_COMPARISON.md`` at the workspace root — the summary this page
  was distilled from