Solver exploration (extras)#

This page documents an in-depth investigation into sparse SPD direct solvers as alternatives to MKL Pardiso for femorph-solver’s static and modal pipelines. It’s an “extras” page in the sense that none of it changes the recommended default — that remains Pardiso — but it closes a pile of real questions about why, and points at what to try if your constraints differ from the common case.

Why we looked#

After landing the low-hanging memory-reduction PRs (grid-as-sole-truth, triu K/M assembly, shared sparsity pattern, pypardiso’s A.copy() elision, factor-release on GC), peak RSS on a 200×200×2 SOLID185 modal solve sat at ~2130 MB — about 2× MAPDL’s in-core. The remaining gap is the Pardiso Cholesky factor itself (iparm(17) = 1.14 GB peak during factorisation, 0.39 GB permanent). That’s a hard floor for a direct solver on this mesh topology.

Two plausible ways to move past that floor:

  1. A solver that produces a smaller factor (MUMPS, CHOLMOD, STRUMPACK with HSS compression, SPRAL).

  2. A solver with working out-of-core (OOC) that caps peak RSS regardless of problem size.

MKL Pardiso appeared to ignore iparm(60) OOC on pypardiso’s bundled MKL build — set_iparm(60, 2) warns “no input iparm” and the readback is 0. The investigation below started as a systematic sweep of #1 on that assumption.

Plot twist: it turned out pypardiso’s Python layer was filtering the iparm dict — not MKL itself. A direct ctypes wrapper (agent-3’s DirectMklPardisoSolver, PR #126) honours iparm(60)=2, and the pip-wheel MKL’s OOC works. With a small fix to disable incompatible fast-parallel flags during OOC, we land −767 MB peak (−32%) at 200×200×2 — the biggest memory win of the investigation. See the “Direct MKL Pardiso with working OOC” section below for the details.

The #1 sweep (candidate direct solvers) still produced useful data — it’s how we ruled in MUMPS as a speed alternative and ruled out STRUMPACK HSS / scalar simplicial Cholesky.

Candidate solvers considered#

All are sparse direct SPD Cholesky-class solvers that could plausibly match or beat MKL Pardiso:

| Solver | Origin | Selling point | Final verdict |
|---|---|---|---|
| MKL Pardiso | Intel | Supernodal, multithreaded. Production default. | Memory leader on realistic pipelines |
| MUMPS | CERFACS / IRIT | Multifrontal, real OOC (if compiled in) | Speed alternative for modal; +313 MB memory at 200×200×2 |
| CHOLMOD supernodal | SuiteSparse | Supernodal, uses BLAS-3 heavily | −33 MB at K_ff stage, 3× slower; +8 MB on full pipeline |
| CHOLMOD simplicial | SuiteSparse | Scalar Cholesky; very low memory per the docs | Pathologically slow at scale (19× factor) and denser factor |
| STRUMPACK exact | LBNL | Multifrontal with optional HSS compression | +1 GB vs Pardiso; defeated before HSS even kicks in |
| STRUMPACK HSS | LBNL | 2–4× factor compression on elliptic problems (theoretical) | HSS ratio on 3D hex elasticity is negative — more memory AND lower accuracy |
| Eigen SimplicialLDLT | Guennebaud et al. | Header-only, no external deps | Slowest; disqualified on both axes |
| SPRAL SSIDS | STFC UK | Modern multithreaded, low memory footprint per the literature | ssids_factor returns flag = -53 until OMP_CANCELLATION=TRUE is exported; once running, fast but memory-heavy (see the SPRAL section below) |
| PaStiX | INRIA | Static scheduling, Scotch ordering | Not on NixOS; not attempted |
| HSL MA87 | Harwell / STFC | Designed for minimum memory on SPD Cholesky | Academic license; not attempted |

The NixOS OpenBLAS 0.3.30 bug#

Any non-Pardiso solver that performs supernodal-style dense LAPACK kernels from inside its own elimination tree immediately breaks when linked against NixOS’s openblas-0.3.30. The symptom is a cascading series of xerbla-printed errors followed by either a stack-smash detection or a segfault:

** On entry to DGETRF parameter number  4 had an illegal value
** On entry to DPOTRF parameter number  4 had an illegal value
** On entry to DTRSM  parameter number 11 had an illegal value
** On entry to DGEMM  parameter number 13 had an illegal value
*** stack smashing detected ***: terminated

This was reproduced on three separate solvers (STRUMPACK in both 7.2 and 8.0, CHOLMOD supernodal, and MUMPS) — all failing with the same pattern. ulimit -s unlimited, OMP_NUM_THREADS=1, and a GCC-only toolchain (no Clang/libomp mixing) did not help.

Workaround: LD_PRELOAD=libmkl_rt.so.2 before importing the affected solver. MKL’s LAPACK handles the call patterns that NixOS’s OpenBLAS 0.3.30 chokes on. Once preloaded, every tested solver runs cleanly with residuals in the 1e-10 to 1e-15 band.
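Locating the wheel-bundled MKL for the preload can be scripted. A minimal sketch, assuming the usual site-packages layout (the search path and helper name are ours, not part of any package). Note that LD_PRELOAD must be exported before the interpreter starts — setting it from inside a running process has no effect on that process:

```python
import glob
import os
import sysconfig
from typing import Optional

def find_bundled_mkl() -> Optional[str]:
    """Search site-packages for a pip-wheel libmkl_rt shared object."""
    roots = {sysconfig.get_paths()["purelib"], sysconfig.get_paths()["platlib"]}
    for root in roots:
        hits = glob.glob(os.path.join(root, "**", "libmkl_rt.so*"), recursive=True)
        if hits:
            return sorted(hits)[0]  # deterministic pick if several versions exist
    return None

# LD_PRELOAD has to be in the environment before Python starts, so the natural
# home for this helper is a launcher that re-invokes the real script, e.g.:
#   LD_PRELOAD=$(python -c '...print(find_bundled_mkl())') python bench.py
```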

Root-cause-chasing with AddressSanitizer localised the actual read fault in STRUMPACK to dlaswp_plus() inside OpenBLAS — row-swap during the solve phase dereferences a pivot-index array that dgetrf() returned with undefined contents on degenerate fronts. Patches to STRUMPACK’s BLASLAPACKWrapper.cpp::getrf and DenseMatrix::laswp guard that specific corruption, but the remaining illegal-argument messages from dgemm, dpotrf, and dtrsm indicate the bug lives at several OpenBLAS call sites, not one. Fixing this for the whole ecosystem means either a new OpenBLAS release or a NixOS repackage — filing upstream is the right move.

Benchmark methodology#

Each solver was benchmarked in three regimes on the SOLID185 flat plate that femorph_solver.perf.bench_memory uses (the “perf” bench fixture):

  1. K_ff-only microbench — assemble K_ff and M_ff via Model._bc_reduce_and_release(), factor K_ff once, solve once, capture peak RSS. Measures raw solver overhead without pipeline noise.

  2. Static pipeline — full model.solve(linear_solver=...) with a surface load, from fresh Model.from_grid through displacement return. Measures the solver’s integration with assembly + BC reduce + solve.

  3. Modal pipeline — full model.modal_solve(n_modes=10, eigen_solver='arpack', linear_solver=...) at σ=0 shift-invert. ARPACK drives ~30–50 inner solves per outer iteration, so this is where the solver’s solve-phase overhead accumulates.

Each row runs in a fresh subprocess so ru_maxrss is clean. Thread cap: 4 (MAPDL parity). Environment preloads MKL:

OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 \
  OMP_WAIT_POLICY=PASSIVE KMP_BLOCKTIME=0 \
  LD_PRELOAD=<path>/libmkl_rt.so.2 \
  python perf/bench_solvers.py 

Sizes: 40×40×2, 80×80×2, 120×120×2, 160×160×2, 200×200×2.
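The fresh-subprocess pattern can be sketched as follows — the helper name and the JSON hand-off are illustrative, not the actual perf/bench_solvers.py implementation:

```python
import json
import os
import subprocess
import sys
import textwrap

def bench_in_subprocess(snippet, env_extra=None):
    """Run a one-line bench snippet in a fresh interpreter so ru_maxrss
    reflects only that run. The snippet fills a dict named `result`; wall
    time and peak RSS are appended and the dict comes back as JSON."""
    wrapper = textwrap.dedent(f"""
        import json, resource, time
        t0 = time.perf_counter()
        result = {{}}
        {snippet}
        result["wall_s"] = time.perf_counter() - t0
        # ru_maxrss is reported in KiB on Linux
        result["peak_mb"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        print(json.dumps(result))
    """)
    env = {**os.environ, **(env_extra or {})}
    out = subprocess.run([sys.executable, "-c", wrapper],
                         env=env, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)
```

Each cell in the tables below corresponds to one such subprocess invocation, with the thread caps and LD_PRELOAD passed via env_extra.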

Results — K_ff-only microbench#

Factor wall time (seconds):

| size | Pardiso | MUMPS | CHOLMOD-sup | STRUMPACK | CHOLMOD-simp | Eigen LDLT |
|---|---|---|---|---|---|---|
| 40×40×2 | 0.32 | 0.17 | 0.09 | 0.38 | 0.69 | 0.31 |
| 80×80×2 | 0.63 | 0.82 | 0.59 | 1.09 | 3.12 | 5.01 |
| 120×120×2 | 1.31 | 2.69 | 1.73 | 4.43 | 8.75 | 32.6 |
| 160×160×2 | 2.23 | 3.51 | 7.22 | 8.04 | 36.6 | 49.3 |
| 200×200×2 | 3.48 | 4.38 | 10.7 | 16.8 | 48.4 | 127.6 |

Peak RSS (MB) and Δ vs Pardiso:

| size | Pardiso | MUMPS Δ | CHOLMOD-sup Δ | STRUMPACK Δ | CHOLMOD-simp Δ | Eigen Δ |
|---|---|---|---|---|---|---|
| 40×40×2 | 261 | −8 | −9 | +18 | −10 | −18 |
| 80×80×2 | 504 | −24 | −14 | +117 | +14 | +58 |
| 120×120×2 | 930 | −31 | −27 | +291 | +49 | +211 |
| 160×160×2 | 1538 | −46 | −30 | +585 | +98 | +531 |
| 200×200×2 | 2379 | −75 | −33 | +1021 | +223 | +937 |

At this measurement scale MUMPS saves 75 MB over Pardiso at 200×200×2. This turned out to be misleading — see the next section.

Results — full static pipeline#

| size | Pardiso wall | Pardiso peak | MUMPS wall | MUMPS peak | Δ wall | Δ peak |
|---|---|---|---|---|---|---|
| 80×80×2 | 0.69 s | 580 MB | 0.62 s | 594 MB | −10% | +14 MB |
| 120×120×2 | 1.47 s | 1091 MB | 1.71 s | 1121 MB | +16% | +30 MB |
| 160×160×2 | 2.66 s | 1816 MB | 3.19 s | 1852 MB | +20% | +36 MB |
| 200×200×2 | 4.28 s | 2772 MB | 5.46 s | 2853 MB | +28% | +81 MB |

Displacement fields match bit-for-bit. Pardiso wins both axes on the full static pipeline.

Results — full modal pipeline#

10 modes via ARPACK shift-invert at σ=0:

| size | Pardiso wall | Pardiso peak | MUMPS wall | MUMPS peak | Δ wall | Δ peak |
|---|---|---|---|---|---|---|
| 80×80×2 | 1.98 s | 476 MB | 1.54 s | 516 MB | −22% | +40 MB |
| 120×120×2 | 5.25 s | 849 MB | 5.03 s | 959 MB | −4% | +110 MB |
| 160×160×2 | 9.37 s | 1402 MB | 7.91 s | 1592 MB | −16% | +190 MB |
| 200×200×2 | 15.19 s | 2138 MB | 12.46 s | 2451 MB | −18% | +313 MB |

MUMPS is consistently 15–20% faster on modal at the cost of +200–300 MB peak RSS. The K_ff-only microbench’s −75 MB reading does not generalise — MUMPS’s solve-phase workspace accumulates across ARPACK’s ~30 inner solves, more than erasing the single-solve win.

Drop-in parity: modal frequencies from linear_solver="mumps" and linear_solver="pardiso" agree to max relative drift 2.5 × 10⁻¹⁰ across all 10 modes on a 40×40×2 SOLID185 cantilever. Well below the 1e-7 T1-P10 parity target.
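The parity check reduces to a max-relative-drift computation over the mode list. A minimal sketch — function names are ours, not femorph-solver’s:

```python
import numpy as np

def max_relative_drift(freqs_a, freqs_b):
    """Largest |f_a − f_b| / |f_a| over two frequency lists of equal length."""
    a = np.asarray(freqs_a, dtype=float)
    b = np.asarray(freqs_b, dtype=float)
    return float(np.max(np.abs(a - b) / np.abs(a)))

def assert_modal_parity(freqs_pardiso, freqs_alt, tol=1e-7):
    """Gate against the T1-P10 parity target (1e-7 relative)."""
    drift = max_relative_drift(freqs_pardiso, freqs_alt)
    assert drift < tol, f"modal drift {drift:.3e} exceeds {tol:.0e}"
```

The measured 2.5 × 10⁻¹⁰ drift clears this gate with almost three orders of magnitude to spare.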

STRUMPACK HSS — why the headline feature didn’t land#

STRUMPACK’s HSS (Hierarchically Semi-Separable) compression is why we spent the most effort there. The literature claims 2–4× factor compression for 3D elliptic problems — if that held for SOLID185, STRUMPACK would undercut Pardiso by several hundred MB.

Measured on the same mesh class, compression_tol = 1e-6:

| size | HSS 1e-6 peak (MB) | HSS 1e-6 residual | HSS 1e-8 peak (MB) | HSS 1e-8 residual | Pardiso peak (MB) |
|---|---|---|---|---|---|
| 40×40×2 | 279 | 3.9e-10 | 278 | 3.9e-10 | 261 |
| 80×80×2 | 621 | 3.8e-10 | 620 | 3.8e-10 | 504 |
| 120×120×2 | 1182 | 4.0e-05 | 1185 | 3.0e-09 | 930 |
| 160×160×2 | 2054 | 6.3e-08 | 2056 | 1.4e-06 | 1538 |
| 200×200×2 | 3244 | 5.1e-06 | 3251 | 2.5e-06 | 2380 |

Two problems compound:

  1. Peak RSS exceeds Pardiso at every size. STRUMPACK’s own working set plus the compressed factor together outgrow Pardiso’s supernodal factor.

  2. Accuracy degrades to 10⁻⁶ at large sizes regardless of the compression_tol setting. For modal ARPACK shift-invert (which needs ~1e-10 inner-solve accuracy at tol=1e-12) this would bleed into the outer eigenvalue accuracy.

SOLID185 is a first-order hex element. Its factor fronts have high off-diagonal rank — the elasticity Green’s function doesn’t produce the low-rank block structure HSS compression exploits. HSS’s sweet spot is elliptic integral operators and smooth-kernel discretisations, not structural FEM on uniform hex meshes.

This isn’t a build or configuration issue. It’s STRUMPACK telling us honestly that HSS has nothing to compress on our mesh class.

CHOLMOD supernodal — memory-competitive, always slower#

CHOLMOD supernodal (via scikit-sparse or a direct nanobind binding) runs once the NixOS OpenBLAS bug is sidestepped with MKL preload. On K_ff-only it saves ~30 MB vs Pardiso at 200×200×2 but takes 3× longer to factor. On the full modal pipeline the advantage disappears entirely.

Useful as a fallback for build environments where MKL isn’t linkable but a working OpenBLAS is — CHOLMOD is less picky about its BLAS than the supernodal-in-Fortran solvers (STRUMPACK, MUMPS).

Eigen SimplicialLDLT — the “no external deps” option#

Eigen’s SimplicialLDLT<SparseMatrix<double>, Lower> is header-only and runs cleanly on any BLAS. At 40×40×2 it’s actually the smallest peak RSS of everything tested (230 MB vs Pardiso’s 261). Above that it loses on both axes — scalar simplicial Cholesky with AMD ordering produces a denser factor than supernodal with METIS, and runs single-threaded. At 200×200×2 it’s 176 seconds and 3316 MB — 50× slower than Pardiso and using 937 MB more memory.

Useful only as a last-resort fallback for environments where nothing else links. Good to know it’s correct; not something to use in anger.

Honest recommendation#

Stay on Pardiso. This investigation didn’t find a solver that is strictly better than Pardiso for femorph-solver’s workload on this class of mesh. The full-pipeline numbers are the ones that matter — not K_ff-only microbenches — and on those Pardiso is the memory leader by 80–300 MB at 200×200×2.

Use MUMPS if you need 15–20 % faster modal and have RAM to spare. It’s a drop-in via a 60-line MumpsSolver adapter and produces the same frequencies Pardiso does to 2.5 × 10⁻¹⁰.
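The 60-line adapter reduces to a small contract: factor once at construction, solve repeatedly afterwards. A sketch of that shape using SciPy’s SuperLU as a stand-in factorisation (the real MumpsSolver wraps MUMPS the same way; the class name here is illustrative):

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

class SuperLuAdapter:
    """Factor-once / solve-many contract that the solver registry expects."""

    def __init__(self, A):
        # Factor eagerly so repeated ARPACK inner solves reuse the factor.
        self._lu = spla.splu(sp.csc_matrix(A))

    def solve(self, b):
        return self._lu.solve(b)
```

MumpsSolver implements the same two methods around MUMPS’s analyse/factorise/solve phases.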

Don’t bother with STRUMPACK HSS on hex elasticity meshes — the compression ratio is negative and the error is high.

Skip CHOLMOD simplicial and Eigen LDLT for any mesh beyond “trivial” — they’re either slow, memory-hungry, or both.

Build environment notes (for NixOS / nix-like systems)#

  • gfortran is often not on PATH when GCC is — export /nix/store/.../gfortran-wrapper-*/bin before any build that needs Fortran.

  • NixOS GCC’s -fstack-protector-strong + -D_FORTIFY_SOURCE=2 default mask VLA bugs in STRUMPACK’s multifrontal path. Strip with -U_FORTIFY_SOURCE -fno-stack-protector to surface the real failure at its source (not the canary-check).

  • Mixing Clang libomp and GCC libgomp races over thread state. Pick one runtime end-to-end; on NixOS that means building C++ with CC=gcc CXX=g++ since gfortran’s libgomp is the default and OpenBLAS links against it.

  • NixOS OpenBLAS 0.3.30 has a real bug affecting sparse-solver supernodal kernels. Workaround: LD_PRELOAD=libmkl_rt.so.2.

SPRAL SSIDS — fast but memory-heavy on hex elasticity#

SPRAL was the single remaining candidate that might beat Pardiso on memory. Its published benchmarks show it matching HSL MA87 and beating MUMPS on several metrics while maintaining lower peak RSS than CHOLMOD on sparse SPD problems.

Getting SPRAL to run: the SSIDS CPU factor requires the OpenMP cancellation feature, which GCC’s libgomp keeps off by default. The solver returns inform%flag = -53 — undocumented numerically but prints SSIDS CPU code requires OMP cancellation to be enabled at print_level=10. Fix: export OMP_CANCELLATION=TRUE before libgomp initialises (i.e. before numpy / scipy import). Setting it from Python via os.environ post-import is a no-op because libgomp has already cached the runtime. Subprocess launchers need it in the env dict passed to Popen.
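A launcher sketch for the env-before-libgomp requirement — the helper name is ours, and the preload path is whatever libmkl_rt you resolved earlier:

```python
import os
import subprocess
import sys

def run_with_spral_env(cmd, mkl_preload=None):
    """Launch `cmd` in a child whose environment enables OMP cancellation
    before libgomp initialises. Setting os.environ after numpy/scipy import
    in the *current* process is too late; the child gets a clean slate."""
    env = dict(os.environ)
    env["OMP_CANCELLATION"] = "TRUE"
    if mkl_preload:  # e.g. the libmkl_rt.so.2 workaround for OpenBLAS 0.3.30
        env["LD_PRELOAD"] = mkl_preload
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```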

Measured results on the SOLID185 modal pipeline (4 threads, MKL preload, posdef=true):

| size | Pardiso wall | Pardiso peak | SPRAL wall | SPRAL peak | Δ peak vs Pardiso |
|---|---|---|---|---|---|
| 80×80×2 | 1.98 s | 476 MB | 1.88 s | 752 MB | +276 MB |
| 120×120×2 | 5.25 s | 849 MB | 4.64 s | 1450 MB | +601 MB |
| 160×160×2 | 9.37 s | 1402 MB | 8.46 s | 2475 MB | +1073 MB |
| 200×200×2 | 15.19 s | 2138 MB | 13.66 s | 3694 MB | +1556 MB |

SPRAL beats Pardiso on wall time at every mesh size tested (though MUMPS remains quicker at most sizes — compare the modal table above). But the peak-RSS cost is catastrophic: +276 MB at 80×80×2 widens to +1556 MB (+73%) at 200×200×2. SPRAL’s multifrontal factor allocates a much larger working set than Pardiso’s supernodal packing; the published memory advantages don’t hold on hex elasticity stiffness matrices.

So despite SPRAL being the last plausible memory candidate, it doesn’t deliver memory savings here either. It joins MUMPS as a speed alternative — if the user values modal wall time and has RAM headroom, SPRAL is 5–12% faster than Pardiso across sizes. If memory is binding, stay on Pardiso.

Three-way speed comparison at 200×200×2:

| solver | wall time | peak RSS | summary |
|---|---|---|---|
| Pardiso | 15.19 s | 2138 MB | baseline, memory leader |
| MUMPS | 12.46 s (−18%) | 2451 MB (+313 MB) | balanced speed / memory tradeoff |
| SPRAL | 13.66 s (−10%) | 3694 MB (+1556 MB) | faster than Pardiso at every size, worst memory |

Final recommendation unchanged: Pardiso stays the default. The two viable alternatives are both speed wins at memory cost:

  • MUMPS for a moderate speed boost (−15–20%) at moderate memory cost (+200–300 MB) — best general-purpose alternative.

  • SPRAL for comparable speed (5–12% faster than Pardiso) at much higher memory cost (+300–1500 MB) — only for memory-abundant setups.

SPRAL integration adapter lives at mapdl-re/pyspral/src/pyspral/femorph_integration.py. Drop-in via:

_REGISTRY["spral"] = SpralSolver
model.modal_solve(linear_solver="spral")

Requires OMP_CANCELLATION=TRUE in the environment and LD_PRELOAD=libmkl_rt.so.2 to sidestep the NixOS OpenBLAS 0.3.30 cascade.

Direct MKL Pardiso with working OOC — the biggest memory win#

pypardiso silently swallows iparm(60) (the OOC mode selector). Call set_iparm(60, 2) through pypardiso and it warns “2 is no input iparm”; the readback after factorisation shows 0. The assumption that this was an MKL limitation of the pip-wheel build turned out to be wrong — it’s pypardiso’s Python layer filtering the iparm dict.

Agent-3’s DirectMklPardisoSolver (in-tree at femorph_solver.solvers.linear._mkl_pardiso, PR #126) calls pardiso_() via ctypes directly. Bypassing pypardiso unlocks the full iparm array — including iparm(60)=2 for out-of-core factor. The pip-wheel MKL’s OOC implementation works as long as MKL_PARDISO_OOC_PATH points at a writable directory.

Design highlights of the in-tree solver:

  • Reuses pypardiso’s already-loaded MKL handle via pypardiso.PyPardisoSolver(mtype=11).libmkl — no duplicate dlopen, same resolution heuristic.

  • iparm(34)=1 — 0-based CSR indices eliminate the +1 rewrite copies pypardiso does on every solve (~65 MB saved per solve at 150×150×2).

  • iparm(23)=1 + iparm(24)=2 — “improved 2-level parallel” factorisation + 2-level solve, roughly halves solve wall time vs the classical path.

  • Phase 12 in one call (analyse + factorise).

  • Phase -1 on __del__ — same MKL factor-leak fix as PR #107.

OOC bug uncovered during this investigation: on meshes at or above 160×160×2 (n_dof ≥ 231 840), ooc=True returned MKL error -2 (insufficient memory) at phase=12. Root cause: the iparm(23)=1 + iparm(24)=2 fast-parallel allocations are incompatible with iparm(60)=2 OOC on this MKL build — the per-thread scratch sized for the whole frontal matrix is what OOC refuses to spill.

Fix (local branch fix/mkl-pardiso-ooc-disable-fast-parallel): drop iparm(23)/(24) to the classical parallel path when OOC is requested. Costs ~2× solve wall time in exchange for OOC actually working at scale. With the patch applied:

Agent-3 DirectMklPardisoSolver — in-core vs OOC#

size

in-core peak

OOC peak

Δ peak

in-core factor

OOC factor

80×80×2

505 MB

511 MB

+6 MB

0.38 s

0.67 s

120×120×2

920 MB

935 MB

+15 MB

0.82 s

1.26 s

160×160×2

1526 MB

1265 MB

−261 MB (−17%)

1.73 s

2.49 s

200×200×2

2371 MB

1604 MB

−767 MB (−32%)

2.50 s

4.12 s

This is the single biggest SPD-solver memory reduction measured across the entire investigation. The gap widens with problem size — at small meshes OOC equals in-core because the factor fits in MAX_CORE_SIZE; at large meshes OOC streams the overflow to disk, capping peak RSS near MAX_CORE_SIZE + K / M / eigensolve workspace.

Accuracy is unchanged (residual 4.3e-11 on both at 200×200×2); factorisation is ~65% slower and solves are 2–3× slower because disk IO sits in the critical path.
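The disable-when-OOC guard from the fix branch reduces to a small override map. A dict-based sketch of that logic, using the iparm numbering as written on this page (illustrative only — the real solver writes these into MKL’s 64-slot array):

```python
def pardiso_iparm_overrides(ooc: bool) -> dict:
    """Non-default iparm entries, keyed by the index convention used above."""
    ip = {
        34: 1,  # 0-based CSR indices: skip pypardiso's +1 rewrite copies
        23: 1,  # improved two-level parallel factorisation
        24: 2,  # two-level parallel solve
    }
    if ooc:
        ip[60] = 2            # full out-of-core factor
        ip[23] = ip[24] = 0   # fast-parallel scratch cannot spill under OOC
    return ip
```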

Enablement:

# Required env vars — must be set before first MKL call.
export MKL_PARDISO_OOC_PATH=/tmp/mkl_pardiso_ooc
export MKL_PARDISO_OOC_MAX_CORE_SIZE=500

# In Python
from femorph_solver.solvers.linear._mkl_pardiso import DirectMklPardisoSolver
lu = DirectMklPardisoSolver(K_ff, assume_spd=True, ooc=True)

Or via the registry when the fix lands:

model.modal_solve(linear_solver="mkl_direct_ooc")  # future registration

Tradeoff context at 200×200×2:

  • Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor

  • Pardiso (DirectMklPardiso, in-core): 2371 MB peak, 2.50 s factor

  • Pardiso (DirectMklPardiso, OOC): 1604 MB peak, 4.12 s factor

  • MUMPS: 2304 MB peak, 4.38 s factor

  • SPRAL: 3694 MB peak, 13.66 s factor

  • Eigen CG+IC0: 1091 MB peak at 160×160×2, 44× slower

Direct-MKL OOC delivers the largest peak-RSS reduction of any direct solver tested while remaining competitive on wall time (~2× slower than in-core Pardiso, still faster than MUMPS). This is the recommended memory-constrained path for femorph-solver.

Iterative fallback — Eigen CG with IncompleteCholesky#

After exhausting direct solvers, the final angle was preconditioned conjugate gradient with incomplete Cholesky (IC(0)) — pure iterative, no factor to store. Memory is just the matrix, the IC preconditioner (roughly the size of triu K), and a few CG basis vectors. If it converges in reasonable time, this is the single lowest-memory option for SPD.

Measured on the K_ff-only pipeline (tol = 1e-6, max 100 000 iterations):

| size | n_dof | CG wall | CG iters | CG peak | Pardiso peak |
|---|---|---|---|---|---|
| 40×40×2 | 14 760 | 1.35 s | 1 468 | 211 MB | 261 MB (−50) |
| 80×80×2 | 58 320 | 17.8 s | 3 368 | 387 MB | 504 MB (−117) |
| 120×120×2 | 130 680 | 43.6 s | 3 325 | 683 MB | 930 MB (−247) |
| 160×160×2 | 231 840 | 97.9 s | 4 242 | 1091 MB | 1538 MB (−447) |

At 160×160×2 CG+IC(0) uses 29% less memory than Pardiso. The gap widens with problem size — this is the only candidate that actually shrinks faster than Pardiso in absolute terms as the mesh grows.

The fatal catches:

  1. 44× slower than Pardiso at 160×160×2 (97.9 s vs 2.23 s). The iteration count scales with \(\sqrt{\mathrm{cond}(K)}\), which for structured hex elasticity grows like \(h^{-1}\). Not scalable.

  2. Residual floor at 1e-6 — the IC(0) preconditioner isn’t strong enough to drive CG below 1e-6 in any reasonable iteration budget on elasticity stiffness. ARPACK shift-invert at tolerance 1e-12 needs the inner solver at ~1e-10 so modal is out of reach; a 1e-6 residual is fine for static solves where the user accepts single-precision-ish accuracy.

Use cases:

  • Static solve on a severely memory-constrained machine where shaving 29% off peak RSS matters more than a 40× speed hit and the user accepts 1e-6 accuracy. Not the default — a niche override.

  • Not viable for modal — the residual ceiling would bleed into ARPACK’s eigenvalue accuracy.

This is genuinely the only memory-saving option we found, and the tradeoff is steep. Exposed via pyeigensparse.CGICSolver; bindings live at mapdl-re/pyeigensparse/.

Direct MKL Pardiso wrapper — unlocking OOC#

pypardiso silently swallows iparm(60) (the OOC mode selector) — when you call set_iparm(60, 2) it warns “2 is no input iparm” and the readback after factorisation shows 0. We suspected this was an MKL limitation in the pip-wheel build.

It wasn’t. It was pypardiso’s Python layer. A 300-line ctypes wrapper calling pardiso_() directly (no pypardiso) honours every iparm we set — including iparm(60)=2 (full out-of-core). The pip-wheel MKL’s OOC works as long as MKL_PARDISO_OOC_PATH points at a writable directory.

Package: mapdl-re/pymklpardiso/ — direct pardiso_() wrapper via nanobind.

Measured on the SOLID185 K_ff-only bench (4 threads, MAX_CORE_SIZE=500 MB):

size

in-core peak

OOC peak

Δ peak

in-core factor

OOC factor

80×80×2

487 MB

487 MB

0 (fits in core)

0.30 s

0.58 s

120×120×2

906 MB

911 MB

+5 MB

0.74 s

1.28 s

160×160×2

1497 MB

1214 MB

−283 MB (−19%)

1.41 s

2.62 s

200×200×2

2338 MB

1540 MB

−798 MB (−34%)

2.43 s

4.24 s

This is the biggest memory win of the investigation. The gap grows with problem size:

  • Small meshes (<500 MB factor): OOC equals in-core — the core budget is big enough to hold everything, MKL never spills.

  • Large meshes (>500 MB factor): OOC streams the overflow to disk, capping peak RSS near MAX_CORE_SIZE + K / M / eigensolve workspace.

Residual is identical (4.3e-11 on both at 200×200×2) — no accuracy cost. Factor time is ~70% slower because disk IO is in the critical path; solve time is ~6× slower for the same reason (each back-solve touches the on-disk factor blocks).

Tradeoff calculus at 200×200×2:

  • Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor.

  • Pardiso (our direct wrapper, OOC): 1540 MB peak, 4.24 s factor.

  • MUMPS: 2304 MB peak, 4.38 s factor.

  • SPRAL: 3694 MB peak, 13.66 s factor.

The direct-MKL OOC path beats every other solver tested on peak RSS while remaining competitive on wall time.

Integration: pymklpardiso/src/pymklpardiso/femorph_integration.py exposes two adapter classes:

  • MklPardisoOOCSolver — iparm(60)=2, name mklp_ooc. Available only when MKL_PARDISO_OOC_PATH is set.

  • MklPardisoInCoreSolver — iparm(60)=0, name mklp_direct. Useful for A/B benches against pypardiso without OOC noise.

Users opt in via:

from pymklpardiso.femorph_integration import MklPardisoOOCSolver
_REGISTRY["mklp_ooc"] = MklPardisoOOCSolver
# Set MKL_PARDISO_OOC_PATH + MKL_PARDISO_OOC_MAX_CORE_SIZE in env
model.modal_solve(linear_solver="mklp_ooc")

The two required env variables mean available() returns False by default — users explicitly opt in by exporting them.
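The gate itself is just an environment check. A sketch of the described behaviour (available() is the adapter’s own hook; the exact semantics in femorph_integration.py may differ):

```python
import os

def ooc_available() -> bool:
    """Opt-in gate: both OOC env vars must be exported for mklp_ooc to register."""
    return bool(os.environ.get("MKL_PARDISO_OOC_PATH")) and \
           bool(os.environ.get("MKL_PARDISO_OOC_MAX_CORE_SIZE"))
```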

Note on the pypardiso vs direct-MKL behavioural difference: we don’t yet know exactly what pypardiso does to iparm(60), only that the readback differs. Worth filing an upstream issue with pypardiso — the current behaviour silently downgrades OOC without telling the caller. Our wrapper is the workaround.

Full candidate-search summary#

To close the loop on what was investigated:

| Solver | Built / works? | Final position vs Pardiso on SOLID185 modal |
|---|---|---|
| MKL Pardiso | N/A (shipped) | memory leader, reference |
| MUMPS | ✅ | speed alternative (−18%) at moderate memory cost (+313 MB) |
| SPRAL SSIDS | ✅ (w/ OMP_CANCELLATION=TRUE) | fast (−5 to −12%) but massive memory cost (+1556 MB at 200×200×2) |
| CHOLMOD supernodal | ✅ (w/ MKL preload) | −33 MB on K_ff-only, 3× slower; flat on full pipeline |
| STRUMPACK (exact) | ✅ (w/ MKL preload + getrf patch) | loses on both axes |
| STRUMPACK (HSS) | ✅ (same build) | worse memory + degraded accuracy on hex elasticity |
| Eigen SimplicialLDLT | ✅ | disqualified; scalar simplicial loses everywhere |
| Eigen CG + IC(0) | ✅ | lowest memory (−447 MB at 160×160×2); 44× slower, 1e-6 residual ceiling |
| CHOLMOD simplicial | ✅ | slow and dense; no use case |
| SuperLU 7.0.1 (NixOS) | — | LU (not SPD-exploiting); skip — won’t beat Cholesky solvers |
| PaStiX | ❌ | not on NixOS; build-from-source deferred |
| HSL MA87 / MA97 | ❌ | academic license not obtained |
| Schenk academic PARDISO 7 | ❌ | different-from-Intel fork; download + build deferred |
| Ginkgo / cuDSS / Amesos2 | ❌ | not on NixOS; heavy build cost; deferred |

The candidates left in “not attempted” are all multi-hour builds for a speculative gain — HSL MA87 is the most promising (specifically designed for low-memory SPD Cholesky and used alongside SPRAL by STFC) but requires registering for the academic license. PaStiX is a plausible second option but its INRIA build system + Scotch + BLAS threading requires serious packaging work on NixOS.

Bottom line after the full sweep:

  • ``DirectMklPardisoSolver(ooc=True)`` is the memory winner: −767 MB (−32%) at 200×200×2 with the iparm(23)/(24) disable-when-OOC patch (fix/mkl-pardiso-ooc-disable-fast-parallel branch). Identical accuracy, ~2× longer solve. This is the recommended path when peak RSS is binding.

  • Keep pypardiso’s ``Pardiso`` as the in-core default — best all-around speed/memory for users who don’t enable OOC.

  • Expose MUMPS as ``linear_solver="mumps"`` — clean 15–20% modal speedup at +300 MB peak RSS cost. Good for speed-first, RAM-rich users.

  • Expose SPRAL as ``linear_solver="spral"`` — among the fastest direct solvers on small/medium sizes, +1.5 GB memory cost at 200×200×2.

  • Expose Eigen CG+IC0 as ``linear_solver="cg_ic0"`` — alternate memory-reduction path for static only; 1e-6 residual ceiling, 44× slower than Pardiso. Worth having for memory-constrained static workflows that accept single-precision-ish accuracy.

  • STRUMPACK / CHOLMOD simplicial / Eigen LDLT — don’t land; they lose on both axes of the full-pipeline bench.

  • HSL MA87 + PaStiX — follow-up candidates if we hit the MKL OOC floor and need to push further. Both require external source access (HSL has academic-license gate, PaStiX’s INRIA GitLab host wasn’t in the sandbox’s trusted-SCM list and would need explicit permission to clone).

Supporting artefacts#

The raw bench data + scripts are under mapdl-re/ (local-only):

  • pycholmod/bench_vs_pardiso.py — cross-solver K_ff-only bench

  • pycholmod/bench_strumpack_mumps.py — STRUMPACK + MUMPS additions

  • pycholmod/bench_strumpack_hss.py — HSS tolerance sweep

  • pymumps/bench_integrated_static.py — full static pipeline

  • pymumps/bench_integrated_modal.py — full modal pipeline

  • pymumps/test_integrated_modal.py — drop-in parity verification

  • pymumps/src/pymumps/femorph_integration.py — the MumpsSolver adapter itself

  • SOLVER_COMPARISON.md at the workspace root — the summary that this page was distilled from