Solver exploration (extras)#

This page documents an in-depth investigation into sparse SPD direct solvers as alternatives to MKL Pardiso for femorph-solver’s static and modal pipelines. It’s an “extras” page in the sense that none of it changes the recommended default — that remains Pardiso — but it closes a pile of real questions about why, and points at what to try if your constraints differ from the common case.

Why we looked#

After landing the low-hanging memory-reduction PRs (grid-as-sole-truth, triu K/M assembly, shared sparsity pattern, pypardiso’s A.copy() elision, factor-release on GC), peak RSS on a 200×200×2 SOLID185 modal solve sat at ~2130 MB — about 2× MAPDL’s in-core. The remaining gap is the Pardiso Cholesky factor itself (iparm(17) = 1.14 GB peak during factorisation, 0.39 GB permanent). That’s a hard floor for a direct solver on this mesh topology.

Two plausible ways to move past that floor:

  1. A solver that produces a smaller factor (MUMPS, CHOLMOD, STRUMPACK with HSS compression, SPRAL).

  2. A solver with working out-of-core (OOC) that caps peak RSS regardless of problem size.

MKL Pardiso appeared to ignore iparm(60) OOC on pypardiso’s bundled MKL build — set_iparm(60, 2) warns “no input iparm” and the readback is 0. The investigation below started as a systematic sweep of #1 on that assumption.

Plot twist: it turned out pypardiso’s Python layer was filtering the iparm dict — not MKL itself. A direct ctypes wrapper (agent-3’s DirectMklPardisoSolver, PR #126) honours iparm(60)=2, and the pip-wheel MKL’s OOC works. With a small fix to disable incompatible fast-parallel flags during OOC, we land −767 MB peak (−32%) at 200×200×2 — the biggest memory win of the investigation. See the “Direct MKL Pardiso with working OOC” section below for the details.

The #1 sweep (candidate direct solvers) still produced useful data — it’s how we ruled in MUMPS as a speed alternative and ruled out STRUMPACK HSS / scalar simplicial Cholesky.

Candidate solvers considered#

All are sparse direct SPD Cholesky-class solvers that could plausibly match or beat MKL Pardiso:

| Solver | Origin | Selling point | Final verdict |
|---|---|---|---|
| MKL Pardiso | Intel | Supernodal, multithreaded. Production default. | Memory leader on realistic pipelines |
| MUMPS | CERFACS / IRIT | Multifrontal, real OOC (if compiled in) | Speed alternative for modal; +313 MB memory at 200×200×2 |
| CHOLMOD supernodal | SuiteSparse | Supernodal, uses BLAS-3 heavily | −33 MB at K_ff stage, 3× slower; +8 MB on full pipeline |
| CHOLMOD simplicial | SuiteSparse | Scalar Cholesky; very low memory per the docs | Pathologically slow at scale (19× factor) and denser factor |
| STRUMPACK exact | LBNL | Multifrontal with optional HSS compression | +1 GB vs Pardiso; defeated before HSS even kicks in |
| STRUMPACK HSS | LBNL | 2–4× factor compression on elliptic problems (theoretical) | HSS ratio on 3D hex elasticity is negative — more memory AND lower accuracy |
| Eigen SimplicialLDLT | Guennebaud et al. | Header-only, no external deps | Slowest; disqualified on both axes |
| SPRAL SSIDS | STFC UK | Modern multithreaded, low memory footprint per the literature | ssids_factor returns flag = -53 until OMP_CANCELLATION=TRUE is exported; once running, fast but memory-heavy (see the SPRAL section below) |
| PaStiX | INRIA | Static scheduling, Scotch ordering | Not on NixOS; not attempted |
| HSL MA87 | Harwell / STFC | Designed for minimum memory on SPD Cholesky | Academic license; not attempted |

The NixOS OpenBLAS 0.3.30 bug#

Any non-Pardiso solver that performs supernodal-style dense LAPACK kernels from inside its own elimination tree immediately breaks when linked against NixOS’s openblas-0.3.30. The symptom is a cascading series of xerbla-printed errors followed by either a stack-smash detection or a segfault:

** On entry to DGETRF parameter number  4 had an illegal value
** On entry to DPOTRF parameter number  4 had an illegal value
** On entry to DTRSM  parameter number 11 had an illegal value
** On entry to DGEMM  parameter number 13 had an illegal value
*** stack smashing detected ***: terminated

This was reproduced on three separate solvers (STRUMPACK in both 7.2 and 8.0, CHOLMOD supernodal, and MUMPS) — all failing with the same pattern. ulimit -s unlimited, OMP_NUM_THREADS=1, and a GCC-only toolchain (no Clang/libomp mixing) did not help.

Workaround: LD_PRELOAD=libmkl_rt.so.2 before importing the affected solver. MKL’s LAPACK handles the call patterns that NixOS’s OpenBLAS 0.3.30 chokes on. Once preloaded, every tested solver runs cleanly with residuals in the 1e-10 to 1e-15 band.
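Locating the wheel-bundled MKL for the preload can be scripted. A minimal sketch, assuming the usual site-packages layout (the search path and helper name are ours, not part of any package). Note that LD_PRELOAD must be exported before the interpreter starts — setting it from inside a running process has no effect on that process:

```python
import glob
import os
import sysconfig
from typing import Optional

def find_bundled_mkl() -> Optional[str]:
    """Search site-packages for a pip-wheel libmkl_rt shared object."""
    roots = {sysconfig.get_paths()["purelib"], sysconfig.get_paths()["platlib"]}
    for root in roots:
        hits = glob.glob(os.path.join(root, "**", "libmkl_rt.so*"), recursive=True)
        if hits:
            return sorted(hits)[0]  # deterministic pick if several versions exist
    return None

# LD_PRELOAD has to be in the environment before Python starts, so the natural
# home for this helper is a launcher that re-invokes the real script, e.g.:
#   LD_PRELOAD=$(python -c '...print(find_bundled_mkl())') python bench.py
```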

Root-cause-chasing with AddressSanitizer localised the actual read fault in STRUMPACK to dlaswp_plus() inside OpenBLAS — row-swap during the solve phase dereferences a pivot-index array that dgetrf() returned with undefined contents on degenerate fronts. Patches to STRUMPACK’s BLASLAPACKWrapper.cpp::getrf and DenseMatrix::laswp guard that specific corruption, but the remaining illegal-argument messages from dgemm, dpotrf, and dtrsm indicate the bug lives at several OpenBLAS call sites, not one. Fixing this for the whole ecosystem means either a new OpenBLAS release or a NixOS repackage — filing upstream is the right move.

Benchmark methodology#

Each solver was benchmarked in three regimes on the SOLID185 flat plate that femorph_solver.perf.bench_memory uses (the “perf” bench fixture):

  1. K_ff-only microbench — assemble K_ff and M_ff via Model._bc_reduce_and_release(), factor K_ff once, solve once, capture peak RSS. Measures raw solver overhead without pipeline noise.

  2. Static pipeline — full model.solve(linear_solver=...) with a surface load, from fresh Model.from_grid through displacement return. Measures the solver’s integration with assembly + BC reduce + solve.

  3. Modal pipeline — full model.modal_solve(n_modes=10, eigen_solver='arpack', linear_solver=...) at σ=0 shift-invert. ARPACK drives ~30–50 inner solves per outer iteration, so this is where the solver’s solve-phase overhead accumulates.

Each row runs in a fresh subprocess so ru_maxrss is clean. Thread cap: 4 (MAPDL parity). Environment preloads MKL:

OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 \
  OMP_WAIT_POLICY=PASSIVE KMP_BLOCKTIME=0 \
  LD_PRELOAD=<path>/libmkl_rt.so.2 \
  python perf/bench_solvers.py 

Sizes: 40×40×2, 80×80×2, 120×120×2, 160×160×2, 200×200×2.
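The fresh-subprocess pattern can be sketched as follows — the helper name and the JSON hand-off are illustrative, not the actual perf/bench_solvers.py implementation:

```python
import json
import os
import subprocess
import sys
import textwrap

def bench_in_subprocess(snippet, env_extra=None):
    """Run a one-line bench snippet in a fresh interpreter so ru_maxrss
    reflects only that run. The snippet fills a dict named `result`; wall
    time and peak RSS are appended and the dict comes back as JSON."""
    wrapper = textwrap.dedent(f"""
        import json, resource, time
        t0 = time.perf_counter()
        result = {{}}
        {snippet}
        result["wall_s"] = time.perf_counter() - t0
        # ru_maxrss is reported in KiB on Linux
        result["peak_mb"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
        print(json.dumps(result))
    """)
    env = {**os.environ, **(env_extra or {})}
    out = subprocess.run([sys.executable, "-c", wrapper],
                         env=env, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)
```

Each cell in the tables below corresponds to one such subprocess invocation, with the thread caps and LD_PRELOAD passed via env_extra.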

Results — K_ff-only microbench#

Factor wall time (seconds):

| size | Pardiso | MUMPS | CHOLMOD-sup | STRUMPACK | CHOLMOD-simp | Eigen LDLT |
|---|---|---|---|---|---|---|
| 40×40×2 | 0.32 | 0.17 | 0.09 | 0.38 | 0.69 | 0.31 |
| 80×80×2 | 0.63 | 0.82 | 0.59 | 1.09 | 3.12 | 5.01 |
| 120×120×2 | 1.31 | 2.69 | 1.73 | 4.43 | 8.75 | 32.6 |
| 160×160×2 | 2.23 | 3.51 | 7.22 | 8.04 | 36.6 | 49.3 |
| 200×200×2 | 3.48 | 4.38 | 10.7 | 16.8 | 48.4 | 127.6 |

Peak RSS (MB) and Δ vs Pardiso:

| size | Pardiso | MUMPS Δ | CHOLMOD-sup Δ | STRUMPACK Δ | CHOLMOD-simp Δ | Eigen Δ |
|---|---|---|---|---|---|---|
| 40×40×2 | 261 | −8 | −9 | +18 | −10 | −18 |
| 80×80×2 | 504 | −24 | −14 | +117 | +14 | +58 |
| 120×120×2 | 930 | −31 | −27 | +291 | +49 | +211 |
| 160×160×2 | 1538 | −46 | −30 | +585 | +98 | +531 |
| 200×200×2 | 2379 | −75 | −33 | +1021 | +223 | +937 |

At this measurement scale MUMPS saves 75 MB over Pardiso at 200×200×2. This turned out to be misleading — see the next section.

Results — full static pipeline#

| size | Pardiso wall | Pardiso peak | MUMPS wall | MUMPS peak | Δ wall | Δ peak |
|---|---|---|---|---|---|---|
| 80×80×2 | 0.69 s | 580 MB | 0.62 s | 594 MB | −10% | +14 MB |
| 120×120×2 | 1.47 s | 1091 MB | 1.71 s | 1121 MB | +16% | +30 MB |
| 160×160×2 | 2.66 s | 1816 MB | 3.19 s | 1852 MB | +20% | +36 MB |
| 200×200×2 | 4.28 s | 2772 MB | 5.46 s | 2853 MB | +28% | +81 MB |

Displacement fields match bit-for-bit. Pardiso wins both axes on the full static pipeline.

Results — full modal pipeline#

10 modes via ARPACK shift-invert at σ=0:

| size | Pardiso wall | Pardiso peak | MUMPS wall | MUMPS peak | Δ wall | Δ peak |
|---|---|---|---|---|---|---|
| 80×80×2 | 1.98 s | 476 MB | 1.54 s | 516 MB | −22% | +40 MB |
| 120×120×2 | 5.25 s | 849 MB | 5.03 s | 959 MB | −4% | +110 MB |
| 160×160×2 | 9.37 s | 1402 MB | 7.91 s | 1592 MB | −16% | +190 MB |
| 200×200×2 | 15.19 s | 2138 MB | 12.46 s | 2451 MB | −18% | +313 MB |

MUMPS is consistently 15–20% faster on modal at the cost of +200–300 MB peak RSS. The K_ff-only microbench’s −75 MB reading does not generalise — MUMPS’s solve-phase workspace accumulates across ARPACK’s ~30 inner solves, more than erasing the single-solve win.

Drop-in parity: modal frequencies from linear_solver="mumps" and linear_solver="pardiso" agree to max relative drift 2.5 × 10⁻¹⁰ across all 10 modes on a 40×40×2 SOLID185 cantilever. Well below the 1e-7 T1-P10 parity target.
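The parity check reduces to a max-relative-drift computation over the mode list. A minimal sketch — function names are ours, not femorph-solver’s:

```python
import numpy as np

def max_relative_drift(freqs_a, freqs_b):
    """Largest |f_a − f_b| / |f_a| over two frequency lists of equal length."""
    a = np.asarray(freqs_a, dtype=float)
    b = np.asarray(freqs_b, dtype=float)
    return float(np.max(np.abs(a - b) / np.abs(a)))

def assert_modal_parity(freqs_pardiso, freqs_alt, tol=1e-7):
    """Gate against the T1-P10 parity target (1e-7 relative)."""
    drift = max_relative_drift(freqs_pardiso, freqs_alt)
    assert drift < tol, f"modal drift {drift:.3e} exceeds {tol:.0e}"
```

The measured 2.5 × 10⁻¹⁰ drift clears this gate with almost three orders of magnitude to spare.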

STRUMPACK HSS — why the headline feature didn’t land#

STRUMPACK’s HSS (Hierarchically Semi-Separable) compression is why we spent the most effort there. The literature claims 2–4× factor compression for 3D elliptic problems — if that held for SOLID185, STRUMPACK would undercut Pardiso by several hundred MB.

Measured on the same mesh class, compression_tol = 1e-6:

| size | HSS 1e-6 peak (MB) | HSS 1e-6 residual | HSS 1e-8 peak (MB) | HSS 1e-8 residual | Pardiso peak (MB) |
|---|---|---|---|---|---|
| 40×40×2 | 279 | 3.9e-10 | 278 | 3.9e-10 | 261 |
| 80×80×2 | 621 | 3.8e-10 | 620 | 3.8e-10 | 504 |
| 120×120×2 | 1182 | 4.0e-05 | 1185 | 3.0e-09 | 930 |
| 160×160×2 | 2054 | 6.3e-08 | 2056 | 1.4e-06 | 1538 |
| 200×200×2 | 3244 | 5.1e-06 | 3251 | 2.5e-06 | 2380 |

Two problems compound:

  1. Peak RSS exceeds Pardiso at every size. STRUMPACK’s own working set plus the compressed factor together outgrow Pardiso’s supernodal factor.

  2. Accuracy degrades to 10⁻⁶ at large sizes regardless of the compression_tol setting. For modal ARPACK shift-invert (which needs ~1e-10 inner-solve accuracy at tol=1e-12) this would bleed into the outer eigenvalue accuracy.

SOLID185 is a first-order hex element. Its factor fronts have high off-diagonal rank — the elasticity Green’s function doesn’t produce the low-rank block structure HSS compression exploits. HSS’s sweet spot is elliptic integral operators and smooth-kernel discretisations, not structural FEM on uniform hex meshes.

This isn’t a build or configuration issue. It’s STRUMPACK telling us honestly that HSS has nothing to compress on our mesh class.

CHOLMOD supernodal — memory-competitive, always slower#

CHOLMOD supernodal (via scikit-sparse or a direct nanobind binding) runs once the NixOS OpenBLAS bug is sidestepped with MKL preload. On K_ff-only it saves ~30 MB vs Pardiso at 200×200×2 but takes 3× longer to factor. On the full modal pipeline the advantage disappears entirely.

Useful as a fallback for build environments where MKL isn’t linkable but a working OpenBLAS is — CHOLMOD is less picky about its BLAS than the supernodal-in-Fortran solvers (STRUMPACK, MUMPS).

Eigen SimplicialLDLT — the “no external deps” option#

Eigen’s SimplicialLDLT<SparseMatrix<double>, Lower> is header-only and runs cleanly on any BLAS. At 40×40×2 it’s actually the smallest peak RSS of everything tested (230 MB vs Pardiso’s 261). Above that it loses on both axes — scalar simplicial Cholesky with AMD ordering produces a denser factor than supernodal with METIS, and runs single-threaded. At 200×200×2 it’s 176 seconds and 3316 MB — 50× slower than Pardiso and using 937 MB more memory.

Useful only as a last-resort fallback for environments where nothing else links. Good to know it’s correct; not something to use in anger.

Honest recommendation#

Stay on Pardiso. This investigation didn’t find a solver that is strictly better than Pardiso for femorph-solver’s workload on this class of mesh. The full-pipeline numbers are the ones that matter — not K_ff-only microbenches — and on those Pardiso is the memory leader by 80–300 MB at 200×200×2.

Use MUMPS if you need 15–20 % faster modal and have RAM to spare. It’s a drop-in via a 60-line MumpsSolver adapter and produces the same frequencies Pardiso does to 2.5 × 10⁻¹⁰.
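The 60-line adapter reduces to a small contract: factor once at construction, solve repeatedly afterwards. A sketch of that shape using SciPy’s SuperLU as a stand-in factorisation (the real MumpsSolver wraps MUMPS the same way; the class name here is illustrative):

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla

class SuperLuAdapter:
    """Factor-once / solve-many contract that the solver registry expects."""

    def __init__(self, A):
        # Factor eagerly so repeated ARPACK inner solves reuse the factor.
        self._lu = spla.splu(sp.csc_matrix(A))

    def solve(self, b):
        return self._lu.solve(b)
```

MumpsSolver implements the same two methods around MUMPS’s analyse/factorise/solve phases.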

Don’t bother with STRUMPACK HSS on hex elasticity meshes — the compression ratio is negative and the error is high.

Skip CHOLMOD simplicial and Eigen LDLT for any mesh beyond “trivial” — they’re either slow, memory-hungry, or both.

Build environment notes (for NixOS / nix-like systems)#

  • gfortran is often not on PATH when GCC is — export /nix/store/.../gfortran-wrapper-*/bin before any build that needs Fortran.

  • NixOS GCC’s -fstack-protector-strong + -D_FORTIFY_SOURCE=2 default mask VLA bugs in STRUMPACK’s multifrontal path. Strip with -U_FORTIFY_SOURCE -fno-stack-protector to surface the real failure at its source (not the canary-check).

  • Mixing Clang libomp and GCC libgomp races over thread state. Pick one runtime end-to-end; on NixOS that means building C++ with CC=gcc CXX=g++ since gfortran’s libgomp is the default and OpenBLAS links against it.

  • NixOS OpenBLAS 0.3.30 has a real bug affecting sparse-solver supernodal kernels. Workaround: LD_PRELOAD=libmkl_rt.so.2.

SPRAL SSIDS — fast but memory-heavy on hex elasticity#

SPRAL was the single remaining candidate that might beat Pardiso on memory. Its published benchmarks show it matching HSL MA87 and beating MUMPS on several metrics while maintaining lower peak RSS than CHOLMOD on sparse SPD problems.

Getting SPRAL to run: the SSIDS CPU factor requires the OpenMP cancellation feature, which GCC’s libgomp keeps off by default. The solver returns inform%flag = -53 — undocumented numerically but prints SSIDS CPU code requires OMP cancellation to be enabled at print_level=10. Fix: export OMP_CANCELLATION=TRUE before libgomp initialises (i.e. before numpy / scipy import). Setting it from Python via os.environ post-import is a no-op because libgomp has already cached the runtime. Subprocess launchers need it in the env dict passed to Popen.
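A launcher sketch for the env-before-libgomp requirement — the helper name is ours, and the preload path is whatever libmkl_rt you resolved earlier:

```python
import os
import subprocess
import sys

def run_with_spral_env(cmd, mkl_preload=None):
    """Launch `cmd` in a child whose environment enables OMP cancellation
    before libgomp initialises. Setting os.environ after numpy/scipy import
    in the *current* process is too late; the child gets a clean slate."""
    env = dict(os.environ)
    env["OMP_CANCELLATION"] = "TRUE"
    if mkl_preload:  # e.g. the libmkl_rt.so.2 workaround for OpenBLAS 0.3.30
        env["LD_PRELOAD"] = mkl_preload
    return subprocess.run(cmd, env=env, capture_output=True, text=True)
```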

Measured results on the SOLID185 modal pipeline (4 threads, MKL preload, posdef=true):

| size | Pardiso wall | Pardiso peak | SPRAL wall | SPRAL peak | Δ peak vs Pardiso |
|---|---|---|---|---|---|
| 80×80×2 | 1.98 s | 476 MB | 1.88 s | 752 MB | +276 MB |
| 120×120×2 | 5.25 s | 849 MB | 4.64 s | 1450 MB | +601 MB |
| 160×160×2 | 9.37 s | 1402 MB | 8.46 s | 2475 MB | +1073 MB |
| 200×200×2 | 15.19 s | 2138 MB | 13.66 s | 3694 MB | +1556 MB |

SPRAL beats Pardiso on wall time at every mesh size tested (though MUMPS remains quicker at most sizes — compare the modal table above). But the peak-RSS cost is catastrophic: +276 MB at 80×80×2 widens to +1556 MB (+73%) at 200×200×2. SPRAL’s multifrontal factor allocates a much larger working set than Pardiso’s supernodal packing; the published memory advantages don’t hold on hex elasticity stiffness matrices.

So despite SPRAL being the last plausible memory candidate, it doesn’t deliver memory savings here either. It joins MUMPS as a speed alternative — if the user values modal wall time and has RAM headroom, SPRAL is 5–12% faster than Pardiso across sizes. If memory is binding, stay on Pardiso.

Three-way speed comparison at 200×200×2:

| solver | wall time | peak RSS | summary |
|---|---|---|---|
| Pardiso | 15.19 s | 2138 MB | baseline, memory leader |
| MUMPS | 12.46 s (−18%) | 2451 MB (+313 MB) | balanced speed / memory tradeoff |
| SPRAL | 13.66 s (−10%) | 3694 MB (+1556 MB) | faster than Pardiso at every size, worst memory |

Final recommendation unchanged: Pardiso stays the default. The two viable alternatives are both speed wins at memory cost:

  • MUMPS for a moderate speed boost (−15–20%) at moderate memory cost (+200–300 MB) — best general-purpose alternative.

  • SPRAL for comparable speed (5–12% faster than Pardiso) at much higher memory cost (+300–1500 MB) — only for memory-abundant setups.

SPRAL integration adapter lives at mapdl-re/pyspral/src/pyspral/femorph_integration.py. Drop-in via:

_REGISTRY["spral"] = SpralSolver
model.modal_solve(linear_solver="spral")

Requires OMP_CANCELLATION=TRUE in the environment and LD_PRELOAD=libmkl_rt.so.2 to sidestep the NixOS OpenBLAS 0.3.30 cascade.

Direct MKL Pardiso with working OOC — the biggest memory win#

pypardiso silently swallows iparm(60) (the OOC mode selector). Call set_iparm(60, 2) through pypardiso and it warns “2 is no input iparm”; the readback after factorisation shows 0. The assumption that this was an MKL limitation of the pip-wheel build turned out to be wrong — it’s pypardiso’s Python layer filtering the iparm dict.

Agent-3’s DirectMklPardisoSolver (in-tree at femorph_solver.solvers.linear._mkl_pardiso, PR #126) calls pardiso_() via ctypes directly. Bypassing pypardiso unlocks the full iparm array — including iparm(60)=2 for out-of-core factor. The pip-wheel MKL’s OOC implementation works as long as MKL_PARDISO_OOC_PATH points at a writable directory.

Design highlights of the in-tree solver:

  • Reuses pypardiso’s already-loaded MKL handle via pypardiso.PyPardisoSolver(mtype=11).libmkl — no duplicate dlopen, same resolution heuristic.

  • iparm(34)=1 — 0-based CSR indices eliminate the +1 rewrite copies pypardiso does on every solve (~65 MB saved per solve at 150×150×2).

  • iparm(23)=1 + iparm(24)=2 — “improved 2-level parallel” factorisation + 2-level solve, roughly halves solve wall time vs the classical path.

  • Phase 12 in one call (analyse + factorise).

  • Phase -1 on __del__ — same MKL factor-leak fix as PR #107.

OOC bug uncovered during this investigation: on meshes at or above 160×160×2 (n_dof ≥ 231 840), ooc=True returned MKL error -2 (insufficient memory) at phase=12. Root cause: the iparm(23)=1 + iparm(24)=2 fast-parallel allocations are incompatible with iparm(60)=2 OOC on this MKL build — the per-thread scratch sized for the whole frontal matrix is what OOC refuses to spill.

Fix (local branch fix/mkl-pardiso-ooc-disable-fast-parallel): drop iparm(23)/(24) to the classical parallel path when OOC is requested. Costs ~2× solve wall time in exchange for OOC actually working at scale. With the patch applied:

Agent-3 DirectMklPardisoSolver — in-core vs OOC#

size

in-core peak

OOC peak

Δ peak

in-core factor

OOC factor

80×80×2

505 MB

511 MB

+6 MB

0.38 s

0.67 s

120×120×2

920 MB

935 MB

+15 MB

0.82 s

1.26 s

160×160×2

1526 MB

1265 MB

−261 MB (−17%)

1.73 s

2.49 s

200×200×2

2371 MB

1604 MB

−767 MB (−32%)

2.50 s

4.12 s

This is the single biggest SPD-solver memory reduction measured across the entire investigation. The gap widens with problem size — at small meshes OOC equals in-core because the factor fits in MAX_CORE_SIZE; at large meshes OOC streams the overflow to disk, capping peak RSS near MAX_CORE_SIZE + K / M / eigensolve workspace.

Accuracy is unchanged (residual 4.3e-11 on both at 200×200×2); factorisation is ~65% slower and solves are 2–3× slower because disk IO sits in the critical path.
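The disable-when-OOC guard from the fix branch reduces to a small override map. A dict-based sketch of that logic, using the iparm numbering as written on this page (illustrative only — the real solver writes these into MKL’s 64-slot array):

```python
def pardiso_iparm_overrides(ooc: bool) -> dict:
    """Non-default iparm entries, keyed by the index convention used above."""
    ip = {
        34: 1,  # 0-based CSR indices: skip pypardiso's +1 rewrite copies
        23: 1,  # improved two-level parallel factorisation
        24: 2,  # two-level parallel solve
    }
    if ooc:
        ip[60] = 2            # full out-of-core factor
        ip[23] = ip[24] = 0   # fast-parallel scratch cannot spill under OOC
    return ip
```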

Enablement:

# Required env vars — must be set before first MKL call.
export MKL_PARDISO_OOC_PATH=/tmp/mkl_pardiso_ooc
export MKL_PARDISO_OOC_MAX_CORE_SIZE=500

# In Python
from femorph_solver.solvers.linear._mkl_pardiso import DirectMklPardisoSolver
lu = DirectMklPardisoSolver(K_ff, assume_spd=True, ooc=True)

Or via the registry when the fix lands:

model.modal_solve(linear_solver="mkl_direct_ooc")  # future registration

Tradeoff context at 200×200×2:

  • Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor

  • Pardiso (DirectMklPardiso, in-core): 2371 MB peak, 2.50 s factor

  • Pardiso (DirectMklPardiso, OOC): 1604 MB peak, 4.12 s factor

  • MUMPS: 2304 MB peak, 4.38 s factor

  • SPRAL: 3694 MB peak, 13.66 s factor

  • Eigen CG+IC0: 1091 MB peak at 160×160×2, 44× slower

Direct-MKL OOC delivers the largest peak-RSS reduction of any direct solver tested while remaining competitive on wall time (~2× slower than in-core Pardiso, still faster than MUMPS). This is the recommended memory-constrained path for femorph-solver.

Iterative fallback — Eigen CG with IncompleteCholesky#

After exhausting direct solvers, the final angle was preconditioned conjugate gradient with incomplete Cholesky (IC(0)) — pure iterative, no factor to store. Memory is just the matrix, the IC preconditioner (roughly the size of triu K), and a few CG basis vectors. If it converges in reasonable time, this is the single lowest-memory option for SPD.

Measured on the K_ff-only pipeline (tol = 1e-6, max 100 000 iterations):

| size | n_dof | CG wall | CG iters | CG peak | Pardiso peak |
|---|---|---|---|---|---|
| 40×40×2 | 14 760 | 1.35 s | 1 468 | 211 MB | 261 MB (−50) |
| 80×80×2 | 58 320 | 17.8 s | 3 368 | 387 MB | 504 MB (−117) |
| 120×120×2 | 130 680 | 43.6 s | 3 325 | 683 MB | 930 MB (−247) |
| 160×160×2 | 231 840 | 97.9 s | 4 242 | 1091 MB | 1538 MB (−447) |

At 160×160×2 CG+IC(0) uses 29% less memory than Pardiso. The gap widens with problem size — this is the only candidate that actually shrinks faster than Pardiso in absolute terms as the mesh grows.

The fatal catches:

  1. 44× slower than Pardiso at 160×160×2 (97.9 s vs 2.23 s). The iteration count scales with \(\sqrt{\mathrm{cond}(K)}\), which for structured hex elasticity grows like \(h^{-1}\). Not scalable.

  2. Residual floor at 1e-6 — the IC(0) preconditioner isn’t strong enough to drive CG below 1e-6 in any reasonable iteration budget on elasticity stiffness. ARPACK shift-invert at tolerance 1e-12 needs the inner solver at ~1e-10 so modal is out of reach; a 1e-6 residual is fine for static solves where the user accepts single-precision-ish accuracy.

Use cases:

  • Static solve on a severely memory-constrained machine where shaving 29% off peak RSS matters more than a 40× speed hit and the user accepts 1e-6 accuracy. Not the default — a niche override.

  • Not viable for modal — the residual ceiling would bleed into ARPACK’s eigenvalue accuracy.

This is genuinely the only memory-saving option we found, and the tradeoff is steep. Exposed via pyeigensparse.CGICSolver; bindings live at mapdl-re/pyeigensparse/.

Direct MKL Pardiso wrapper — unlocking OOC#

pypardiso silently swallows iparm(60) (the OOC mode selector) — when you call set_iparm(60, 2) it warns “2 is no input iparm” and the readback after factorisation shows 0. We suspected this was an MKL limitation in the pip-wheel build.

It wasn’t. It was pypardiso’s Python layer. A 300-line ctypes wrapper calling pardiso_() directly (no pypardiso) honours every iparm we set — including iparm(60)=2 (full out-of-core). The pip-wheel MKL’s OOC works as long as MKL_PARDISO_OOC_PATH points at a writable directory.

Package: mapdl-re/pymklpardiso/ — direct pardiso_() wrapper via nanobind.

Measured on the SOLID185 K_ff-only bench (4 threads, MAX_CORE_SIZE=500 MB):

size

in-core peak

OOC peak

Δ peak

in-core factor

OOC factor

80×80×2

487 MB

487 MB

0 (fits in core)

0.30 s

0.58 s

120×120×2

906 MB

911 MB

+5 MB

0.74 s

1.28 s

160×160×2

1497 MB

1214 MB

−283 MB (−19%)

1.41 s

2.62 s

200×200×2

2338 MB

1540 MB

−798 MB (−34%)

2.43 s

4.24 s

This is the biggest memory win of the investigation. The gap grows with problem size:

  • Small meshes (<500 MB factor): OOC equals in-core — the core budget is big enough to hold everything, MKL never spills.

  • Large meshes (>500 MB factor): OOC streams the overflow to disk, capping peak RSS near MAX_CORE_SIZE + K / M / eigensolve workspace.

Residual is identical (4.3e-11 on both at 200×200×2) — no accuracy cost. Factor time is ~70% slower because disk IO is in the critical path; solve time is ~6× slower for the same reason (each back-solve touches the on-disk factor blocks).

Tradeoff calculus at 200×200×2:

  • Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor.

  • Pardiso (our direct wrapper, OOC): 1540 MB peak, 4.24 s factor.

  • MUMPS: 2304 MB peak, 4.38 s factor.

  • SPRAL: 3694 MB peak, 13.66 s factor.

The direct-MKL OOC path beats every other solver tested on peak RSS while remaining competitive on wall time.

Integration: pymklpardiso/src/pymklpardiso/femorph_integration.py exposes two adapter classes:

  • MklPardisoOOCSolver — iparm(60)=2, name mklp_ooc. Available only when MKL_PARDISO_OOC_PATH is set.

  • MklPardisoInCoreSolver — iparm(60)=0, name mklp_direct. Useful for A/B benches against pypardiso without OOC noise.

Users opt in via:

from pymklpardiso.femorph_integration import MklPardisoOOCSolver
_REGISTRY["mklp_ooc"] = MklPardisoOOCSolver
# Set MKL_PARDISO_OOC_PATH + MKL_PARDISO_OOC_MAX_CORE_SIZE in env
model.modal_solve(linear_solver="mklp_ooc")

The two required env variables mean available() returns False by default — users explicitly opt in by exporting them.
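The gate itself is just an environment check. A sketch of the described behaviour (available() is the adapter’s own hook; the exact semantics in femorph_integration.py may differ):

```python
import os

def ooc_available() -> bool:
    """Opt-in gate: both OOC env vars must be exported for mklp_ooc to register."""
    return bool(os.environ.get("MKL_PARDISO_OOC_PATH")) and \
           bool(os.environ.get("MKL_PARDISO_OOC_MAX_CORE_SIZE"))
```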

Note on the pypardiso vs direct-MKL behavioural difference: we don’t yet know exactly what pypardiso does to iparm(60), only that the readback differs. Worth filing an upstream issue with pypardiso — the current behaviour silently downgrades OOC without telling the caller. Our wrapper is the workaround.

Full candidate-search summary#

To close the loop on what was investigated:

| Solver | Built / works? | Final position vs Pardiso on SOLID185 modal |
|---|---|---|
| MKL Pardiso | N/A (shipped) | memory leader, reference |
| MUMPS | ✅ | speed alternative (−18%) at moderate memory cost (+313 MB) |
| SPRAL SSIDS | ✅ (w/ OMP_CANCELLATION=TRUE) | fast (−5 to −12%) but massive memory cost (+1556 MB at 200×200×2) |
| CHOLMOD supernodal | ✅ (w/ MKL preload) | −33 MB on K_ff-only, 3× slower; flat on full pipeline |
| STRUMPACK (exact) | ✅ (w/ MKL preload + getrf patch) | loses on both axes |
| STRUMPACK (HSS) | ✅ (same build) | worse memory + degraded accuracy on hex elasticity |
| Eigen SimplicialLDLT | ✅ | disqualified; scalar simplicial loses everywhere |
| Eigen CG + IC(0) | ✅ | lowest memory (−447 MB at 160×160×2); 44× slower, 1e-6 residual ceiling |
| CHOLMOD simplicial | ✅ | slow and dense; no use case |
| SuperLU 7.0.1 (NixOS) | — | LU (not SPD-exploiting); skip — won’t beat Cholesky solvers |
| PaStiX | ❌ | not on NixOS; build-from-source deferred |
| HSL MA87 / MA97 | ❌ | academic license not obtained |
| Schenk academic PARDISO 7 | ❌ | different-from-Intel fork; download + build deferred |
| Ginkgo / cuDSS / Amesos2 | ❌ | not on NixOS; heavy build cost; deferred |

The candidates left in “not attempted” are all multi-hour builds for a speculative gain — HSL MA87 is the most promising (specifically designed for low-memory SPD Cholesky and used alongside SPRAL by STFC) but requires registering for the academic license. PaStiX is a plausible second option but its INRIA build system + Scotch + BLAS threading requires serious packaging work on NixOS.

Bottom line after the full sweep:

  • ``DirectMklPardisoSolver(ooc=True)`` is the memory winner: −767 MB (−32%) at 200×200×2 with the iparm(23)/(24) disable-when-OOC patch (fix/mkl-pardiso-ooc-disable-fast-parallel branch). Identical accuracy, ~2× longer solve. This is the recommended path when peak RSS is binding.

  • Keep pypardiso’s ``Pardiso`` as the in-core default — best all-around speed/memory for users who don’t enable OOC.

  • Expose MUMPS as ``linear_solver="mumps"`` — clean 15–20% modal speedup at +300 MB peak RSS cost. Good for speed-first, RAM-rich users.

  • Expose SPRAL as ``linear_solver="spral"`` — among the fastest direct solvers on small/medium sizes, +1.5 GB memory cost at 200×200×2.

  • Expose Eigen CG+IC0 as ``linear_solver="cg_ic0"`` — alternate memory-reduction path for static only; 1e-6 residual ceiling, 44× slower than Pardiso. Worth having for memory-constrained static workflows that accept single-precision-ish accuracy.

  • STRUMPACK / CHOLMOD simplicial / Eigen LDLT — don’t land; they lose on both axes of the full-pipeline bench.

  • HSL MA87 + PaStiX — follow-up candidates if we hit the MKL OOC floor and need to push further. Both require external source access (HSL has academic-license gate, PaStiX’s INRIA GitLab host wasn’t in the sandbox’s trusted-SCM list and would need explicit permission to clone).

Supporting artefacts#

The raw bench data + scripts are under mapdl-re/ (local-only):

  • pycholmod/bench_vs_pardiso.py — cross-solver K_ff-only bench

  • pycholmod/bench_strumpack_mumps.py — STRUMPACK + MUMPS additions

  • pycholmod/bench_strumpack_hss.py — HSS tolerance sweep

  • pymumps/bench_integrated_static.py — full static pipeline

  • pymumps/bench_integrated_modal.py — full modal pipeline

  • pymumps/test_integrated_modal.py — drop-in parity verification

  • pymumps/src/pymumps/femorph_integration.py — the MumpsSolver adapter itself

  • SOLVER_COMPARISON.md at the workspace root — the summary that this page was distilled from