Solver exploration (extras)
===========================

This page documents an in-depth investigation into sparse SPD direct solvers
as alternatives to MKL Pardiso for femorph-solver's static and modal
pipelines. It's an "extras" page in the sense that none of it changes the
recommended default — that remains Pardiso — but it closes a pile of real
questions about why, and points at what to try if your constraints differ
from the common case.

.. contents::
   :local:
   :depth: 2

Why we looked
-------------

After landing the low-hanging memory-reduction PRs (grid-as-sole-truth, triu
K/M assembly, shared sparsity pattern, pypardiso's ``A.copy()`` elision,
factor-release on GC), peak RSS on a ``200×200×2`` SOLID185 modal solve sat
at ~2130 MB — about 2× MAPDL's in-core. The remaining gap is the Pardiso
Cholesky factor itself (iparm(17) = 1.14 GB peak during factorisation,
0.39 GB permanent). That's a hard floor for a direct solver on this mesh
topology.

Two plausible ways to move past that floor:

1. A solver that produces a smaller factor (MUMPS, CHOLMOD, STRUMPACK with
   HSS compression, SPRAL).
2. A solver with working out-of-core (OOC) that caps peak RSS regardless of
   problem size.

MKL Pardiso appeared to ignore ``iparm(60)`` OOC on pypardiso's bundled MKL
build — ``set_iparm(60, 2)`` warns "no input iparm" and the readback is 0.
The investigation below started as a systematic sweep of option 1 on that
assumption.

**Plot twist**: it turned out pypardiso's Python layer was filtering the
iparm dict — not MKL itself. A direct ctypes wrapper (agent-3's
:class:`DirectMklPardisoSolver`, PR #126) honours ``iparm(60)=2``, and the
pip-wheel MKL's OOC works. With a small fix to disable incompatible
fast-parallel flags during OOC, **we land −767 MB peak (−32%) at 200×200×2**
— the biggest memory win of the investigation. See the "Direct MKL Pardiso
with working OOC" section below for the details.
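The observed failure mode (a write to ``iparm(60)`` that warns and reads back 0) is consistent with a Python-side whitelist. A minimal sketch of that pattern, purely illustrative — the class name, warning text, and whitelist below are hypothetical, not pypardiso's actual code:

```python
# Hypothetical sketch of a Python-side iparm whitelist. pypardiso's real
# implementation differs; this only reproduces the symptom described above:
# the write is dropped with a warning and the readback stays 0.
_INPUT_IPARMS = {1, 7, 9, 10, 12, 20, 33}  # illustrative whitelist only


class IparmLayer:
    def __init__(self):
        self._iparm = [0] * 64

    def set_iparm(self, i, value):
        if i not in _INPUT_IPARMS:
            print(f"{i} is no input iparm")  # warning; write silently dropped
            return
        self._iparm[i] = value

    def get_iparm(self, i):
        return self._iparm[i]
```

A filter like this explains why calling the underlying ``pardiso_()`` entry point directly, with a raw 64-slot iparm array, behaves differently from going through the wrapper.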
The option-1 sweep (candidate direct solvers) still produced useful data —
it's how we ruled in MUMPS as a speed alternative and ruled out STRUMPACK
HSS / scalar simplicial Cholesky.

Candidate solvers considered
----------------------------

All are sparse direct SPD Cholesky-class solvers that could plausibly match
or beat MKL Pardiso:

.. list-table::
   :widths: 18 12 50 20
   :header-rows: 1

   * - Solver
     - Origin
     - Selling point
     - Final verdict
   * - **MKL Pardiso**
     - Intel
     - Supernodal, multithreaded. Production default.
     - Memory leader on realistic pipelines
   * - **MUMPS**
     - CERFACS / IRIT
     - Multifrontal, real OOC (if compiled in)
     - Speed alternative for modal; +313 MB memory at 200×200×2
   * - **CHOLMOD supernodal**
     - SuiteSparse
     - Supernodal, uses BLAS-3 heavily
     - −33 MB at K_ff stage, 3× slower; +8 MB on full pipeline
   * - **CHOLMOD simplicial**
     - SuiteSparse
     - Scalar Cholesky; very low memory per the docs
     - Pathologically slow at scale (19× factor) and denser factor
   * - **STRUMPACK exact**
     - LBNL
     - Multifrontal with optional HSS compression
     - +1 GB vs Pardiso; defeated before HSS even kicks in
   * - **STRUMPACK HSS**
     - LBNL
     - 2–4× factor compression on elliptic problems (theoretical)
     - HSS ratio on 3D hex elasticity is negative — more memory AND lower
       accuracy
   * - **Eigen SimplicialLDLT**
     - Guennebaud et al.
     - Header-only, no external deps
     - Slowest; disqualified on both axes
   * - **SPRAL SSIDS**
     - STFC UK
     - Modern multithreaded, low memory footprint per the literature
     - Runs once ``OMP_CANCELLATION=TRUE`` is set (initially returned
       ``flag = -53`` without it); fastest on modal but +1556 MB at
       200×200×2
   * - **PaStiX**
     - INRIA
     - Static scheduling, Scotch ordering
     - Not on NixOS; not attempted.
   * - **HSL MA87**
     - Harwell / STFC
     - Designed for minimum memory on SPD Cholesky
     - Academic license; not attempted.
The NixOS OpenBLAS 0.3.30 bug
-----------------------------

Any non-Pardiso solver that calls supernodal-style dense LAPACK kernels from
inside its own elimination tree immediately breaks when linked against
NixOS's ``openblas-0.3.30``. The symptom is a cascade of ``xerbla``-printed
errors followed by either a stack-smash detection or a segfault:

.. code-block:: text

   ** On entry to DGETRF parameter number 4 had an illegal value
   ** On entry to DPOTRF parameter number 4 had an illegal value
   ** On entry to DTRSM parameter number 11 had an illegal value
   ** On entry to DGEMM parameter number 13 had an illegal value
   *** stack smashing detected ***: terminated

This was reproduced on **three separate solvers** (STRUMPACK 7.2 and 8.0,
CHOLMOD supernodal, MUMPS) — all failing with the same pattern. ``ulimit -s
unlimited``, ``OMP_NUM_THREADS=1``, and a GCC-only toolchain (no
Clang/libomp mixing) did not help.

**Workaround**: ``LD_PRELOAD=libmkl_rt.so.2`` before importing the affected
solver. MKL's LAPACK handles the call patterns that NixOS's OpenBLAS 0.3.30
chokes on. Once preloaded, every tested solver runs cleanly with residuals
in the 1e-10 to 1e-15 band.

Root-cause chasing with AddressSanitizer localised the actual read fault in
STRUMPACK to :func:`dlaswp_plus` inside OpenBLAS — the row swap during the
solve phase dereferences a pivot-index array that :func:`dgetrf` returned
with undefined contents on degenerate fronts. Patches to STRUMPACK's
``BLASLAPACKWrapper.cpp::getrf`` and ``DenseMatrix::laswp`` guard that
specific corruption, but the remaining illegal-argument messages from
``dgemm``, ``dpotrf``, and ``dtrsm`` indicate the bug lives at several
OpenBLAS call sites, not one. Fixing this for the whole ecosystem means
either a new OpenBLAS release or a NixOS repackage — filing upstream is the
right move.
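Because ``LD_PRELOAD`` is consumed by the dynamic linker at process startup, the cleanest way to apply the workaround from Python is to relaunch the affected work in a child process with the variable already in its environment. A minimal sketch; the MKL path shown is illustrative, substitute the real location on your system:

```python
import os
import subprocess
import sys


def run_with_mkl_preload(argv, libmkl="/usr/lib/libmkl_rt.so.2"):
    # LD_PRELOAD must be present before the child's dynamic linker maps the
    # process, so it goes into the child's env dict. Setting it in
    # os.environ of an already-running interpreter has no effect on that
    # interpreter's own BLAS.
    env = dict(os.environ, LD_PRELOAD=libmkl)
    return subprocess.run(argv, env=env, capture_output=True, text=True)
```

If the path is wrong, glibc's loader prints a warning on stderr and continues with the default BLAS, so it is worth confirming (e.g. via ``/proc/<pid>/maps``) that MKL actually loaded.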
Benchmark methodology
---------------------

Each solver was benchmarked in three regimes on the SOLID185 flat plate that
:mod:`femorph_solver.perf.bench_memory` uses (the "perf" bench fixture):

1. **K_ff-only microbench** — assemble ``K_ff`` and ``M_ff`` via
   :meth:`Model._bc_reduce_and_release`, factor ``K_ff`` once, solve once,
   capture peak RSS. Measures raw solver overhead without pipeline noise.
2. **Static pipeline** — full ``model.solve(linear_solver=...)`` with a
   surface load, from fresh ``Model.from_grid`` through displacement return.
   Measures the solver's integration with assembly + BC reduce + solve.
3. **Modal pipeline** — full ``model.modal_solve(n_modes=10,
   eigen_solver='arpack', linear_solver=...)`` at σ=0 shift-invert. ARPACK
   drives ~30–50 inner solves per outer iteration, so this is where the
   solver's solve-phase overhead accumulates.

Each row runs in a **fresh subprocess** so ``ru_maxrss`` is clean. Thread
cap: 4 (MAPDL parity). Environment preloads MKL:

.. code-block:: bash

   OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 \
   OMP_WAIT_POLICY=PASSIVE KMP_BLOCKTIME=0 \
   LD_PRELOAD=/libmkl_rt.so.2 \
   python perf/bench_solvers.py …

Sizes: ``40×40×2``, ``80×80×2``, ``120×120×2``, ``160×160×2``,
``200×200×2``.

Results — K_ff-only microbench
------------------------------

Factor wall time (seconds):

.. list-table::
   :widths: 15 12 12 12 14 14 14
   :header-rows: 1

   * - size
     - Pardiso
     - MUMPS
     - CHOLMOD-sup
     - STRUMPACK
     - CHOLMOD-simp
     - Eigen LDLT
   * - 40×40×2
     - 0.32
     - **0.17**
     - 0.09
     - 0.38
     - 0.69
     - 0.31
   * - 80×80×2
     - **0.63**
     - 0.82
     - 0.59
     - 1.09
     - 3.12
     - 5.01
   * - 120×120×2
     - **1.31**
     - 2.69
     - 1.73
     - 4.43
     - 8.75
     - 32.6
   * - 160×160×2
     - **2.23**
     - 3.51
     - 7.22
     - 8.04
     - 36.6
     - 49.3
   * - 200×200×2
     - **3.48**
     - 4.38
     - 10.7
     - 16.8
     - 48.4
     - 127.6

Peak RSS (MB) and Δ vs Pardiso:

.. list-table::
   :widths: 15 10 12 14 14 14 12
   :header-rows: 1

   * - size
     - Pardiso
     - MUMPS Δ
     - CHOLMOD-sup Δ
     - STRUMPACK Δ
     - CHOLMOD-simp Δ
     - Eigen Δ
   * - 40×40×2
     - 261
     - −8
     - −9
     - +18
     - −10
     - −18
   * - 80×80×2
     - 504
     - −24
     - −14
     - +117
     - +14
     - +58
   * - 120×120×2
     - 930
     - −31
     - −27
     - +291
     - +49
     - +211
   * - 160×160×2
     - 1538
     - −46
     - −30
     - +585
     - +98
     - +531
   * - 200×200×2
     - 2379
     - **−75**
     - −33
     - +1021
     - +223
     - +937

At this measurement scale MUMPS saves 75 MB over Pardiso at 200×200×2.
**This turned out to be misleading** — see the next section.

Results — full static pipeline
------------------------------

.. list-table::
   :widths: 12 12 12 12 12 12 12
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - MUMPS wall
     - MUMPS peak
     - Δ wall
     - Δ peak
   * - 80×80×2
     - 0.69 s
     - 580 MB
     - **0.62 s**
     - 594 MB
     - −10%
     - +14 MB
   * - 120×120×2
     - 1.47 s
     - 1091 MB
     - 1.71 s
     - 1121 MB
     - +16%
     - +30 MB
   * - 160×160×2
     - 2.66 s
     - 1816 MB
     - 3.19 s
     - 1852 MB
     - +20%
     - +36 MB
   * - 200×200×2
     - **4.28 s**
     - **2772 MB**
     - 5.46 s
     - 2853 MB
     - +28%
     - +81 MB

Displacement fields match bit-for-bit. Pardiso wins on both axes on the full
static pipeline.

Results — full modal pipeline
-----------------------------

10 modes via ARPACK shift-invert at σ=0:

.. list-table::
   :widths: 12 12 12 12 12 12 12
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - MUMPS wall
     - MUMPS peak
     - Δ wall
     - Δ peak
   * - 80×80×2
     - 1.98 s
     - 476 MB
     - **1.54 s**
     - 516 MB
     - **−22%**
     - +40 MB
   * - 120×120×2
     - 5.25 s
     - 849 MB
     - **5.03 s**
     - 959 MB
     - −4%
     - +110 MB
   * - 160×160×2
     - 9.37 s
     - 1402 MB
     - **7.91 s**
     - 1592 MB
     - **−16%**
     - +190 MB
   * - 200×200×2
     - 15.19 s
     - **2138 MB**
     - **12.46 s**
     - 2451 MB
     - **−18%**
     - **+313 MB**

MUMPS is consistently 15–20 % faster on modal at the cost of +200–300 MB
peak RSS. The K_ff-only microbench's −75 MB reading does not generalise —
MUMPS's solve-phase workspace accumulates across ARPACK's ~30 inner solves,
more than erasing the single-solve win.
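The fresh-subprocess measurement the tables above rely on can be sketched as follows. This is a sketch, not the actual ``bench_solvers.py`` harness; note that Linux reports ``ru_maxrss`` in KiB:

```python
import resource
import subprocess
import sys


def child_peak_rss_mb(code: str) -> float:
    # Run the bench body in a fresh interpreter so its peak RSS is not
    # polluted by allocations already made in this process.
    subprocess.run([sys.executable, "-c", code], check=True)
    # RUSAGE_CHILDREN reports the max over all waited-for children; with
    # one child per measurement process, that is exactly the bench's peak.
    return resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024.0
```

Running each row in its own measurement process is what makes ``ru_maxrss`` meaningful, since a process's high-water mark never decreases.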
Drop-in parity: modal frequencies from ``linear_solver="mumps"`` and
``linear_solver="pardiso"`` agree to **max relative drift 2.5 × 10⁻¹⁰**
across all 10 modes on a 40×40×2 SOLID185 cantilever — well below the 1e-7
T1-P10 parity target.

STRUMPACK HSS — why the headline feature didn't land
----------------------------------------------------

STRUMPACK's HSS (Hierarchically Semi-Separable) compression is why we spent
the most effort there. The literature claims 2–4× factor compression for 3D
elliptic problems — if that held for SOLID185, STRUMPACK would undercut
Pardiso by several hundred MB.

Measured on the same mesh class, at ``compression_tol`` 1e-6 and 1e-8:

.. list-table::
   :widths: 15 15 15 15 15 15
   :header-rows: 1

   * - size
     - HSS 1e-6 peak (MB)
     - HSS 1e-6 residual
     - HSS 1e-8 peak (MB)
     - HSS 1e-8 residual
     - Pardiso peak (MB)
   * - 40×40×2
     - 279
     - 3.9e-10
     - 278
     - 3.9e-10
     - 261
   * - 80×80×2
     - 621
     - 3.8e-10
     - 620
     - 3.8e-10
     - 504
   * - 120×120×2
     - 1182
     - **4.0e-05**
     - 1185
     - 3.0e-09
     - 930
   * - 160×160×2
     - 2054
     - **6.3e-08**
     - 2056
     - **1.4e-06**
     - 1538
   * - 200×200×2
     - 3244
     - **5.1e-06**
     - 3251
     - **2.5e-06**
     - 2380

Two problems stack:

1. **Peak RSS exceeds Pardiso at every size.** STRUMPACK's own working set
   plus the compressed factor adds up to more than Pardiso's supernodal
   factor.
2. **Accuracy degrades to 10⁻⁶** at large sizes regardless of the
   ``compression_tol`` setting. For modal ARPACK shift-invert (which needs
   ~1e-10 inner-solve accuracy at ``tol=1e-12``) this would bleed into the
   outer eigenvalue accuracy.

SOLID185 is a first-order hex element. Its factor fronts have high
off-diagonal rank — the elasticity Green's function doesn't produce the
low-rank block structure HSS compression exploits. HSS's sweet spot is
elliptic integral operators and smooth-kernel discretisations, not
structural FEM on uniform hex meshes.

This isn't a build or configuration issue. It's STRUMPACK telling us
honestly that HSS has nothing to compress on our mesh class.
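For reference, the residual columns in the tables here are relative residuals; assuming the usual definition, the metric is:

```python
import numpy as np


def rel_residual(K, x, b):
    # ||K x - b|| / ||b||, the backward-error metric quoted in the tables.
    # K can be any dense or scipy-sparse operator supporting the @ product.
    r = K @ x - b
    return float(np.linalg.norm(r) / np.linalg.norm(b))
```

A direct solver lands this in the 1e-10 to 1e-15 band on these problems; values drifting up toward 1e-6 are what disqualify a factorisation for shift-invert work.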
CHOLMOD supernodal — memory-competitive, always slower
------------------------------------------------------

CHOLMOD supernodal (via ``scikit-sparse`` or a direct nanobind binding) runs
once the NixOS OpenBLAS bug is sidestepped with the MKL preload. On
K_ff-only it saves ~30 MB vs Pardiso at 200×200×2 but takes 3× longer to
factor. On the full modal pipeline the advantage disappears entirely.

Useful as a fallback for build environments where MKL isn't linkable but a
working OpenBLAS is — CHOLMOD is less picky about its BLAS than the
multifrontal solvers (STRUMPACK, MUMPS).

Eigen SimplicialLDLT — the "no external deps" option
----------------------------------------------------

Eigen's ``SimplicialLDLT<SparseMatrix<double>, Lower>`` is header-only and
runs cleanly on any BLAS. At 40×40×2 it's actually the smallest peak RSS of
everything tested (230 MB vs Pardiso's 261). Above that it loses on both
axes — scalar simplicial Cholesky with AMD ordering produces a denser
factor than supernodal with METIS, and it runs single-threaded. At
200×200×2 it's 176 seconds and 3316 MB — 50× slower than Pardiso and using
937 MB more memory.

Useful only as a last-resort fallback for environments where nothing else
links. Good to know it's correct; not something to use in anger.

Honest recommendation
---------------------

**Stay on Pardiso.** The night's investigation didn't find a solver that's
strictly better than Pardiso for femorph-solver's workload on this class of
mesh. The full-pipeline numbers are the ones that matter — not K_ff-only
microbenches — and on those Pardiso is the memory leader by 80–300 MB at
200×200×2.

**Use MUMPS if you need 15–20 % faster modal and have RAM to spare.** It's
a drop-in via a 60-line ``MumpsSolver`` adapter and reproduces Pardiso's
frequencies to 2.5 × 10⁻¹⁰.

**Don't bother with STRUMPACK HSS** on hex elasticity meshes — the
compression ratio is negative and the error is high.
**Skip CHOLMOD simplicial and Eigen LDLT** for any mesh beyond "trivial" —
they're either slow, memory-hungry, or both.

Build environment notes (for NixOS / nix-like systems)
------------------------------------------------------

* ``gfortran`` is often not on PATH when GCC is — export
  ``/nix/store/.../gfortran-wrapper-*/bin`` before any build that needs
  Fortran.
* NixOS GCC's default ``-fstack-protector-strong`` + ``-D_FORTIFY_SOURCE=2``
  mask VLA bugs in STRUMPACK's multifrontal path. Strip them with
  ``-U_FORTIFY_SOURCE -fno-stack-protector`` to surface the real failure at
  its source (not at the canary check).
* Mixing Clang's ``libomp`` and GCC's ``libgomp`` races over thread state.
  Pick one runtime end-to-end; on NixOS that means building C++ with
  ``CC=gcc CXX=g++``, since gfortran's ``libgomp`` is the default and
  OpenBLAS links against it.
* NixOS OpenBLAS 0.3.30 has a real bug affecting sparse-solver supernodal
  kernels. Workaround: ``LD_PRELOAD=libmkl_rt.so.2``.

SPRAL SSIDS — fast but memory-heavy on hex elasticity
-----------------------------------------------------

SPRAL was the single remaining candidate that might beat Pardiso on memory.
Its published benchmarks show it matching HSL MA87 and beating MUMPS on
several metrics while maintaining lower peak RSS than CHOLMOD on sparse SPD
problems.

**Getting SPRAL to run**: the SSIDS CPU factor requires the OpenMP
cancellation feature, which GCC's libgomp keeps off by default. The solver
returns ``inform%flag = -53`` — undocumented numerically, but at
``print_level=10`` it prints ``SSIDS CPU code requires OMP cancellation to
be enabled``. Fix: export ``OMP_CANCELLATION=TRUE`` **before libgomp
initialises** (i.e. before numpy / scipy import). Setting it from Python
via ``os.environ`` post-import is a no-op because libgomp has already
cached the runtime state. Subprocess launchers need it in the env dict
passed to ``Popen``.

**Measured results on the SOLID185 modal pipeline** (4 threads, MKL
preload, posdef=true):

.. list-table::
   :widths: 15 15 15 15 15 15
   :header-rows: 1

   * - size
     - Pardiso wall
     - Pardiso peak
     - SPRAL wall
     - SPRAL peak
     - Δ peak vs Pardiso
   * - 80×80×2
     - 1.98 s
     - 476 MB
     - **1.88 s**
     - 752 MB
     - +276 MB
   * - 120×120×2
     - 5.25 s
     - 849 MB
     - **4.64 s**
     - 1450 MB
     - +601 MB
   * - 160×160×2
     - 9.37 s
     - 1402 MB
     - **8.46 s**
     - 2475 MB
     - +1073 MB
   * - 200×200×2
     - 15.19 s
     - **2138 MB**
     - **13.66 s**
     - 3694 MB
     - **+1556 MB**

**SPRAL beats Pardiso on wall time at every mesh size**, trading the speed
lead with MUMPS depending on size. But the peak-RSS cost is severe:
+276 MB at 80×80×2 widens to +1556 MB (+73%) at 200×200×2. SPRAL's
multifrontal factorisation allocates a much larger working set than
Pardiso's supernodal packing; the published memory advantages don't hold on
hex elasticity stiffness matrices.

So despite SPRAL being the last plausible memory candidate, it doesn't
deliver memory savings here either. It joins MUMPS as a **speed
alternative** — if the user values modal wall time and has RAM headroom,
SPRAL is 8–10% faster than Pardiso across sizes. If memory is binding, stay
on Pardiso.

Three-way speed comparison at 200×200×2:

.. list-table::
   :widths: 20 20 20 40
   :header-rows: 1

   * - solver
     - wall time
     - peak RSS
     - summary
   * - Pardiso
     - 15.19 s
     - 2138 MB
     - baseline, memory leader
   * - MUMPS
     - 12.46 s (−18%)
     - 2451 MB (+313 MB)
     - balanced speed / memory tradeoff
   * - SPRAL
     - 13.66 s (−10%)
     - 3694 MB (+1556 MB)
     - fastest at small-to-medium sizes, worst memory

Final recommendation unchanged: **Pardiso** stays the default. The two
viable alternatives are both speed wins at memory cost:

- **MUMPS** for a moderate speed boost (−15–20%) at moderate memory cost
  (+200–300 MB) — the best general-purpose alternative.
- **SPRAL** for maximum speed at small and medium sizes (−8–10%) at much
  higher memory cost (+300–1500 MB) — only for memory-abundant setups.

SPRAL integration adapter lives at
``mapdl-re/pyspral/src/pyspral/femorph_integration.py``.
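The environment-ordering constraint can be demonstrated directly: the variable has to be in the env dict handed to the child, because libgomp snapshots its environment at initialisation. A minimal sketch of the launcher side:

```python
import os
import subprocess
import sys

# OMP_CANCELLATION is read once, when libgomp initialises. Putting it in the
# child's environment before any numpy/scipy (and hence libgomp) import is
# the only reliable way to enable cancellation for SSIDS; mutating
# os.environ after import does nothing for the current process.
env = dict(os.environ, OMP_CANCELLATION="TRUE")
proc = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['OMP_CANCELLATION'])"],
    env=env, capture_output=True, text=True,
)
```

The same pattern applies to the benchmark runner: every subprocess it spawns inherits the flag through the env dict, so SSIDS never sees an uncancellable runtime.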
Drop-in via::

   _REGISTRY["spral"] = SpralSolver
   model.modal_solve(linear_solver="spral")

Requires ``OMP_CANCELLATION=TRUE`` in the environment and
``LD_PRELOAD=libmkl_rt.so.2`` to sidestep the NixOS OpenBLAS 0.3.30
cascade.

Direct MKL Pardiso with working OOC — the biggest memory win
------------------------------------------------------------

pypardiso silently swallows ``iparm(60)`` (the OOC mode selector). Call
``set_iparm(60, 2)`` through pypardiso and it warns ``2 is no input
iparm``; the readback after factorisation shows 0. The assumption that this
was an MKL limitation of the pip-wheel build turned out to be wrong — it's
pypardiso's Python layer filtering the iparm dict.

Agent-3's ``DirectMklPardisoSolver`` (in-tree at
:mod:`femorph_solver.solvers.linear._mkl_pardiso`, PR #126) calls
``pardiso_()`` via ctypes directly. Bypassing pypardiso unlocks the full
iparm array — including ``iparm(60)=2`` for an out-of-core factor. The
pip-wheel MKL's OOC implementation works as long as
``MKL_PARDISO_OOC_PATH`` points at a writable directory.

Design highlights of the in-tree solver:

* Reuses pypardiso's already-loaded MKL handle via
  ``pypardiso.PyPardisoSolver(mtype=11).libmkl`` — no duplicate ``dlopen``,
  same resolution heuristic.
* ``iparm(34)=1`` — 0-based CSR indices eliminate the ``+1`` index-rewrite
  copies pypardiso does on every solve (~65 MB saved per solve at
  150×150×2).
* ``iparm(23)=1`` + ``iparm(24)=2`` — "improved 2-level parallel"
  factorisation plus 2-level solve, roughly halving solve wall time vs the
  classical path.
* Phase 12 in one call (analyse + factorise).
* Phase -1 on ``__del__`` — the same MKL factor-leak fix as PR #107.

**OOC bug uncovered during this investigation**: on meshes at or above
160×160×2 (n_dof ≥ 231 840), ``ooc=True`` returned MKL error ``-2``
(insufficient memory) at phase=12.
Root cause: the ``iparm(23)=1`` + ``iparm(24)=2`` fast-parallel allocations
are incompatible with ``iparm(60)=2`` OOC on this MKL build — the
per-thread scratch sized for the whole frontal matrix is exactly what OOC
refuses to spill.

**Fix** (local branch ``fix/mkl-pardiso-ooc-disable-fast-parallel``): drop
iparm(23)/(24) back to the classical parallel path when OOC is requested.
Costs ~2× solve wall time in exchange for OOC actually working at scale.

With the patch applied:

.. list-table:: Agent-3 ``DirectMklPardisoSolver`` — in-core vs OOC
   :widths: 15 18 18 20 12 15
   :header-rows: 1

   * - size
     - in-core peak
     - OOC peak
     - Δ peak
     - in-core factor
     - OOC factor
   * - 80×80×2
     - 505 MB
     - 511 MB
     - +6 MB
     - 0.38 s
     - 0.67 s
   * - 120×120×2
     - 920 MB
     - 935 MB
     - +15 MB
     - 0.82 s
     - 1.26 s
   * - 160×160×2
     - 1526 MB
     - **1265 MB**
     - **−261 MB (−17%)**
     - 1.73 s
     - 2.49 s
   * - 200×200×2
     - **2371 MB**
     - **1604 MB**
     - **−767 MB (−32%)**
     - 2.50 s
     - 4.12 s

**This is the single biggest SPD-solver memory reduction measured across
the entire investigation.** The gap widens with problem size — at small
meshes OOC equals in-core because the factor fits in ``MAX_CORE_SIZE``; at
large meshes OOC streams the overflow to disk, capping peak RSS near
``MAX_CORE_SIZE`` + the K / M / eigensolve workspace. The residual is
identical (4.3e-11 on both at 200×200×2); factorisation is ~65% slower and
solves are 2–3× slower because disk IO sits in the critical path.

**Enablement**::

   # Required env vars — must be set before the first MKL call.
   export MKL_PARDISO_OOC_PATH=/tmp/mkl_pardiso_ooc
   export MKL_PARDISO_OOC_MAX_CORE_SIZE=500

   # In Python
   from femorph_solver.solvers.linear._mkl_pardiso import DirectMklPardisoSolver
   lu = DirectMklPardisoSolver(K_ff, assume_spd=True, ooc=True)

Or via the registry when the fix lands::

   model.modal_solve(linear_solver="mkl_direct_ooc")  # future registration

**Tradeoff context at 200×200×2**:

- Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor
- Pardiso (DirectMklPardiso, in-core): 2371 MB peak, 2.50 s factor
- **Pardiso (DirectMklPardiso, OOC)**: **1604 MB peak**, 4.12 s factor
- MUMPS: 2304 MB peak, 4.38 s factor
- SPRAL: 3694 MB peak, 13.66 s factor
- Eigen CG+IC0: 1091 MB peak at 160×160×2, 44× slower

Direct-MKL OOC delivers the largest peak-RSS reduction of any direct solver
tested while remaining competitive on wall time (~2× slower than in-core
Pardiso, still faster than MUMPS). This is the recommended
memory-constrained path for femorph-solver.

Iterative fallback — Eigen CG with IncompleteCholesky
-----------------------------------------------------

After exhausting direct solvers, the final angle was **preconditioned
conjugate gradient with incomplete Cholesky** (IC(0)) — pure iterative, no
factor to store. Memory is just the matrix, the IC preconditioner (roughly
the size of triu K), and a few CG basis vectors. If it converges in
reasonable time, this is the single lowest-memory option for SPD systems.

Measured on the K_ff-only pipeline (tol = 1e-6, max 100 000 iterations):

.. list-table::
   :widths: 15 12 14 14 15 14
   :header-rows: 1

   * - size
     - n_dof
     - CG wall
     - CG iters
     - CG peak
     - Pardiso peak
   * - 40×40×2
     - 14 760
     - 1.35 s
     - 1 468
     - **211 MB**
     - 261 MB (−50)
   * - 80×80×2
     - 58 320
     - 17.8 s
     - 3 368
     - **387 MB**
     - 504 MB (−117)
   * - 120×120×2
     - 130 680
     - 43.6 s
     - 3 325
     - **683 MB**
     - 930 MB (−247)
   * - 160×160×2
     - 231 840
     - 97.9 s
     - 4 242
     - **1091 MB**
     - 1538 MB (**−447**)

At 160×160×2 CG+IC(0) uses **29% less memory** than Pardiso.
The gap widens with problem size — this is the only candidate whose peak
actually shrinks relative to Pardiso in absolute terms as the mesh grows.

**The fatal catches**:

1. **44× slower** than Pardiso at 160×160×2 (97.9 s vs 2.23 s). The
   iteration count scales with :math:`\sqrt{\mathrm{cond}(K)}`, which for
   structured hex elasticity grows like :math:`h^{-1}`. Not scalable.
2. **Residual floor at 1e-6** — the IC(0) preconditioner isn't strong
   enough to drive CG below 1e-6 in any reasonable iteration budget on
   elasticity stiffness. ARPACK shift-invert at tolerance 1e-12 needs the
   inner solver at ~1e-10, so modal is out of reach; a 1e-6 residual is
   fine for static solves where the user accepts single-precision-ish
   accuracy.

**Use cases**:

- **Static solve on a severely memory-constrained machine**, where shaving
  29% off peak RSS matters more than a 40× speed hit and the user accepts
  1e-6 accuracy. Not the default — a niche override.
- **Not viable for modal** — the residual ceiling would bleed into ARPACK's
  eigenvalue accuracy.

This is genuinely the only memory-saving candidate solver we found (the
direct-MKL OOC path aside), and the tradeoff is steep. Exposed via
``pyeigensparse.CGICSolver``; bindings live at ``mapdl-re/pyeigensparse/``.

Direct MKL Pardiso wrapper — unlocking OOC
------------------------------------------

pypardiso silently swallows ``iparm(60)`` (the OOC mode selector) — calling
``set_iparm(60, 2)`` warns ``2 is no input iparm`` and the readback after
factorisation shows 0. We suspected an MKL limitation in the pip-wheel
build. It wasn't; it was pypardiso's Python layer. A 300-line wrapper
calling ``pardiso_()`` directly (no pypardiso) honours every iparm we set —
including ``iparm(60)=2`` (full out-of-core). The pip-wheel MKL's OOC works
as long as ``MKL_PARDISO_OOC_PATH`` points at a writable directory.

Package: ``mapdl-re/pymklpardiso/`` — direct ``pardiso_()`` wrapper via
nanobind.

**Measured on the SOLID185 K_ff-only bench** (4 threads,
MAX_CORE_SIZE=500 MB):

.. list-table::
   :widths: 15 18 18 18 15 15
   :header-rows: 1

   * - size
     - in-core peak
     - OOC peak
     - Δ peak
     - in-core factor
     - OOC factor
   * - 80×80×2
     - 487 MB
     - 487 MB
     - 0 (fits in core)
     - 0.30 s
     - 0.58 s
   * - 120×120×2
     - 906 MB
     - 911 MB
     - +5 MB
     - 0.74 s
     - 1.28 s
   * - 160×160×2
     - 1497 MB
     - **1214 MB**
     - **−283 MB (−19%)**
     - 1.41 s
     - 2.62 s
   * - 200×200×2
     - **2338 MB**
     - **1540 MB**
     - **−798 MB (−34%)**
     - 2.43 s
     - 4.24 s

**This is the biggest memory win of the investigation.** The gap grows with
problem size:

- Small meshes (<500 MB factor): OOC equals in-core — the core budget is
  big enough to hold everything, so MKL never spills.
- Large meshes (>500 MB factor): OOC streams the overflow to disk, capping
  peak RSS near ``MAX_CORE_SIZE`` + the K / M / eigensolve workspace.

Residual is identical (4.3e-11 on both at 200×200×2) — no accuracy cost.
Factor time is ~70% slower because disk IO is in the critical path; solve
time is ~6× slower for the same reason (each back-solve touches the on-disk
factor blocks).

**Tradeoff calculus at 200×200×2**:

- Pardiso (pypardiso, in-core): 2338 MB peak, 2.43 s factor.
- Pardiso (our direct wrapper, OOC): **1540 MB peak**, 4.24 s factor.
- MUMPS: 2304 MB peak, 4.38 s factor.
- SPRAL: 3694 MB peak, 13.66 s factor.

The direct-MKL OOC path beats every other solver tested on peak RSS while
remaining competitive on wall time.

**Integration**: ``pymklpardiso/src/pymklpardiso/femorph_integration.py``
exposes two adapter classes:

- ``MklPardisoOOCSolver`` — iparm(60)=2, registry name ``mklp_ooc``.
  Available only when ``MKL_PARDISO_OOC_PATH`` is set.
- ``MklPardisoInCoreSolver`` — iparm(60)=0, registry name ``mklp_direct``.
  Useful for A/B benches against pypardiso without OOC noise.
Users opt in via::

   from pymklpardiso.femorph_integration import MklPardisoOOCSolver

   _REGISTRY["mklp_ooc"] = MklPardisoOOCSolver
   # Set MKL_PARDISO_OOC_PATH + MKL_PARDISO_OOC_MAX_CORE_SIZE in env
   model.modal_solve(linear_solver="mklp_ooc")

The two required env variables mean ``available()`` returns False by
default — users explicitly opt in by exporting them.

**Note on the pypardiso vs direct-MKL behavioural difference**: we don't
yet know *exactly* what pypardiso does to iparm(60), only that the readback
differs. Worth filing an upstream issue with pypardiso — the current
behaviour silently downgrades OOC without telling the caller. Our wrapper
is the workaround.

Full candidate-search summary
-----------------------------

To close the loop on what was investigated:

.. list-table::
   :widths: 20 15 12 53
   :header-rows: 1

   * - Solver
     - Built?
     - Works?
     - Final position vs Pardiso on SOLID185 modal
   * - **MKL Pardiso**
     - N/A (shipped)
     - ✅
     - **memory leader**, reference
   * - **MUMPS**
     - ✅
     - ✅
     - speed alternative (−18%) at moderate memory cost (+313 MB)
   * - **SPRAL SSIDS**
     - ✅
     - ✅
     - fastest (−8 to −10%) but massive memory cost (+1556 MB at 200×200×2)
   * - **CHOLMOD supernodal**
     - ✅
     - ✅ (w/ MKL preload)
     - −33 MB on K_ff-only, 3× slower; flat on full pipeline
   * - **STRUMPACK (exact)**
     - ✅
     - ✅ (w/ MKL preload + getrf patch)
     - loses on both axes
   * - **STRUMPACK (HSS)**
     - ✅
     - ✅
     - worse memory + degraded accuracy on hex elasticity
   * - **Eigen SimplicialLDLT**
     - ✅
     - ✅
     - disqualified; scalar simplicial loses everywhere
   * - **Eigen CG + IC(0)**
     - ✅
     - ✅
     - **lowest memory (−447 MB at 160×160×2)**; 44× slower, 1e-6 residual
       ceiling
   * - CHOLMOD simplicial
     - ✅
     - ✅
     - slow and dense; no use case
   * - SuperLU 7.0.1 (NixOS)
     - ❌
     - —
     - LU (not SPD-exploiting); skip — won't beat Cholesky solvers
   * - PaStiX
     - ❌
     - —
     - not on NixOS; build-from-source deferred
   * - HSL MA87 / MA97
     - ❌
     - —
     - academic license not obtained
   * - Schenk academic PARDISO 7
     - ❌
     - —
     - fork distinct from Intel's; download + build deferred
   * - Ginkgo / cuDSS / Amesos2
     - ❌
     - —
     - not on NixOS; heavy build cost; deferred

The candidates left in "not attempted" are all multi-hour builds for a
speculative gain. HSL MA87 is the most promising (specifically designed for
low-memory SPD Cholesky and used alongside SPRAL by STFC) but requires
registering for the academic license. PaStiX is a plausible second option,
but its INRIA build system + Scotch + BLAS threading needs serious
packaging work on NixOS.

**Bottom line** after the full sweep:

- **``DirectMklPardisoSolver(ooc=True)`` is the memory winner** — −767 MB
  (−32%) at 200×200×2 with the iparm(23)/(24)-disable-when-OOC patch
  (``fix/mkl-pardiso-ooc-disable-fast-parallel`` branch). Identical
  accuracy, ~2× longer solve. This is the recommended path when peak RSS
  is binding.
- **Keep pypardiso's ``Pardiso`` as the in-core default** — best all-around
  speed/memory for users who don't enable OOC.
- **Expose MUMPS as ``linear_solver="mumps"``** — a clean 15–20% modal
  speedup at +300 MB peak RSS cost. Good for speed-first, RAM-rich users.
- **Expose SPRAL as ``linear_solver="spral"``** — fastest direct solver on
  small/medium sizes, at +1.5 GB memory cost at 200×200×2.
- **Expose Eigen CG+IC0 as ``linear_solver="cg_ic0"``** — an alternate
  memory-reduction path for static only; 1e-6 residual ceiling, 44× slower
  than Pardiso. Worth having for memory-constrained static workflows that
  accept single-precision-ish accuracy.
- **STRUMPACK / CHOLMOD simplicial / Eigen LDLT** — don't land; they lose
  on both axes of the full-pipeline bench.
- **HSL MA87 + PaStiX** — follow-up candidates if we hit the MKL OOC floor
  and need to push further. Both require external source access (HSL has an
  academic-license gate; PaStiX's INRIA GitLab host wasn't in the sandbox's
  trusted-SCM list and would need explicit permission to clone).
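For readers who want to prototype the ``cg_ic0``-style fallback without the Eigen bindings, the same structure can be sketched with scipy, using ``spilu`` as a stand-in incomplete factorisation since scipy ships no IC(0). Illustrative only; this is not the ``pyeigensparse.CGICSolver`` implementation:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def cg_with_incomplete_factor(K, b):
    # Memory profile of the iterative fallback: the matrix, one incomplete
    # factorisation (here ILU as an IC(0) stand-in), and a few CG vectors.
    ilu = spla.spilu(K.tocsc())
    M = spla.LinearOperator(K.shape, matvec=ilu.solve)
    x, info = spla.cg(K, b, M=M)
    if info != 0:
        raise RuntimeError(f"CG did not converge (info={info})")
    return x
```

On a well-conditioned test matrix this converges in a handful of iterations; on a large elasticity ``K_ff`` the iteration count grows with the condition number, which is exactly the scaling limit described in the CG section above.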
Supporting artefacts
--------------------

The raw bench data + scripts are under ``mapdl-re/`` (local-only):

- ``pycholmod/bench_vs_pardiso.py`` — cross-solver K_ff-only bench
- ``pycholmod/bench_strumpack_mumps.py`` — STRUMPACK + MUMPS additions
- ``pycholmod/bench_strumpack_hss.py`` — HSS tolerance sweep
- ``pymumps/bench_integrated_static.py`` — full static pipeline
- ``pymumps/bench_integrated_modal.py`` — full modal pipeline
- ``pymumps/test_integrated_modal.py`` — drop-in parity verification
- ``pymumps/src/pymumps/femorph_integration.py`` — the :class:`MumpsSolver`
  adapter itself
- ``SOLVER_COMPARISON.md`` at the workspace root — the summary this page
  was distilled from