.. _ooc-limits:

Out-of-core (OOC) Pardiso — limits and tuning
=============================================

``DirectMklPardisoSolver(A, ooc=True)`` switches MKL's sparse Cholesky / LU
factorisation onto its disk-backed path (``iparm(60) = 2``) — the factor and
its scratch are spilled to a scratch file and streamed back in chunks at solve
time. This keeps problems solvable on hosts where the in-core factor would
exhaust RAM, at the cost of **~3.5× wall time** and the requirement that the
MKL OOC knobs be configured correctly. This page documents what we've tested,
where MKL stops cooperating, and the error-code → fix table.

.. contents::
   :local:
   :depth: 2

Required environment
--------------------

MKL reads two environment variables at Pardiso construction time. Neither has
a runtime-override equivalent inside the library, so they must be set
**before** ``femorph_solver`` imports anything that touches MKL — in practice
that means before any ``import femorph_solver`` line that resolves to the C
extension.

``MKL_PARDISO_OOC_PATH``
   Directory (or path prefix) where MKL writes the factor spill files. Must
   be writable. Each factor creates files named ``pardiso_ooc_*`` in that
   directory. The default is ``.`` — the current working directory — which is
   rarely what you want.

``MKL_PARDISO_OOC_MAX_CORE_SIZE``
   In-memory budget (MB) for the current frontal block. Below MKL's floor
   (~500 MB in 2024+ builds) any non-trivial 3D front returns error ``-9``. A
   safe heuristic is half the estimated factor size;
   ``perf/bench_ooc_vs_incore.py`` auto-picks this from the in-core pass's
   reported ``factor_nnz``.

A third variable, ``MKL_PARDISO_OOC_KEEP_FILES``, is honoured by MKL but not
used by ``DirectMklPardisoSolver`` — the solver always unlinks its spill
files on teardown (phase ``-1``).

Error-code table
----------------

``DirectMklPardisoSolver`` maps every documented MKL Pardiso negative return
code to a human-readable hint before raising ``RuntimeError``.
The same hint text appears in
:data:`~femorph_solver.solvers.linear._mkl_pardiso._MKL_ERROR_MESSAGES` for
callers that want to pattern-match programmatically.

.. list-table::
   :widths: 8 30 40
   :header-rows: 1

   * - Code
     - Condition
     - Usual fix
   * - ``-1``
     - Input inconsistent.
     - Check ``indptr`` / ``indices`` dtype (must be int32), sorted indices,
       and the CSR shape.
   * - ``-2``
     - Not enough memory for the in-core factor.
     - Re-run with ``ooc=True`` or on a host with more RAM.
   * - ``-3``
     - Reordering problem.
     - Check METIS availability; ``iparm[1]`` defaults to METIS, but a
       corrupt install can surface here.
   * - ``-4``
     - Zero pivot during numerical factorisation.
     - The matrix has a zero (or machine-zero) pivot at ``iparm[19] - 1``.
       Most commonly a BC misspecification that leaves a row with no
       stiffness; inspect the row index MKL reports.
   * - ``-8``
     - 32-bit integer overflow.
     - The CSR matrix exceeds int32 indexing. Not resolvable with Pardiso's
       current binding; switch to a 64-bit-integer backend (MUMPS int64
       build) or reduce the problem.
   * - ``-9``
     - Not enough memory for OOC.
     - Raise ``MKL_PARDISO_OOC_MAX_CORE_SIZE``. ``bench_ooc_vs_incore``
       doubles the value on this error, up to a 32 GB ceiling.
   * - ``-10``
     - Can't open OOC files.
     - ``MKL_PARDISO_OOC_PATH`` doesn't exist or isn't writable. The
       pre-flight check in ``DirectMklPardisoSolver`` catches this before
       MKL does and raises a descriptive ``RuntimeError`` — if you see
       ``-10`` directly, the pre-flight was bypassed.
   * - ``-11``
     - Read/write error on OOC files.
     - Disk out of space, a filesystem error, or permissions changed
       mid-factorisation. The pre-flight check verifies ~``20 × A.nnz × 12``
       bytes free at construction time; deltas between estimate and actual
       can still trip this on very fragmented filesystems.
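The table above can drive a simple retry policy. The sketch below is a
hypothetical helper, not part of ``femorph_solver``: the ``(MKL error -9)``
message format and the ``classify_mkl_error`` name are assumptions, and the
actions simply mirror the "Usual fix" column.

.. code-block:: python

   import re

   # Hypothetical mapping from the MKL codes in the table above to a coarse
   # next action. Only the OOC-recoverable codes get a retry hint.
   _RETRY_ACTION = {
       -2: "retry with ooc=True",
       -9: "raise MKL_PARDISO_OOC_MAX_CORE_SIZE and retry",
       -10: "fix MKL_PARDISO_OOC_PATH (existence/permissions) and retry",
       -11: "free disk space under MKL_PARDISO_OOC_PATH and retry",
   }


   def classify_mkl_error(message: str) -> str:
       """Pull a negative MKL code out of an error message and return a hint.

       Assumes the code appears literally in the text, e.g. "(MKL error -9)";
       anything else is treated as not retryable.
       """
       m = re.search(r"-\d+", message)
       code = int(m.group()) if m else None
       return _RETRY_ACTION.get(code, "not retryable")

Matching on :data:`~femorph_solver.solvers.linear._mkl_pardiso._MKL_ERROR_MESSAGES`
directly is the sturdier option when you control the call site; the regex
form is only useful when all you have is the raised message.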
Validated scales (2026-04-24)
-----------------------------

The :file:`perf/bench_ooc_vs_incore.py` bench has verified the OOC path at
the following HEX20 plate sizes (4-thread MKL, P-core pinned, fresh
subprocess per size so ``ru_maxrss`` is the true watermark):

.. list-table::
   :widths: 14 12 16 16 14 14 14
   :header-rows: 1

   * - Mesh
     - n_dof
     - In-core wall
     - In-core peak
     - OOC wall
     - OOC peak
     - ``MAX_CORE`` auto
   * - 192×192×2
     - 1 221 120
     - 19.4 s
     - 15.25 GB
     - 67.9 s
     - 11.96 GB
     - 5 312 MB
   * - 224×224×2
     - 1 660 716
     - 27.8 s
     - 21.41 GB
     - 97.4 s
     - 16.72 GB
     - 7 781 MB
   * - 256×256×2
     - 2 166 528
     - 36.6 s
     - 28.18 GB
     - 128.9 s
     - 21.99 GB
     - 10 397 MB

Every size in that table worked first-try with the auto-sized
``MAX_CORE_SIZE``; no retry-on-``-9`` doubling fired. The ``-22 % peak /
+3.5× wall`` trade-off is consistent across the band — there is no "sweet
spot" where OOC gets cheaper.

Known limitations
-----------------

**No bit-exact repeatability across MKL versions.** The OOC path's tile-size
heuristics have shifted between oneMKL 2024.1, 2024.2 and 2025.3. Wall-time
deltas of ±10 % across a minor MKL bump are normal; frequency results stay
identical to the in-core factor at every version we've tested.

**Can't combine with mixed precision.** ``iparm[27] = 1`` (single-precision
factor) and ``iparm[59] = 2`` (OOC) both claim the in-core working buffer;
MKL returns ``-9`` consistently when both are set. ``ooc=True`` takes
precedence and leaves mixed precision off — mixed precision doesn't actually
halve our factor on this MKL build anyway (the flag is silently no-op'd;
``iparm[27]`` reads back 0 after ``factorize``).

**Can't combine with the "improved two-level" parallel factor / solve.**
``iparm[23] = 1`` + ``iparm[24] = 2`` allocate per-thread scratch that the
OOC path refuses to spill. MKL returns ``-2`` ("not enough memory") under
that combination on any front larger than ~1 k DOFs.
When ``ooc=True`` the solver drops to the classical parallel path
(``iparm[23] = 0``, ``iparm[24] = 0``) — this is what gives the consistent
``+3.5×`` wall-time trade; losing that combination alone is worth ~2× per
solve.

**Factor spill size is not factor_nnz × 12.** The spill file contains the
factor, analysis scratch, and recycled frontal tiles. On our 192²–256²
measurements the spill was ~3× the ``factor_nnz × 12`` estimate the
pre-flight uses. The pre-flight's safety factor (``A.nnz × 20``) is tuned to
accommodate that ratio for 3D SPD meshes; for 2D meshes or highly-loaded
unsymmetric systems it may over- or under-shoot.

Further work
------------

Every item below is stress-test work that the landed
``DirectMklPardisoSolver(ooc=True)`` path needs next — called out here so
this page stays a good TODO anchor rather than a static description:

1. Push ``perf/bench_ooc_vs_incore.py`` past 3 M DOFs — our 128 GB box can
   still fit 256×256×2 HEX20 in-core, so we haven't yet observed a size
   where OOC is the only option. Once we cross that boundary, the wall-time
   cost of OOC versus the alternative ("crash with ``-2``") stops being a
   trade-off and becomes the whole answer.
2. Validate against a non-default locale (Turkish, German) — MKL's env-var
   parsing has historically had issues with non-ASCII filesystem paths.
3. Measure spill-file growth vs ``factor_nnz`` empirically at multiple sizes
   and refit the pre-flight's ``× 20`` safety factor if needed.
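For reference, the ``× 20`` safety factor from item 3 enters the pre-flight
roughly as sketched below. This is a minimal sketch, not the solver's code:
the function names are illustrative (the real check lives inside
``DirectMklPardisoSolver``), and the 12-bytes-per-entry figure assumes one
8-byte float64 value plus one 4-byte int32 index per stored nonzero.

.. code-block:: python

   import os
   import shutil


   def ooc_spill_budget(nnz: int, safety: int = 20) -> int:
       """Bytes of free space the pre-flight demands: ~safety * nnz * 12.

       The x20 safety factor absorbs the ~3x spill-vs-estimate ratio
       observed on 3D SPD meshes, with headroom on top.
       """
       return safety * nnz * 12


   def preflight_ok(nnz: int, ooc_path: str) -> bool:
       """Mimic the construct-time check: writable path, enough free space."""
       if not (os.path.isdir(ooc_path) and os.access(ooc_path, os.W_OK)):
           return False  # would otherwise surface later as MKL error -10
       return shutil.disk_usage(ooc_path).free >= ooc_spill_budget(nnz)

A nonexistent or read-only ``ooc_path`` fails the check before MKL ever
reports ``-10``; a too-small disk fails it before a mid-factorisation
``-11``.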