.. _ooc-limits:

Out-of-core (OOC) Pardiso — limits and tuning
=============================================

``DirectMklPardisoSolver(A, ooc=True)`` switches MKL's sparse Cholesky / LU
factorisation onto its disk-backed path (``iparm(60) = 2``) — the factor and
its scratch are spilled to a scratch file and streamed back in chunks at solve
time. This keeps problems solvable on hosts where the in-core factor would
exhaust RAM, at the cost of **~3.5× wall time** and the requirement that the
MKL OOC knobs be configured correctly. This page documents what we've tested,
where MKL stops cooperating, and the error-code → fix table.

.. contents::
   :local:
   :depth: 2

Required environment
--------------------

MKL reads two environment variables at Pardiso construction time. Neither has
a runtime-override equivalent inside the library, so they must be set
**before** ``femorph_solver`` imports anything that touches MKL — in practice
that means before any ``import femorph_solver`` line that resolves to the C
extension.

``MKL_PARDISO_OOC_PATH``
   Directory (or path prefix) where MKL writes the factor spill files. Must
   be writable. Each factor creates files named ``pardiso_ooc_*`` in that
   directory. The default is ``.`` — the current working directory — which is
   rarely what you want.

``MKL_PARDISO_OOC_MAX_CORE_SIZE``
   In-memory budget (MB) for the current frontal block. Below MKL's floor
   (~500 MB in 2024+ builds) any non-trivial 3D front returns error ``-9``. A
   safe heuristic is half the estimated factor size;
   ``perf/bench_ooc_vs_incore.py`` auto-picks this from the in-core pass's
   reported ``factor_nnz``.

A third variable, ``MKL_PARDISO_OOC_KEEP_FILES``, is honoured by MKL but not
used by ``DirectMklPardisoSolver`` — the solver always unlinks its spill
files on teardown (phase ``-1``).

Error-code table
----------------

``DirectMklPardisoSolver`` maps every documented MKL Pardiso negative return
code to a human-readable hint before raising ``RuntimeError``.
The same hint text appears in
:data:`~femorph_solver.solvers.linear._mkl_pardiso._MKL_ERROR_MESSAGES` for
callers that want to pattern-match programmatically.

.. list-table::
   :widths: 8 30 40
   :header-rows: 1

   * - Code
     - Condition
     - Usual fix
   * - ``-1``
     - Input inconsistent.
     - Check ``indptr`` / ``indices`` dtype (must be int32), sorted indices,
       and the CSR shape.
   * - ``-2``
     - Not enough memory for the in-core factor.
     - Re-run with ``ooc=True`` or on a host with more RAM.
   * - ``-3``
     - Reordering problem.
     - Check METIS availability; ``iparm[1]`` defaults to METIS, but a
       corrupt install can surface here.
   * - ``-4``
     - Zero pivot during numerical factorisation.
     - The matrix has a zero (or machine-zero) pivot at ``iparm[19] - 1``.
       Most commonly a BC misspecification that leaves a row with no
       stiffness; inspect the row index MKL reports.
   * - ``-8``
     - 32-bit integer overflow.
     - The CSR matrix exceeds int32 indexing. Not resolvable with Pardiso's
       current binding; switch to a 64-bit-integer backend (MUMPS int64
       build) or reduce the problem.
   * - ``-9``
     - Not enough memory for OOC.
     - Raise ``MKL_PARDISO_OOC_MAX_CORE_SIZE``. ``bench_ooc_vs_incore``
       doubles the value on this error, up to a 32 GB ceiling.
   * - ``-10``
     - Can't open OOC files.
     - ``MKL_PARDISO_OOC_PATH`` doesn't exist or isn't writable. The
       pre-flight check in ``DirectMklPardisoSolver`` catches this before
       MKL does and raises a descriptive ``RuntimeError`` — if you see
       ``-10`` directly, the pre-flight was bypassed.
   * - ``-11``
     - Read/write error on OOC files.
     - Disk out of space, a filesystem error, or permissions changed
       mid-factorisation. The pre-flight check verifies ~``20 × A.nnz × 12``
       bytes free at construction time; deltas between estimate and actual
       can still trip this on very fragmented filesystems.
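The table above can drive a simple retry policy. The sketch below is a
hypothetical helper, not part of ``femorph_solver``: the ``(MKL error -9)``
message format and the ``classify_mkl_error`` name are assumptions, and the
actions simply mirror the "Usual fix" column.

.. code-block:: python

   import re

   # Hypothetical mapping from the MKL codes in the table above to a coarse
   # next action. Only the OOC-recoverable codes get a retry hint.
   _RETRY_ACTION = {
       -2: "retry with ooc=True",
       -9: "raise MKL_PARDISO_OOC_MAX_CORE_SIZE and retry",
       -10: "fix MKL_PARDISO_OOC_PATH (existence/permissions) and retry",
       -11: "free disk space under MKL_PARDISO_OOC_PATH and retry",
   }


   def classify_mkl_error(message: str) -> str:
       """Pull a negative MKL code out of an error message and return a hint.

       Assumes the code appears literally in the text, e.g. "(MKL error -9)";
       anything else is treated as not retryable.
       """
       m = re.search(r"-\d+", message)
       code = int(m.group()) if m else None
       return _RETRY_ACTION.get(code, "not retryable")

Matching on :data:`~femorph_solver.solvers.linear._mkl_pardiso._MKL_ERROR_MESSAGES`
directly is the sturdier option when you control the call site; the regex
form is only useful when all you have is the raised message.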
Validated scales (2026-04-24)
-----------------------------

The :file:`perf/bench_ooc_vs_incore.py` bench has verified the OOC path at
the following HEX20 plate sizes (4-thread MKL, P-core pinned, fresh
subprocess per size so ``ru_maxrss`` is the true watermark):

.. list-table::
   :widths: 14 12 16 16 14 14 14
   :header-rows: 1

   * - Mesh
     - n_dof
     - In-core wall
     - In-core peak
     - OOC wall
     - OOC peak
     - ``MAX_CORE`` auto
   * - 192×192×2
     - 1 221 120
     - 19.4 s
     - 15.25 GB
     - 67.9 s
     - 11.96 GB
     - 5 312 MB
   * - 224×224×2
     - 1 660 716
     - 27.8 s
     - 21.41 GB
     - 97.4 s
     - 16.72 GB
     - 7 781 MB
   * - 256×256×2
     - 2 166 528
     - 36.6 s
     - 28.18 GB
     - 128.9 s
     - 21.99 GB
     - 10 397 MB

Every size in that table worked first-try with the auto-sized
``MAX_CORE_SIZE``; no retry-on-``-9`` doubling fired. The ``-22 % peak /
+3.5× wall`` trade-off is consistent across the band — there is no "sweet
spot" where OOC gets cheaper.

Known limitations
-----------------

**No bit-exact repeatability across MKL versions.** The OOC path's tile-size
heuristics have shifted between oneMKL 2024.1, 2024.2 and 2025.3. Wall-time
deltas of ±10 % across a minor MKL bump are normal; frequency results stay
identical to the in-core factor at every version we've tested.

**Can't combine with mixed precision.** ``iparm[27] = 1`` (single-precision
factor) and ``iparm[59] = 2`` (OOC) both claim the in-core working buffer;
MKL returns ``-9`` consistently when both are set. ``ooc=True`` takes
precedence and leaves mixed precision off — mixed precision doesn't actually
halve our factor on this MKL build anyway (the flag is silently no-op'd;
``iparm[27]`` reads back 0 after ``factorize``).

**Can't combine with the "improved two-level" parallel factor / solve.**
``iparm[23] = 1`` + ``iparm[24] = 2`` allocate per-thread scratch that the
OOC path refuses to spill. MKL returns ``-2`` ("not enough memory") under
that combination on any front larger than ~1 k DOFs.
When ``ooc=True`` the solver drops to the classical parallel path
(``iparm[23] = 0``, ``iparm[24] = 0``) — this is what gives the consistent
``+3.5×`` wall-time trade; losing that combination alone is worth ~2× per
solve.

**Factor spill size is not factor_nnz × 12.** The spill file contains the
factor, analysis scratch, and recycled frontal tiles. On our 192²–256²
measurements the spill was ~3× the ``factor_nnz × 12`` estimate the
pre-flight uses. The pre-flight's safety factor (``A.nnz × 20``) is tuned to
accommodate that ratio for 3D SPD meshes; for 2D meshes or highly-loaded
unsymmetric systems it may over- or under-shoot.

Further work
------------

Every item below is stress-test work that the landed
``DirectMklPardisoSolver(ooc=True)`` path needs next — called out here so
this page stays a good TODO anchor rather than a static description:

1. Push ``perf/bench_ooc_vs_incore.py`` past 3 M DOFs — our 128 GB box can
   still fit 256×256×2 HEX20 in-core, so we haven't yet observed a size
   where OOC is the only option. Once we cross that boundary, the wall-time
   cost of OOC versus the alternative ("crash with ``-2``") stops being a
   trade-off and becomes the whole answer.
2. Validate against a non-default locale (Turkish, German) — MKL's env-var
   parsing has historically had issues with non-ASCII filesystem paths.
3. Measure spill-file growth vs ``factor_nnz`` empirically at multiple sizes
   and refit the pre-flight's ``× 20`` safety factor if needed.
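For reference, the ``× 20`` safety factor from item 3 enters the pre-flight
roughly as sketched below. This is a minimal sketch, not the solver's code:
the function names are illustrative (the real check lives inside
``DirectMklPardisoSolver``), and the 12-bytes-per-entry figure assumes one
8-byte float64 value plus one 4-byte int32 index per stored nonzero.

.. code-block:: python

   import os
   import shutil


   def ooc_spill_budget(nnz: int, safety: int = 20) -> int:
       """Bytes of free space the pre-flight demands: ~safety * nnz * 12.

       The x20 safety factor absorbs the ~3x spill-vs-estimate ratio
       observed on 3D SPD meshes, with headroom on top.
       """
       return safety * nnz * 12


   def preflight_ok(nnz: int, ooc_path: str) -> bool:
       """Mimic the construct-time check: writable path, enough free space."""
       if not (os.path.isdir(ooc_path) and os.access(ooc_path, os.W_OK)):
           return False  # would otherwise surface later as MKL error -10
       return shutil.disk_usage(ooc_path).free >= ooc_spill_budget(nnz)

A nonexistent or read-only ``ooc_path`` fails the check before MKL ever
reports ``-10``; a too-small disk fails it before a mid-factorisation
``-11``.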