Out-of-core (OOC) Pardiso — limits and tuning#
DirectMklPardisoSolver(A, ooc=True) switches MKL’s sparse
Cholesky / LU factorisation onto its disk-backed path
(iparm(60) = 2): the factor and its scratch are spilled to
files on disk and streamed back in chunks at solve time. This
keeps problems solvable on hosts where the in-core factor would
exhaust RAM, at the cost of roughly 3.5× the wall time and the
requirement that the MKL OOC knobs be configured correctly.
This page documents what we’ve tested, where MKL stops cooperating, and the error-code → fix table.
Required environment#
MKL reads two environment variables at Pardiso construction time.
Neither has a runtime-override equivalent inside the library, so
they must be set before femorph_solver imports anything
that touches MKL — in practice that means before any
import femorph_solver line that resolves to the C extension.
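In practice that means exporting both variables in the shell, or assigning them at the very top of the entry script. A minimal sketch; the scratch directory and budget below are placeholders, not recommended values:

```python
import os

# These must be set before the first import that loads the MKL-backed
# C extension; MKL will not see changes made after that point.
os.environ["MKL_PARDISO_OOC_PATH"] = "/scratch/pardiso_ooc"  # placeholder path
os.environ["MKL_PARDISO_OOC_MAX_CORE_SIZE"] = "8192"         # in-memory budget, MB

# Only now is it safe to import anything that touches MKL:
# import femorph_solver
```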
- `MKL_PARDISO_OOC_PATH`: Directory (or path prefix) where MKL writes the factor spill files. Must be writable. Each factor creates files named `pardiso_ooc_*` in that directory. Default: `.` (the current working directory), which is rarely what you want.
- `MKL_PARDISO_OOC_MAX_CORE_SIZE`: In-memory budget (MB) for the current frontal block. Below MKL’s floor (~500 MB in 2024+ builds) any non-trivial 3D front returns error -9. A safe heuristic is half the estimated factor size; `perf/bench_ooc_vs_incore.py` auto-picks this from the in-core pass’s reported `factor_nnz`.
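The “half the estimated factor size” heuristic is easy to reproduce. A minimal sketch, assuming the same `factor_nnz × 12`-byte factor estimate the pre-flight uses and MKL’s ~500 MB floor; the function name is ours, not the bench’s:

```python
def auto_max_core_mb(factor_nnz: int, floor_mb: int = 512) -> int:
    """Half the estimated in-core factor size, clamped above MKL's floor.

    Assumes the factor is estimated at factor_nnz * 12 bytes, as the
    pre-flight check on this page does.
    """
    est_factor_mb = factor_nnz * 12 / 2**20  # bytes -> MiB
    return max(floor_mb, round(est_factor_mb / 2))
```

Feeding it the in-core pass’s reported `factor_nnz` lands in the same few-GB range as the MAX_CORE auto column in the table below.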
A third variable, MKL_PARDISO_OOC_KEEP_FILES, is honoured
by MKL but not used by DirectMklPardisoSolver — the solver
always unlinks its spill files on teardown (phase -1).
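Because teardown is supposed to leave the OOC directory clean, a cheap post-run sanity check is to look for surviving spill files (the helper name is ours):

```python
import glob
import os

def leftover_spill_files(ooc_path: str) -> list:
    """Return any pardiso_ooc_* files still present in the OOC directory.

    After DirectMklPardisoSolver teardown (phase -1) this should be empty;
    anything left behind usually points at a factorisation that crashed
    before cleanup ran.
    """
    return sorted(glob.glob(os.path.join(ooc_path, "pardiso_ooc_*")))
```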
Error-code table#
DirectMklPardisoSolver maps every documented negative MKL
Pardiso return code to a human-readable hint before raising
RuntimeError. The same hint text appears in
_MKL_ERROR_MESSAGES
for callers that want to pattern-match programmatically.
| Code | Condition | Usual fix |
|---|---|---|
| -1 | Input inconsistent. | Check the CSR structure and the solver arguments. |
| -2 | Not enough memory for the in-core factor. | Re-run with `ooc=True`. |
| -3 | Reordering problem. | Check METIS availability; fall back to the minimum-degree ordering (`iparm[1] = 0`). |
| -4 | Zero pivot during numerical factorisation. | The matrix has a zero (or machine-zero) pivot at the equation MKL reports in `iparm[29]`; check constraints and boundary conditions. |
| -8 | 32-bit integer overflow. | The matrix CSR exceeds int32 indexing. Not resolvable with Pardiso’s current binding; switch to a 64-bit-integer backend (MUMPS int64 build) or reduce the problem. |
| -9 | Not enough memory for OOC. | Raise `MKL_PARDISO_OOC_MAX_CORE_SIZE`. |
| -10 | Can’t open OOC files. | Check that `MKL_PARDISO_OOC_PATH` exists and is writable. |
| -11 | Read/write error on OOC files. | Disk out of space, filesystem errored, or permissions changed mid-factor. The pre-flight check verifies ~20 × `A.nnz` bytes of free space before factorising. |
Validated scales (2026-04-24)#
The perf/bench_ooc_vs_incore.py bench has verified the
OOC path at the following HEX20 plate sizes (4-thread MKL,
P-core pinned, fresh subprocess per size so ru_maxrss is the
true watermark):
| Mesh | n_dof | in-core wall | in-core peak | OOC wall | OOC peak | MAX_CORE auto |
|---|---|---|---|---|---|---|
| 192×192×2 | 1 221 120 | 19.4 s | 15.25 GB | 67.9 s | 11.96 GB | 5 312 MB |
| 224×224×2 | 1 660 716 | 27.8 s | 21.41 GB | 97.4 s | 16.72 GB | 7 781 MB |
| 256×256×2 | 2 166 528 | 36.6 s | 28.18 GB | 128.9 s | 21.99 GB | 10 397 MB |
Every size in that table worked first-try with the auto-sized
MAX_CORE_SIZE; the retry-on-`-9` doubling never fired. The
-22 % peak / +3.5× wall trade-off is consistent across the
band; there is no “sweet spot” where OOC gets cheaper.
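The consistency claim can be checked directly from the table; recomputing the ratios (values copied from the table above):

```python
# (mesh, in-core wall s, in-core peak GB, OOC wall s, OOC peak GB)
rows = [
    ("192x192x2", 19.4, 15.25, 67.9, 11.96),
    ("224x224x2", 27.8, 21.41, 97.4, 16.72),
    ("256x256x2", 36.6, 28.18, 128.9, 21.99),
]
for mesh, wall_in, peak_in, wall_ooc, peak_ooc in rows:
    # Wall-time multiplier and peak-RSS delta of OOC vs in-core.
    print(f"{mesh}: wall x{wall_ooc / wall_in:.2f}, "
          f"peak {100 * (peak_ooc / peak_in - 1):+.0f}%")
```

All three rows come out within a couple of percent of ×3.5 wall and -22 % peak.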
Known limitations#
No bit-exact repeatability across MKL versions. The OOC path’s tile-size heuristics have shifted between oneMKL 2024.1, 2024.2 and 2025.3. Wall-time deltas of ±10 % across a minor MKL bump are normal; frequency results stay identical to the in-core factor at every version we’ve tested.
Can’t combine with mixed precision. iparm[27] = 1
(single-precision factor) and iparm[59] = 2 (OOC) both
claim the in-core working buffer; MKL returns -9
consistently when both are set. ooc=True takes precedence
and leaves mixed precision off — mixed precision doesn’t
actually halve our factor on this MKL build anyway (the flag is
silently no-op’d; iparm[27] reads back 0 after factorize).
Can’t combine with the “improved two-level” parallel factor /
solve. iparm[23] = 1 + iparm[24] = 2 allocate
per-thread scratch the OOC path refuses to spill. MKL returns
-2 (“not enough memory”) under that combination on any
front larger than ~1 k DOFs. When ooc=True the solver
drops to the classical parallel path (iparm[23] = 0,
iparm[24] = 0); losing the two-level combination alone accounts
for roughly 2× of the consistent +3.5× wall-time slowdown.
Factor spill size is not `factor_nnz × 12`. The spill
file contains the factor plus analysis scratch plus recycled
frontal tiles. On our 192×192 to 256×256 measurements the spill
was ~3× the `factor_nnz × 12` estimate the pre-flight uses. The
pre-flight’s safety factor (`A.nnz × 20`) is tuned to
accommodate that ratio for 3D SPD meshes; for 2D meshes or
highly loaded unsymmetric systems it may over- or under-shoot.
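The pre-flight’s free-space side can be reproduced in a few lines. A sketch, assuming the × 20 safety factor is bytes per structural nonzero; the function name is ours:

```python
import shutil

def check_ooc_disk(nnz: int, ooc_path: str, safety: int = 20) -> None:
    """Fail early if the OOC directory lacks ~safety * nnz bytes free.

    Mirrors the pre-flight described on this page: the real spill can run
    ~3x the factor_nnz * 12 estimate, so the check is deliberately padded.
    """
    need = safety * nnz
    free = shutil.disk_usage(ooc_path).free
    if free < need:
        raise RuntimeError(
            f"OOC spill needs ~{need / 2**30:.1f} GiB free under "
            f"{ooc_path!r}; only {free / 2**30:.1f} GiB available"
        )
```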
Further work#
Every item below is stress-test work that the landed
DirectMklPardisoSolver(ooc=True) path still needs; they are
called out here so this page stays a useful TODO anchor rather
than a static description:
- Push `perf/bench_ooc_vs_incore.py` past 3 M DOFs. Our 128 GB box can still fit 256×256×2 HEX20 in-core, so we haven’t yet hit a size where OOC is the only option. Once we cross that boundary, the wall-time cost of OOC versus the alternative (“crash with -2”) stops being a trade-off and becomes the whole answer.
- Validate against a non-MKL-default locale (Turkish, German). MKL’s env-var parser has historically had issues with non-ASCII filesystem paths.
- Measure spill-file growth vs `factor_nnz` empirically at multiple sizes and refit the pre-flight’s `× 20` safety factor if needed.