# PBS rotor — solver head-to-head at 8 threads
This page records the cross-solver wall-time and memory comparison on the PBS research rotor — a real industrial workload that drove the threading and linear-backend defaults documented on Solver defaults and threading.
The geometry is proprietary and not republished, but the mesh statistics, boundary conditions, material, and result table below are the artefacts the docs intentionally publish so users can decide which backend to install.
## Workload
Single sector of the 20-sector PBS rotor: 164,922 nodes, 41,084 SOLID186 elements, and 494,649 free DOFs after the bore clamp. The 39 bore nodes (cylindrical radius \(r < 0.10\)) are pinned in all three translations to remove rigid-body modes and make the system positive-definite.
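In code, the clamp is just a cylindrical-radius test on the node coordinates. A minimal sketch, assuming a node-major DOF ordering; the function name is illustrative, not the library's API:

```python
import numpy as np

def bore_clamp_dofs(coords, r_max=0.10, ndof_per_node=3):
    # coords: (n_nodes, 3) node positions; the cylindrical radius is
    # measured about the rotor (z) axis.
    r = np.hypot(coords[:, 0], coords[:, 1])
    bore_nodes = np.flatnonzero(r < r_max)
    # Node-major DOF ordering assumed: node i owns DOFs 3i, 3i+1, 3i+2.
    dofs = (bore_nodes[:, None] * ndof_per_node
            + np.arange(ndof_per_node)).ravel()
    return bore_nodes, dofs
```

On the rotor mesh this selects the 39 bore nodes, i.e. the 117 clamped DOFs in the table below.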
| field | value |
|---|---|
| deck | |
| element type | SOLID186 (HEX20 + degenerate TET10/WEDGE15/PYR13) |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| clamped DOFs | 117 (39 bore nodes × 3 translations) |
| free DOFs | 494,649 |
| K nnz | 75,362,094 |
| material | steel — E = 200 GPa, ν = 0.3, ρ = 7850 kg/m³ |
| modes requested | 10 lowest |
## Bench harness
Same harness for every backend so the numbers are comparable:

- 8 threads (`OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8`), seeded before numpy import. See Solver defaults and threading for why this is the default.
- Wall time captured by `time.perf_counter` around the `solve_modal(...)` call (or, for MAPDL, the `Elapsed Time` line at the bottom of the `.out` file).
- Peak resident memory captured by `getrusage(RUSAGE_SELF).ru_maxrss` for the femorph-solver rows; for MAPDL we record the `Maximum Scratch Memory Used` line, which is what dominates the factor — the database itself is < 100 MB.
- MAPDL reads the same CDB and applies the same bore clamp via `CSYS,1 / NSEL,S,LOC,X,0,0.10 / D,ALL,ALL`, so MAPDL and the Python side solve the same problem under identical BCs.
Reproduction script: `perf/bench_pbs_rotor.py`.
Host: 32-core executor, 94 GiB RAM, NixOS 25.11.
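The measurement side of the harness can be sketched as follows (simplified; `bench` is an illustrative name, and since `ru_maxrss` reports a process-lifetime peak, each backend must run in a fresh process):

```python
import os

# Thread caps must be seeded before numpy (and hence any BLAS) loads.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ.setdefault(var, "8")

import resource
import time

def bench(solve, *args, **kwargs):
    # Wall time around the single solver call, as described above.
    t0 = time.perf_counter()
    result = solve(*args, **kwargs)
    wall = time.perf_counter() - t0
    # ru_maxrss: peak RSS over the process lifetime,
    # in KiB on Linux (bytes on macOS).
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return result, wall, peak_mib
```

The `resource` module is Unix-only, which is fine for the NixOS host above.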
## Results
| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | f₁ (Hz) | Status |
|---|---|---|---|---|---|
| MAPDL 22.2 `-np 8` | Block Lanczos (LANB) | 30.0 | 10,652 | 0.47038 | reference |
| femorph-solver | ARPACK σ=0 + `mkl_direct` | 22.5 | 9,957 | 0.47038 | fastest — 1.33× MAPDL |
| femorph-solver | ARPACK σ=0 + `pardiso` | 32.6 | 9,957 | 0.47038 | 0.92× MAPDL |
| femorph-solver | ARPACK σ=0 + `mumps` | 35.7 | 9,549 | 0.47038 | smallest memory footprint |
| femorph-solver | LOBPCG + auto | DNF | — | — | did not converge in 900 s budget (ill-conditioned) |
| femorph-solver | PRIMME + auto | failed | — | — | PRIMME error -41 (matvec) — see the PRIMME-at-scale note below |
| femorph-solver | ARPACK σ=0 + SuperLU | skipped | — | — | L+U fill-in OOMs at 500k SOLID186 DOFs |
| femorph-solver | ARPACK σ=0 + UMFPACK | n/a | — | — | SuiteSparse not buildable on this NixOS host; Linux/macOS users should … |
| femorph-solver | ARPACK σ=0 + iterative | skipped | — | — | iterative inner solve not viable for shift-invert at half-million DOFs without a preconditioner |
Frequency agreement: every direct backend (`mkl_direct`, `pardiso`, `mumps`) returned f₁ = 0.47038 Hz, matching MAPDL to five decimal places. The first 10 modes match MAPDL to ≤ 1 Hz on the highest mode (91 Hz range).
## Headlines

- Use `mkl_direct` when MKL is available — it is the fastest backend in this comparison and beats MAPDL Block Lanczos by 25 % at the same thread count, on the same mesh.
- MUMPS is a close second for memory-constrained hosts: ~10 % smaller peak RSS than MAPDL (9,549 vs 10,652 MiB), ~19 % slower wall.
- `pardiso` (the open-source `pypardiso` MKL-Pardiso wrapper) sits between the two — it picks up the multi-threaded factor, but its Python wrapper has more per-call overhead than `mkl_direct`'s direct C-extension binding.
- MAPDL Block Lanczos is competitive but not the fastest. It has been tuned hard, but `mkl_direct` reaches the same answer faster on identical hardware at the same thread count.
## PRIMME at scale
PRIMME's `primme.eigsh` faulted with error -41 ("error happened at the matvec") on the 500k-DOF clamped sector. The same wheel solves the 60×60×2 SOLID185 plate and the 60-DOF tridiagonal synthetic without issue. Root-causing the matvec error would mean digging into PRIMME's internal Davidson restart logic on a large, ill-conditioned K (the bore clamp gives rcond ≈ 3e-18); neither the sigma=0 → Davidson route nor the M-as-LinearOperator workaround resolved it.
Rather than chase the bug into PRIMME's internals or build an OPinv-passing wrapper that duplicates what ARPACK + a linear backend already does (and beats MAPDL on this workload anyway), `_autotune_eigen_solver` now always picks ARPACK for the simple modal entry point as of 2026-05. PRIMME stays registered:

- explicit `eigen_solver="primme"` still works on problems where Davidson converges (≤ ~100k DOFs, well-conditioned K);
- `femorph_solver.solvers.cyclic.solve_cyclic()` uses PRIMME with an OPinv built from the chosen linear backend — the bug is invisible there because PRIMME never falls back to its internal iterative inner solve.
So if you hit -41 on a real model, the migration path is:

```python
solve_modal(K, M, prescribed,
            eigen_solver="arpack",      # the new default anyway
            linear_solver="mkl_direct")
```
LOBPCG also does not converge here — the clamped-bore K is strongly ill-conditioned at the bore (rcond ≈ 3e-18), and without explicit shift-invert LOBPCG cannot pick the lowest 10 modes out of the disk's high-frequency cluster within a 900-second budget. This is a property of the workload, not of the implementation.
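For contrast, the shift-invert mechanism the ARPACK rows rely on can be seen on a toy generalized eigenproblem (stand-in matrices, not the rotor; SciPy factors the shifted matrix internally here):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

# Small stand-in for (K, M): a 1-D stiffness/mass pair, not the rotor.
n = 200
K = sparse.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sparse.identity(n, format="csc")

# Shift-invert about sigma=0: ARPACK iterates on K^-1 M, so the lowest
# physical modes become the largest eigenvalues of the iterated operator
# and converge quickly.  This is the mechanism LOBPCG lacks.
vals, vecs = eigsh(K, k=10, M=M, sigma=0, which="LM")
```

The lowest eigenvalues come out accurately even though they are tiny relative to the spectrum's top end, which is exactly the regime the rotor's lowest 10 modes live in.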
## Why mkl_direct and not pardiso?
Both wrap the same MKL Pardiso factor under the hood, but through different bindings:

- `mkl_direct` calls MKL Pardiso through a direct `ctypes` binding bundled with the femorph-solver source — one factor + one solve per call, no per-call diagnostic overhead, no Python-side matrix-format conversion.
- `pardiso` (the `pypardiso` package) wraps MKL Pardiso behind a more user-friendly `scipy.sparse.linalg`-shaped interface; the wrapper does extra symbolic-mode bookkeeping and accepts a wider range of input formats. The convenience costs ~10 s on the 500k-DOF rotor — material when the goal is a tight loop.
For one-shot solves the wrapper overhead doesn't matter. For inner shift-invert calls (ARPACK loops the inner solve dozens of times) it compounds.
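The factor-once/solve-many pattern can be sketched with SciPy primitives (SuperLU standing in for the MKL or MUMPS factor; this is an illustration of the pattern, not femorph-solver's implementation):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, eigsh, splu

n = 300
K = sparse.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = sparse.identity(n, format="csc")

# Factor once, outside the Lanczos loop (SuperLU standing in for the
# MKL / MUMPS factor)...
lu = splu(K)

# ...then hand ARPACK only the cheap triangular solves: every inner
# iteration costs two triangular sweeps instead of a fresh factorisation.
OPinv = LinearOperator((n, n), matvec=lu.solve)
vals, vecs = eigsh(K, k=6, M=M, sigma=0, which="LM", OPinv=OPinv)
```

Any per-call overhead in the `matvec` path is multiplied by the number of inner iterations, which is why a thin binding wins over a convenient wrapper in this loop.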
## References

- Bench script: `perf/bench_pbs_rotor.py`
- MAPDL APDL deck used in the comparison: `perf/pbs_modal.inp`
- Cyclic-symmetry workload that drove the factorisation perf push: `perf/cyclic-perf.md`