# PBS rotor — solver head-to-head at 8 threads

This page records the cross-solver wall-time + memory comparison on the PBS research rotor — a real industrial workload that drove the threading + linear-backend defaults documented on Solver defaults and threading.

The geometry is proprietary and not republished, but the mesh statistics, boundary conditions, material, and result table are the artefacts these docs intentionally publish, so users can decide which backend to install.

## Workload

PBS rotor — single sector mesh with bore-clamp BC

Single sector of the 20-sector PBS rotor: 164,922 nodes, 41,084 SOLID186 cells, and 494,649 free DOFs after the bore clamp. The 39 red nodes (cylindrical radius \(r < 0.10\)) are pinned in all three translations to remove rigid-body modes and make the system positive-definite.

| field | value |
|-------|-------|
| deck | `pbs-sector-hd.cdb` (single sector) |
| element type | SOLID186 (HEX20 + degenerate TET10/WEDGE15/PYR13) |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| clamped DOFs | 117 (39 bore nodes × 3 translations) |
| free DOFs | 494,649 |
| K nnz | 75,362,094 |
| material | steel — E = 200 GPa, ν = 0.3, ρ = 7850 kg/m³ |
| modes requested | 10 lowest |

## Bench harness

Same harness for every backend so the numbers are comparable:

  • 8 threads (`OMP_NUM_THREADS` = `MKL_NUM_THREADS` = `OPENBLAS_NUM_THREADS` = 8), exported before the numpy import. See Solver defaults and threading for why this is the default.

  • Wall time captured by `time.perf_counter` around the `solve_modal(...)` call (or, for MAPDL, the Elapsed Time line at the bottom of the `.out` file).

  • Peak resident memory captured by `getrusage(RUSAGE_SELF).ru_maxrss` for the femorph-solver rows; for MAPDL we record the Maximum Scratch Memory Used line, which is what dominates the factor — the database itself is <100 MB.

  • MAPDL reads the same CDB and applies the same bore clamp via `CSYS,1` / `NSEL,S,LOC,X,0,0.10` / `D,ALL,ALL`, so MAPDL and the Python side solve the same problem under identical BCs.
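The measurement logic boils down to a few lines. A sketch under the assumption that the solver call is imported only after the environment is pinned; the `bench` helper name is ours, not part of the published script:

```python
import os

# Pin every BLAS/OpenMP thread pool to 8 BEFORE numpy (and MKL) load.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "8"

import time
import resource


def bench(solve, *args, **kwargs):
    """Return (result, wall seconds, peak RSS in MiB) for one solver call."""
    t0 = time.perf_counter()
    result = solve(*args, **kwargs)
    wall = time.perf_counter() - t0
    # ru_maxrss is reported in KiB on Linux (bytes on macOS); the host is Linux.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return result, wall, peak_mib
```

A femorph-solver row would then come from something like `bench(solve_modal, K, M, prescribed)`.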

Reproduction script: `perf/bench_pbs_rotor.py`.

Host: 32-core executor, 94 GiB RAM, NixOS 25.11.

## Results

| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | f₁ (Hz) | Status |
|---------|----------------|----------|----------------|---------|--------|
| MAPDL 22.2 `-np 8` | Block Lanczos (LANB) | 30.0 | 10,652 | 0.47038 | reference |
| femorph-solver | ARPACK σ=0 + mkl_direct | 22.5 | 9,957 | 0.47038 | fastest — 1.33× MAPDL |
| femorph-solver | ARPACK σ=0 + pardiso | 32.6 | 9,957 | 0.47038 | 0.92× MAPDL |
| femorph-solver | ARPACK σ=0 + mumps | 35.7 | 9,549 | 0.47038 | smallest memory footprint |
| femorph-solver | LOBPCG + auto | DNF | — | — | did not converge in 900 s budget (ill-conditioned) |
| femorph-solver | PRIMME + auto | failed | — | — | PRIMME error -41 (matvec) — see PRIMME at scale note below |
| femorph-solver | ARPACK σ=0 + superlu | skipped | — | — | L+U fill-in OOMs at 500k SOLID186 DOFs |
| femorph-solver | ARPACK σ=0 + cholmod | n/a | — | — | SuiteSparse not buildable on this NixOS host; Linux/macOS users should `pip install scikit-sparse` and re-bench |
| femorph-solver | ARPACK σ=0 + cg / gmres | skipped | — | — | iterative inner solve not viable for shift-invert at half-million DOFs without a preconditioner |

Frequency agreement: every direct backend (mkl_direct, pardiso, mumps) returned f₁ = 0.47038 Hz, matching MAPDL to five decimal places, and all ten requested modes agree with MAPDL to within 1 Hz, including the highest mode near 91 Hz.
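That agreement check is easy to script. A minimal sketch with hypothetical frequency arrays; the 1 Hz tolerance mirrors the criterion above, and the `modes_agree` helper name is ours:

```python
import numpy as np


def modes_agree(f_ref, f_test, atol_hz=1.0):
    """True when each sorted frequency pair differs by at most atol_hz."""
    f_ref = np.sort(np.asarray(f_ref, dtype=float))
    f_test = np.sort(np.asarray(f_test, dtype=float))
    return f_ref.shape == f_test.shape and bool(
        np.max(np.abs(f_ref - f_test)) <= atol_hz
    )
```

E.g. `modes_agree(f_mapdl, f_femorph)` over the ten benchmarked modes.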

## Headlines

  • Use `mkl_direct` when MKL is available — it’s the fastest backend in this comparison and beats MAPDL Block Lanczos by 25 % at the same thread count, on the same mesh.

  • MUMPS is a close second for memory-constrained hosts: ~10 % smaller peak RSS than MAPDL, at the cost of ~19 % more wall time.

  • `pardiso` (via the open-source pypardiso wrapper for MKL Pardiso) sits between the two on wall time — it picks up the same multi-threaded factor, but its Python wrapper carries more per-call overhead than `mkl_direct`’s direct C-extension binding.

  • MAPDL Block Lanczos is competitive but not the fastest: the in-house implementation is heavily tuned, yet `mkl_direct` reaches the same answer faster on identical hardware at the same thread count.

## PRIMME at scale

PRIMME’s `primme.eigsh` faulted with error -41 (“error happened at the matvec”) on the 500k-DOF clamped sector. The same wheel solves the 60×60×2 SOLID185 plate and the 60-DOF tridiagonal synthetic without issue. Root-causing the matvec error would mean digging into PRIMME’s internal Davidson restart logic on a K that is both large and ill-conditioned (the bore clamp gives rcond ≈ 3e-18); neither the σ=0 Davidson route nor the M-as-LinearOperator workaround resolved it.

Rather than chase the bug into PRIMME’s internals, or build an OPinv-passing wrapper that duplicates what ARPACK plus a linear backend already does (and beats MAPDL on this workload anyway), `_autotune_eigen_solver` now always picks ARPACK for the simple modal entry point, as of 2026-05. PRIMME stays registered:

  • explicit `eigen_solver="primme"` still works on problems where Davidson converges (≤ ~100k DOFs, well-conditioned K),

  • `femorph_solver.solvers.cyclic.solve_cyclic()` uses PRIMME with an OPinv built from the chosen linear backend; the bug is invisible there because PRIMME never falls back to its internal iterative inner solve.

So if you hit -41 on a real model, the migration path is:

```python
solve_modal(K, M, prescribed,
            eigen_solver="arpack",            # the new default anyway
            linear_solver="mkl_direct")
```

LOBPCG also doesn’t converge here — the clamped-bore K is strongly ill-conditioned at the bore (rcond ≈ 3e-18), and without explicit shift-invert LOBPCG cannot separate the lowest 10 modes from the dense high-frequency cluster of the disk within a 900-second budget. This is a property of the workload, not the implementation.
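The shift-invert advantage ARPACK enjoys here is standard spectral-transformation algebra, nothing femorph-solver-specific:

```latex
K x = \lambda M x
\quad\Longrightarrow\quad
(K - \sigma M)^{-1} M \, x = \frac{1}{\lambda - \sigma}\, x
```

With \(\sigma = 0\) the smallest eigenvalues \(\lambda\) map to the largest, best-separated values \(1/\lambda\) of the transformed operator, which is the regime in which Lanczos/Arnoldi converges fastest; LOBPCG works on the untransformed pencil and must dig the same modes out of the crowded low end of the spectrum.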

## Why mkl_direct and not pardiso?

Both wrap the same MKL Pardiso factor under the hood, but through different bindings:

  • `mkl_direct` calls MKL Pardiso through a direct ctypes binding bundled with the femorph-solver source — one factor + one solve per call, no per-call diagnostic overhead, no Python-side matrix-format conversion.

  • `pardiso` (the pypardiso package) wraps MKL Pardiso behind a more user-friendly `scipy.sparse.linalg`-shaped interface; the wrapper does extra symbolic-mode bookkeeping and accepts a wider range of input formats. The convenience costs ~10 s on the 500k-DOF rotor — material when the goal is a tight loop.

For one-shot solves the wrapper overhead doesn’t matter. For inner shift-invert calls (ARPACK loops the inner solve dozens of times) the wrapper overhead compounds.
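The compounding effect can be reproduced with stock SciPy: factor once, then hand ARPACK that factor for every inner solve via `OPinv`. This is a generic shift-invert sketch, not femorph-solver’s actual binding, and the `lowest_modes` helper name is illustrative:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh, splu


def lowest_modes(K, M, k=10, sigma=0.0):
    """Shift-invert eigsh reusing a single sparse LU factorisation of K - sigma*M."""
    lu = splu((K - sigma * M).tocsc())              # one factorisation...
    OPinv = LinearOperator(K.shape, matvec=lu.solve)
    # ...reused by ARPACK for every inner solve of the shift-invert loop.
    vals, vecs = eigsh(K, k=k, M=M, sigma=sigma, OPinv=OPinv)
    return vals, vecs
```

Swapping `splu` for an MKL- or MUMPS-backed solve is exactly the `linear_solver` switch the results table benchmarks.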

## References

  • Bench script: `perf/bench_pbs_rotor.py`

  • MAPDL APDL deck used in the comparison: `perf/pbs_modal.inp`

  • Cyclic-symmetry workload that drove the factorisation perf push: `perf/cyclic-perf.md`