# PBS rotor — solver head-to-head at 8 threads

This page records the cross-solver wall-time + memory comparison on the PBS research rotor — a real industrial workload that drove the threading + linear-backend defaults documented on Solver defaults and threading.

The geometry is proprietary and not republished, but the mesh statistics, boundary conditions, material, and result table are the artefacts these docs intentionally publish, so users can decide which backend to install.

## Workload

PBS rotor — single sector mesh with bore-clamp BC

Single sector of the 20-sector PBS rotor: 164,922 nodes, 41,084 SOLID186 cells, and 494,649 free DOFs after the bore clamp. The 39 red nodes (cylindrical radius \(r < 0.10\)) are pinned in all three translations to remove rigid-body modes and make the system positive-definite.

| field | value |
|-------|-------|
| deck | `pbs-sector-hd.cdb` (single sector) |
| element type | SOLID186 (HEX20 + degenerate TET10/WEDGE15/PYR13) |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| clamped DOFs | 117 (39 bore nodes × 3 translations) |
| free DOFs | 494,649 |
| K nnz | 75,362,094 |
| material | steel — E = 200 GPa, ν = 0.3, ρ = 7850 kg/m³ |
| modes requested | 10 lowest |

## Bench harness

Same harness for every backend so the numbers are comparable:

  • 8 threads (`OMP_NUM_THREADS` = `MKL_NUM_THREADS` = `OPENBLAS_NUM_THREADS` = 8), exported before the numpy import. See Solver defaults and threading for why this is the default.

  • Wall time captured by `time.perf_counter` around the `solve_modal(...)` call (or, for MAPDL, the Elapsed Time line at the bottom of the `.out` file).

  • Peak resident memory captured by `getrusage(RUSAGE_SELF).ru_maxrss` for the femorph-solver rows; for MAPDL we record the Maximum Scratch Memory Used line, which is what dominates the factor — the database itself is <100 MB.

  • MAPDL reads the same CDB and applies the same bore clamp via `CSYS,1` / `NSEL,S,LOC,X,0,0.10` / `D,ALL,ALL`, so MAPDL and the Python side solve the same problem under identical BCs.
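The measurement logic boils down to a few lines. A sketch under the assumption that the solver call is imported only after the environment is pinned; the `bench` helper name is ours, not part of the published script:

```python
import os

# Pin every BLAS/OpenMP thread pool to 8 BEFORE numpy (and MKL) load.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "8"

import time
import resource


def bench(solve, *args, **kwargs):
    """Return (result, wall seconds, peak RSS in MiB) for one solver call."""
    t0 = time.perf_counter()
    result = solve(*args, **kwargs)
    wall = time.perf_counter() - t0
    # ru_maxrss is reported in KiB on Linux (bytes on macOS); the host is Linux.
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    return result, wall, peak_mib
```

A femorph-solver row would then come from something like `bench(solve_modal, K, M, prescribed)`.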

Reproduction script: `perf/bench_pbs_rotor.py`.

Host: 32-core executor, 94 GiB RAM, NixOS 25.11.

## Results

| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | f₁ (Hz) | Status |
|---------|----------------|----------|----------------|---------|--------|
| MAPDL 22.2 `-np 8` | Block Lanczos (LANB) | 30.0 | 10,652 | 0.47038 | reference |
| femorph-solver | ARPACK σ=0 + mkl_direct | 22.5 | 9,957 | 0.47038 | fastest — 1.33× MAPDL |
| femorph-solver | ARPACK σ=0 + pardiso | 32.6 | 9,957 | 0.47038 | 0.92× MAPDL |
| femorph-solver | ARPACK σ=0 + mumps | 35.7 | 9,549 | 0.47038 | smallest memory footprint |
| femorph-solver | LOBPCG + auto | DNF | — | — | did not converge in 900 s budget (ill-conditioned) |
| femorph-solver | PRIMME + auto | failed | — | — | PRIMME error -41 (matvec) — see PRIMME at scale note below |
| femorph-solver | ARPACK σ=0 + superlu | skipped | — | — | L+U fill-in OOMs at 500k SOLID186 DOFs |
| femorph-solver | ARPACK σ=0 + cholmod | n/a | — | — | SuiteSparse not buildable on this NixOS host; Linux/macOS users should `pip install scikit-sparse` and re-bench |
| femorph-solver | ARPACK σ=0 + cg / gmres | skipped | — | — | iterative inner solve not viable for shift-invert at half-million DOFs without a preconditioner |

Frequency agreement: every direct backend (mkl_direct, pardiso, mumps) returned f₁ = 0.47038 Hz, matching MAPDL to five decimal places, and all ten requested modes agree with MAPDL to within 1 Hz, including the highest mode near 91 Hz.
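That agreement check is easy to script. A minimal sketch with hypothetical frequency arrays; the 1 Hz tolerance mirrors the criterion above, and the `modes_agree` helper name is ours:

```python
import numpy as np


def modes_agree(f_ref, f_test, atol_hz=1.0):
    """True when each sorted frequency pair differs by at most atol_hz."""
    f_ref = np.sort(np.asarray(f_ref, dtype=float))
    f_test = np.sort(np.asarray(f_test, dtype=float))
    return f_ref.shape == f_test.shape and bool(
        np.max(np.abs(f_ref - f_test)) <= atol_hz
    )
```

E.g. `modes_agree(f_mapdl, f_femorph)` over the ten benchmarked modes.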

## Headlines

  • Use `mkl_direct` when MKL is available — it’s the fastest backend in this comparison and beats MAPDL Block Lanczos by 25 % at the same thread count, on the same mesh.

  • MUMPS is a close second for memory-constrained hosts: ~10 % smaller peak RSS than MAPDL, at the cost of ~19 % more wall time.

  • `pardiso` (via the open-source pypardiso wrapper for MKL Pardiso) sits between the two on wall time — it picks up the same multi-threaded factor, but its Python wrapper carries more per-call overhead than `mkl_direct`’s direct C-extension binding.

  • MAPDL Block Lanczos is competitive but not the fastest: the in-house implementation is heavily tuned, yet `mkl_direct` reaches the same answer faster on identical hardware at the same thread count.

## PRIMME at scale

PRIMME’s `primme.eigsh` faulted with error -41 (“error happened at the matvec”) on the 500k-DOF clamped sector. The same wheel solves the 60×60×2 SOLID185 plate and the 60-DOF tridiagonal synthetic without issue. Root-causing the matvec error would mean digging into PRIMME’s internal Davidson restart logic on a K that is both large and ill-conditioned (the bore clamp gives rcond ≈ 3e-18); neither the σ=0 Davidson route nor the M-as-LinearOperator workaround resolved it.

Rather than chase the bug into PRIMME’s internals, or build an OPinv-passing wrapper that duplicates what ARPACK plus a linear backend already does (and beats MAPDL on this workload anyway), `_autotune_eigen_solver` now always picks ARPACK for the simple modal entry point, as of 2026-05. PRIMME stays registered:

  • explicit `eigen_solver="primme"` still works on problems where Davidson converges (≤ ~100k DOFs, well-conditioned K),

  • `femorph_solver.solvers.cyclic.solve_cyclic()` uses PRIMME with an OPinv built from the chosen linear backend; the bug is invisible there because PRIMME never falls back to its internal iterative inner solve.

So if you hit -41 on a real model, the migration path is:

```python
solve_modal(K, M, prescribed,
            eigen_solver="arpack",            # the new default anyway
            linear_solver="mkl_direct")
```

LOBPCG also doesn’t converge here — the clamped-bore K is strongly ill-conditioned at the bore (rcond ≈ 3e-18), and without explicit shift-invert LOBPCG cannot separate the lowest 10 modes from the dense high-frequency cluster of the disk within a 900-second budget. This is a property of the workload, not the implementation.
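The shift-invert advantage ARPACK enjoys here is standard spectral-transformation algebra, nothing femorph-solver-specific:

```latex
K x = \lambda M x
\quad\Longrightarrow\quad
(K - \sigma M)^{-1} M \, x = \frac{1}{\lambda - \sigma}\, x
```

With \(\sigma = 0\) the smallest eigenvalues \(\lambda\) map to the largest, best-separated values \(1/\lambda\) of the transformed operator, which is the regime in which Lanczos/Arnoldi converges fastest; LOBPCG works on the untransformed pencil and must dig the same modes out of the crowded low end of the spectrum.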

## Why mkl_direct and not pardiso?

Both wrap the same MKL Pardiso factor under the hood, but through different bindings:

  • `mkl_direct` calls MKL Pardiso through a direct ctypes binding bundled with the femorph-solver source — one factor + one solve per call, no per-call diagnostic overhead, no Python-side matrix-format conversion.

  • `pardiso` (the pypardiso package) wraps MKL Pardiso behind a more user-friendly `scipy.sparse.linalg`-shaped interface; the wrapper does extra symbolic-mode bookkeeping and accepts a wider range of input formats. The convenience costs ~10 s on the 500k-DOF rotor — material when the goal is a tight loop.

For one-shot solves the wrapper overhead doesn’t matter. For inner shift-invert calls (ARPACK loops the inner solve dozens of times) the wrapper overhead compounds.
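The compounding effect can be reproduced with stock SciPy: factor once, then hand ARPACK that factor for every inner solve via `OPinv`. This is a generic shift-invert sketch, not femorph-solver’s actual binding, and the `lowest_modes` helper name is illustrative:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh, splu


def lowest_modes(K, M, k=10, sigma=0.0):
    """Shift-invert eigsh reusing a single sparse LU factorisation of K - sigma*M."""
    lu = splu((K - sigma * M).tocsc())              # one factorisation...
    OPinv = LinearOperator(K.shape, matvec=lu.solve)
    # ...reused by ARPACK for every inner solve of the shift-invert loop.
    vals, vecs = eigsh(K, k=k, M=M, sigma=sigma, OPinv=OPinv)
    return vals, vecs
```

Swapping `splu` for an MKL- or MUMPS-backed solve is exactly the `linear_solver` switch the results table benchmarks.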

## References

  • Bench script: `perf/bench_pbs_rotor.py`

  • MAPDL APDL deck used in the comparison: `perf/pbs_modal.inp`

  • Cyclic-symmetry workload that drove the factorisation perf push: `perf/cyclic-perf.md`