.. _pbs-rotor-comparison:

PBS rotor — solver head-to-head at 8 threads
============================================

This page records the cross-solver wall-time and memory comparison on the PBS
research rotor — a real industrial workload that drove the threading and
linear-backend defaults documented in :doc:`solver-defaults-and-threading`.
The geometry is proprietary and not republished, but the mesh stats, boundary
conditions, material, and the result table are the artefacts the docs
intentionally publish so users can decide which backend to install.

Workload
--------

.. figure:: /_static/figures/pbs_rotor_sector_mesh.png
   :align: center
   :width: 80%
   :alt: PBS rotor — single sector mesh with bore-clamp BC

   Single sector of the 20-sector PBS rotor: 164,922 nodes, 41,084 SOLID186
   cells, 494,649 free DOFs after the bore clamp. The 39 red nodes
   (cylindrical radius :math:`r < 0.10`) are pinned in all three translations
   to remove rigid-body modes and make the system positive-definite.

================ =================================================
field            value
================ =================================================
deck             ``pbs-sector-hd.cdb`` (single sector)
element type     SOLID186 (HEX20 + degenerate TET10/WEDGE15/PYR13)
nodes            164,922
elements         41,084
total DOFs       494,766
clamped DOFs     117 (39 bore nodes × 3 translations)
free DOFs        494,649
K nnz            75,362,094
material         steel — E=200 GPa, ν=0.3, ρ=7850 kg/m³
modes requested  10 lowest
================ =================================================

Bench harness
-------------

Same harness for every backend so the numbers are comparable:

* 8 threads (``OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8``)
  set **before** numpy is imported. See :doc:`solver-defaults-and-threading`
  for why this is the default.
* Wall time captured by ``time.perf_counter`` around the ``solve_modal(...)``
  call (or, for MAPDL, the ``Elapsed Time`` line at the bottom of the ``.out``
  file).
* Peak resident memory captured by ``getrusage(RUSAGE_SELF).ru_maxrss`` for
  the femorph-solver rows; for MAPDL we record the ``Maximum Scratch Memory
  Used`` line (the factor lives in scratch and dominates memory use — the
  database itself is <100 MB).
* MAPDL reads the same CDB and applies the same bore clamp via
  ``CSYS,1 / NSEL,S,LOC,X,0,0.10 / D,ALL,ALL``, so MAPDL and the Python side
  solve **the same problem** under identical BCs.

Reproduction script: :file:`perf/bench_pbs_rotor.py`; a condensed sketch of
the measurement pattern is shown below. Host: 32-core executor, 94 GiB RAM,
NixOS 25.11.
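The sketch below condenses that pattern: thread caps before the first
numpy-linked import, ``time.perf_counter`` around the solve, ``ru_maxrss`` for
peak RSS. The tridiagonal stand-in for ``K``/``M``, the top-level
``solve_modal`` import, the ``(frequencies, modes)`` return shape, and
``prescribed`` as a list of clamped DOF indices are illustrative assumptions;
the real harness assembles the rotor matrices from the CDB::

    import os

    # Thread caps must be in the environment before numpy (and anything
    # linking MKL/OpenBLAS) is imported, otherwise the pools are already sized.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = "8"

    import time
    from resource import RUSAGE_SELF, getrusage

    import scipy.sparse as sp
    from femorph_solver import solve_modal   # assumed import path

    # Tiny synthetic stand-in; the real script builds K, M, prescribed from
    # pbs-sector-hd.cdb plus the bore clamp.
    n = 60
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    M = sp.identity(n, format="csr")
    prescribed = [0]                         # stand-in for the 117 clamped DOFs

    t0 = time.perf_counter()
    frequencies, modes = solve_modal(K, M, prescribed,
                                     eigen_solver="arpack",
                                     linear_solver="mkl_direct")
    wall = time.perf_counter() - t0

    peak_mib = getrusage(RUSAGE_SELF).ru_maxrss / 1024   # ru_maxrss is KiB on Linux
    print(f"wall {wall:.1f} s  peak RSS {peak_mib:.0f} MiB  f1 {frequencies[0]:.5f} Hz")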
Results
-------

.. list-table::
   :header-rows: 1
   :widths: 26 20 14 14 12 14

   * - Backend
     - Eigen solver × linear backend
     - Wall (s)
     - Peak RSS (MiB)
     - f₁ (Hz)
     - Status
   * - **MAPDL 22.2 ``-np 8``**
     - Block Lanczos (LANB)
     - **30.0**
     - 10,652
     - 0.47038
     - reference
   * - femorph-solver
     - ARPACK σ=0 + ``mkl_direct``
     - **22.5**
     - 9,957
     - 0.47038
     - **fastest — 1.33× MAPDL**
   * - femorph-solver
     - ARPACK σ=0 + ``pardiso``
     - 32.6
     - 9,957
     - 0.47038
     - 0.92× MAPDL
   * - femorph-solver
     - ARPACK σ=0 + ``mumps``
     - 35.7
     - **9,549**
     - 0.47038
     - smallest memory footprint
   * - femorph-solver
     - LOBPCG + auto
     - DNF
     - —
     - —
     - did not converge within the 900 s budget (ill-conditioned K)
   * - femorph-solver
     - PRIMME + auto
     - failed
     - —
     - —
     - PRIMME error -41 (matvec) — see *PRIMME at scale* below
   * - femorph-solver
     - ARPACK σ=0 + ``superlu``
     - skipped
     - —
     - —
     - L+U fill-in OOMs at 500k SOLID186 DOFs
   * - femorph-solver
     - ARPACK σ=0 + ``cholmod``
     - n/a
     - —
     - —
     - SuiteSparse not buildable on this NixOS host; Linux/macOS users should ``pip install scikit-sparse`` and re-bench
   * - femorph-solver
     - ARPACK σ=0 + ``cg`` / ``gmres``
     - skipped
     - —
     - —
     - iterative inner solve is not viable for shift-invert at half a million DOFs without a preconditioner

Frequency agreement: every direct backend (``mkl_direct``, ``pardiso``,
``mumps``) returned f₁ = 0.47038 Hz, matching MAPDL to five decimal places.
The first 10 modes agree with MAPDL to within 1 Hz even on the highest mode
(≈ 91 Hz).

Headlines
~~~~~~~~~

* **Use ``mkl_direct``** when MKL is available — it is the fastest backend in
  this comparison and beats MAPDL Block Lanczos by 25 % at the same thread
  count, on the same mesh.
* **MUMPS** is a close second for memory-constrained hosts: ~10 % smaller peak
  RSS than MAPDL, ~19 % slower wall time.
* **``pypardiso``** (the open-source MKL Pardiso wrapper) sits between the
  two — it picks up the same multi-threaded factor, but its Python wrapper has
  more per-call overhead than ``mkl_direct``'s direct ``ctypes`` binding.
* **MAPDL Block Lanczos is competitive but not the fastest.** MAPDL's in-house
  implementation has been tuned hard, but ``mkl_direct`` reaches the same
  answer faster on identical hardware at the same thread count.

PRIMME at scale
---------------

PRIMME's ``primme.eigsh`` faulted with error -41 ("error happened at the
matvec") on the 500k-DOF clamped sector. The same wheel solves the 60×60×2
SOLID185 plate and the 60-DOF tridiagonal synthetic without issue.
Root-causing the matvec error would mean digging into PRIMME's internal
Davidson restart logic on a large, ill-conditioned K (the bore clamp gives
``rcond ≈ 3e-18``); neither the ``sigma=0 → Davidson`` route nor the
M-as-LinearOperator workaround resolved it.

Rather than chase the bug into PRIMME's internals or build an OPinv-passing
wrapper that duplicates what ARPACK + a linear backend already does (and beats
MAPDL on this workload anyway), **``_autotune_eigen_solver`` now always picks
ARPACK** for the simple modal entry point as of 2026-05. PRIMME stays
registered:

* explicit ``eigen_solver="primme"`` still works on problems where Davidson
  converges (≤ ~100k DOFs, well-conditioned K);
* :func:`femorph_solver.solvers.cyclic.solve_cyclic` uses PRIMME *with an
  OPinv* built from the chosen linear backend — the bug is invisible there
  because PRIMME never falls back to its internal iterative inner solve (the
  shift-invert/OPinv wiring is sketched below).
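For reference, a minimal sketch of that shift-invert wiring using stock SciPy,
with ARPACK's ``eigsh`` in place of PRIMME and SuperLU standing in for the
MKL/MUMPS factorisation that femorph-solver's linear backends actually provide
(the synthetic matrices and the SuperLU factor are illustrative only; the
table above shows SuperLU itself OOMs at the rotor's 500k DOFs). It also shows
why per-call overhead in the linear backend compounds: ARPACK calls the
factored operator dozens of times per run::

    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Small synthetic K, M; the wiring, not the rotor, is the point here.
    n = 500
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    M = sp.identity(n, format="csc")

    # Shift-invert at sigma = 0: factor (K - sigma*M) once ...
    sigma = 0.0
    lu = spla.splu((K - sigma * M).tocsc())
    OPinv = spla.LinearOperator(K.shape, matvec=lu.solve, dtype=K.dtype)

    # ... then every inner solve ARPACK requests is a cheap back-substitution.
    # which="LM" in shift-invert mode returns the eigenvalues nearest sigma,
    # i.e. the lowest modes of the clamped system.
    vals, vecs = spla.eigsh(K, k=10, M=M, sigma=sigma, which="LM", OPinv=OPinv)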
So if you hit -41 on a real model, the migration path is::

    solve_modal(K, M, prescribed,
                eigen_solver="arpack",      # the new default anyway
                linear_solver="mkl_direct")

LOBPCG also doesn't converge here — the clamped-bore K is strongly
ill-conditioned at the bore (``rcond ≈ 3e-18``), and without explicit
shift-invert LOBPCG cannot pick the lowest 10 modes out of the disk's
high-frequency cluster within the 900-second budget. This is a property of the
workload, not of the implementation.

Why ``mkl_direct`` and not ``pardiso``?
----------------------------------------

Both wrap the same MKL Pardiso factorisation under the hood, but through
different bindings:

* ``mkl_direct`` calls MKL Pardiso through a direct ``ctypes`` binding bundled
  with the femorph-solver source — one factor + one solve per call, no
  per-call diagnostic overhead, no Python-side matrix-format conversion.
* ``pardiso`` (the ``pypardiso`` package) wraps MKL Pardiso behind a more
  user-friendly, ``scipy.sparse.linalg``-shaped interface; the wrapper does
  extra symbolic-mode bookkeeping and accepts a wider range of input formats.
  The convenience costs ~10 s on the 500k-DOF rotor — material when the goal
  is a tight loop.

For one-shot solves the wrapper overhead doesn't matter. For the inner
shift-invert solves (ARPACK loops the inner solve dozens of times per run) the
overhead compounds.

References
----------

* Bench script: :file:`perf/bench_pbs_rotor.py`
* MAPDL APDL deck used in the comparison: :file:`perf/pbs_modal.inp`
* Cyclic-symmetry workload that drove the factorisation perf push:
  :file:`perf/cyclic-perf.md`