.. _pbs-rotor-comparison:

PBS rotor — solver head-to-head at 8 threads
============================================

This page records the cross-solver wall-time and memory comparison on the PBS
research rotor — a real industrial workload that drove the threading and
linear-backend defaults documented in :doc:`solver-defaults-and-threading`.
The geometry is proprietary and not republished, but the mesh stats, boundary
conditions, material, and the result table are the artefacts the docs
intentionally publish so users can decide which backend to install.

Workload
--------

.. figure:: /_static/figures/pbs_rotor_sector_mesh.png
   :align: center
   :width: 80%
   :alt: PBS rotor — single sector mesh with bore-clamp BC

   Single sector of the 20-sector PBS rotor: 164,922 nodes, 41,084 SOLID186
   cells, 494,649 free DOFs after the bore clamp. The 39 red nodes
   (cylindrical radius :math:`r < 0.10`) are pinned in all three translations
   to remove rigid-body modes and make the system positive-definite.

================ =================================================
field            value
================ =================================================
deck             ``pbs-sector-hd.cdb`` (single sector)
element type     SOLID186 (HEX20 + degenerate TET10/WEDGE15/PYR13)
nodes            164,922
elements         41,084
total DOFs       494,766
clamped DOFs     117 (39 bore nodes × 3 translations)
free DOFs        494,649
K nnz            75,362,094
material         steel — E=200 GPa, ν=0.3, ρ=7850 kg/m³
modes requested  10 lowest
================ =================================================

Bench harness
-------------

Same harness for every backend so the numbers are comparable:

* 8 threads (``OMP_NUM_THREADS = MKL_NUM_THREADS = OPENBLAS_NUM_THREADS = 8``)
  set **before** numpy is imported. See :doc:`solver-defaults-and-threading`
  for why this is the default.
* Wall time captured by ``time.perf_counter`` around the ``solve_modal(...)``
  call (or, for MAPDL, the ``Elapsed Time`` line at the bottom of the ``.out``
  file).
* Peak resident memory captured by ``getrusage(RUSAGE_SELF).ru_maxrss`` for
  the femorph-solver rows; for MAPDL we record the ``Maximum Scratch Memory
  Used`` line (the factor lives in scratch and dominates memory use — the
  database itself is <100 MB).
* MAPDL reads the same CDB and applies the same bore clamp via
  ``CSYS,1 / NSEL,S,LOC,X,0,0.10 / D,ALL,ALL``, so MAPDL and the Python side
  solve **the same problem** under identical BCs.

Reproduction script: :file:`perf/bench_pbs_rotor.py`; a condensed sketch of
the measurement pattern is shown below. Host: 32-core executor, 94 GiB RAM,
NixOS 25.11.
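The sketch below condenses that pattern: thread caps before the first
numpy-linked import, ``time.perf_counter`` around the solve, ``ru_maxrss`` for
peak RSS. The tridiagonal stand-in for ``K``/``M``, the top-level
``solve_modal`` import, the ``(frequencies, modes)`` return shape, and
``prescribed`` as a list of clamped DOF indices are illustrative assumptions;
the real harness assembles the rotor matrices from the CDB::

    import os

    # Thread caps must be in the environment before numpy (and anything
    # linking MKL/OpenBLAS) is imported, otherwise the pools are already sized.
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[var] = "8"

    import time
    from resource import RUSAGE_SELF, getrusage

    import scipy.sparse as sp
    from femorph_solver import solve_modal   # assumed import path

    # Tiny synthetic stand-in; the real script builds K, M, prescribed from
    # pbs-sector-hd.cdb plus the bore clamp.
    n = 60
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    M = sp.identity(n, format="csr")
    prescribed = [0]                         # stand-in for the 117 clamped DOFs

    t0 = time.perf_counter()
    frequencies, modes = solve_modal(K, M, prescribed,
                                     eigen_solver="arpack",
                                     linear_solver="mkl_direct")
    wall = time.perf_counter() - t0

    peak_mib = getrusage(RUSAGE_SELF).ru_maxrss / 1024   # ru_maxrss is KiB on Linux
    print(f"wall {wall:.1f} s  peak RSS {peak_mib:.0f} MiB  f1 {frequencies[0]:.5f} Hz")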
Results
-------

.. list-table::
   :header-rows: 1
   :widths: 26 20 14 14 12 14

   * - Backend
     - Eigen solver × linear backend
     - Wall (s)
     - Peak RSS (MiB)
     - f₁ (Hz)
     - Status
   * - **MAPDL 22.2 ``-np 8``**
     - Block Lanczos (LANB)
     - **30.0**
     - 10,652
     - 0.47038
     - reference
   * - femorph-solver
     - ARPACK σ=0 + ``mkl_direct``
     - **22.5**
     - 9,957
     - 0.47038
     - **fastest — 1.33× MAPDL**
   * - femorph-solver
     - ARPACK σ=0 + ``pardiso``
     - 32.6
     - 9,957
     - 0.47038
     - 0.92× MAPDL
   * - femorph-solver
     - ARPACK σ=0 + ``mumps``
     - 35.7
     - **9,549**
     - 0.47038
     - smallest memory footprint
   * - femorph-solver
     - LOBPCG + auto
     - DNF
     - —
     - —
     - did not converge within the 900 s budget (ill-conditioned K)
   * - femorph-solver
     - PRIMME + auto
     - failed
     - —
     - —
     - PRIMME error -41 (matvec) — see *PRIMME at scale* below
   * - femorph-solver
     - ARPACK σ=0 + ``superlu``
     - skipped
     - —
     - —
     - L+U fill-in OOMs at 500k SOLID186 DOFs
   * - femorph-solver
     - ARPACK σ=0 + ``cholmod``
     - n/a
     - —
     - —
     - SuiteSparse not buildable on this NixOS host; Linux/macOS users should ``pip install scikit-sparse`` and re-bench
   * - femorph-solver
     - ARPACK σ=0 + ``cg`` / ``gmres``
     - skipped
     - —
     - —
     - iterative inner solve is not viable for shift-invert at half a million DOFs without a preconditioner

Frequency agreement: every direct backend (``mkl_direct``, ``pardiso``,
``mumps``) returned f₁ = 0.47038 Hz, matching MAPDL to five decimal places.
The first 10 modes agree with MAPDL to within 1 Hz even on the highest mode
(≈ 91 Hz).

Headlines
~~~~~~~~~

* **Use ``mkl_direct``** when MKL is available — it is the fastest backend in
  this comparison and beats MAPDL Block Lanczos by 25 % at the same thread
  count, on the same mesh.
* **MUMPS** is a close second for memory-constrained hosts: ~10 % smaller peak
  RSS than MAPDL, ~19 % slower wall time.
* **``pypardiso``** (the open-source MKL Pardiso wrapper) sits between the
  two — it picks up the same multi-threaded factor, but its Python wrapper has
  more per-call overhead than ``mkl_direct``'s direct ``ctypes`` binding.
* **MAPDL Block Lanczos is competitive but not the fastest.** MAPDL's in-house
  implementation has been tuned hard, but ``mkl_direct`` reaches the same
  answer faster on identical hardware at the same thread count.

PRIMME at scale
---------------

PRIMME's ``primme.eigsh`` faulted with error -41 ("error happened at the
matvec") on the 500k-DOF clamped sector. The same wheel solves the 60×60×2
SOLID185 plate and the 60-DOF tridiagonal synthetic without issue.
Root-causing the matvec error would mean digging into PRIMME's internal
Davidson restart logic on a large, ill-conditioned K (the bore clamp gives
``rcond ≈ 3e-18``); neither the ``sigma=0 → Davidson`` route nor the
M-as-LinearOperator workaround resolved it.

Rather than chase the bug into PRIMME's internals or build an OPinv-passing
wrapper that duplicates what ARPACK + a linear backend already does (and beats
MAPDL on this workload anyway), **``_autotune_eigen_solver`` now always picks
ARPACK** for the simple modal entry point as of 2026-05. PRIMME stays
registered:

* explicit ``eigen_solver="primme"`` still works on problems where Davidson
  converges (≤ ~100k DOFs, well-conditioned K);
* :func:`femorph_solver.solvers.cyclic.solve_cyclic` uses PRIMME *with an
  OPinv* built from the chosen linear backend — the bug is invisible there
  because PRIMME never falls back to its internal iterative inner solve (the
  shift-invert/OPinv wiring is sketched below).
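For reference, a minimal sketch of that shift-invert wiring using stock SciPy,
with ARPACK's ``eigsh`` in place of PRIMME and SuperLU standing in for the
MKL/MUMPS factorisation that femorph-solver's linear backends actually provide
(the synthetic matrices and the SuperLU factor are illustrative only; the
table above shows SuperLU itself OOMs at the rotor's 500k DOFs). It also shows
why per-call overhead in the linear backend compounds: ARPACK calls the
factored operator dozens of times per run::

    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Small synthetic K, M; the wiring, not the rotor, is the point here.
    n = 500
    K = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
    M = sp.identity(n, format="csc")

    # Shift-invert at sigma = 0: factor (K - sigma*M) once ...
    sigma = 0.0
    lu = spla.splu((K - sigma * M).tocsc())
    OPinv = spla.LinearOperator(K.shape, matvec=lu.solve, dtype=K.dtype)

    # ... then every inner solve ARPACK requests is a cheap back-substitution.
    # which="LM" in shift-invert mode returns the eigenvalues nearest sigma,
    # i.e. the lowest modes of the clamped system.
    vals, vecs = spla.eigsh(K, k=10, M=M, sigma=sigma, which="LM", OPinv=OPinv)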
So if you hit -41 on a real model, the migration path is::

    solve_modal(K, M, prescribed,
                eigen_solver="arpack",      # the new default anyway
                linear_solver="mkl_direct")

LOBPCG also doesn't converge here — the clamped-bore K is strongly
ill-conditioned at the bore (``rcond ≈ 3e-18``), and without explicit
shift-invert LOBPCG cannot pick the lowest 10 modes out of the disk's
high-frequency cluster within the 900-second budget. This is a property of the
workload, not of the implementation.

Why ``mkl_direct`` and not ``pardiso``?
----------------------------------------

Both wrap the same MKL Pardiso factorisation under the hood, but through
different bindings:

* ``mkl_direct`` calls MKL Pardiso through a direct ``ctypes`` binding bundled
  with the femorph-solver source — one factor + one solve per call, no
  per-call diagnostic overhead, no Python-side matrix-format conversion.
* ``pardiso`` (the ``pypardiso`` package) wraps MKL Pardiso behind a more
  user-friendly, ``scipy.sparse.linalg``-shaped interface; the wrapper does
  extra symbolic-mode bookkeeping and accepts a wider range of input formats.
  The convenience costs ~10 s on the 500k-DOF rotor — material when the goal
  is a tight loop.

For one-shot solves the wrapper overhead doesn't matter. For the inner
shift-invert solves (ARPACK loops the inner solve dozens of times per run) the
overhead compounds.

References
----------

* Bench script: :file:`perf/bench_pbs_rotor.py`
* MAPDL APDL deck used in the comparison: :file:`perf/pbs_modal.inp`
* Cyclic-symmetry workload that drove the factorisation perf push:
  :file:`perf/cyclic-perf.md`