.. _pbs-rotor-cyclic-comparison:

PBS rotor — cyclic-modal sweep at 8 threads
===========================================

Companion to :doc:`pbs-rotor-comparison` — the same PBS sector, but solved as a
**cyclic-symmetry modal sweep** (full harmonic ladder :math:`k = 0, 1, \ldots, N/2`)
instead of a single non-cyclic modal solve. This is the workload that actually
motivates installing PRIMME.

The 20-sector PBS rotor closes at :math:`N = 20`, so the sweep covers 11 harmonics
(:math:`k = 0, 1, \ldots, 10`). Each harmonic solves a complex Hermitian generalised
eigenproblem of the same ~half-million-DOF size as on the simple-modal page; the
linear backend's factor is reused across harmonics, so the eigensolver's restart
efficiency dominates total wall time.

Workload
--------

.. list-table::
   :header-rows: 1
   :widths: 25 75

   * - field
     - value
   * - deck
     - ``pbs-sector-hd.cdb`` (single sector)
   * - sectors N
     - 20 (auto-detected)
   * - harmonics solved
     - 11 — full sweep k = 0..10
   * - modes per k
     - 4
   * - nodes
     - 164,922
   * - elements
     - 41,084
   * - total DOFs
     - 494,766
   * - K nnz
     - 75,362,094
   * - paired interfaces
     - 2,005 nodes
   * - material
     - steel — E=200 GPa, ν=0.3, ρ=7850 kg/m³

The mesh image is the same as in :doc:`pbs-rotor-comparison`.

Bench harness
-------------

``perf/bench_pbs_cyclic.py`` drives ``solve_cyclic`` over all viable
backend × eigen combinations at 8 threads. Same env-var seed pattern as the
simple-modal bench (``OMP_NUM_THREADS = 8`` before numpy import); peak RSS via
``getrusage``; per-backend wall budget 1800 s.

Eigen backends exercised in the cyclic path:

* **scipy** — routes to ARPACK (``scipy.sparse.linalg.eigsh``) on the
  complex-Hermitian per-harmonic eigensolve.
* **primme** — ``primme.eigsh`` with **OPinv** built from the chosen linear
  backend (i.e. PRIMME calls :math:`(K_\mathrm{red} - \sigma M_\mathrm{red})^{-1}`
  through Pardiso/MUMPS/etc.). This is the path where PRIMME's JDQMR algorithm
  pays off and where the simple-modal wrapper's matvec-error issue
  (:doc:`pbs-rotor-comparison`) doesn't show up.

MAPDL parity at strict 8-thread cap
-----------------------------------

For a fair head-to-head we ran MAPDL Block Lanczos on the same sector with the
same cyclic boundary conditions, the same 11 harmonics, 4 modes per harmonic,
``-np 8`` SMP under ``taskset -c 0-7`` (the default in
``ansys-docker/start-ansys.sh`` on this host). Then we re-ran the femorph
PRIMME + ``mkl_direct`` combo under the same ``taskset -c 0-7`` so neither side
can oversubscribe past 8 cores.

.. list-table::
   :header-rows: 1
   :widths: 32 16 18 16 18

   * - Solver
     - Wall (s)
     - Peak RSS (MiB)
     - Ratio vs MAPDL
     - Notes
   * - **MAPDL 22.2 LANB ``-np 8``**
     - **787.5**
     - 24,506
     - 1.00× (baseline)
     - SMP mode (no MPI in this image); CP time 5045 s ⇒ ~6.4× SMP scaling
   * - **femorph PRIMME + ``mkl_direct`` (taskset -c 0-7)**
     - **631.7**
     - 24,642
     - **0.80× (1.25× faster than MAPDL)**
     - 23 threads pile up onto 8 cores; MUMPS / MKL / OpenMP runtimes don't all
       honour ``OMP_NUM_THREADS=8``, so context-switch overhead is real

So **femorph beats MAPDL by 1.25× on the cyclic-modal sweep at the same strict
8-thread cap**, with essentially the same peak memory (about 24 GiB on both
sides). The PRIMME JDQMR algorithm + a sparse direct factor (Pardiso / MUMPS /
MKL Direct) really is the right tool for this workload.

Thread-cap honesty note
~~~~~~~~~~~~~~~~~~~~~~~

The unconstrained femorph bench (env vars only, no ``taskset``) finishes in
**447 s** on the same host — but that's because MUMPS' libmumps internal pool,
MKL Pardiso's worker pool, and SciPy's BLAS pool *each* spawn their own threads
on top of the ``OMP_NUM_THREADS=8`` env cap. Sampling ``/proc/self/status``
mid-solve shows ``Threads: 23`` even with all the standard env vars set to 8.
``taskset -c 0-7`` is the only mechanically-enforced 8-core cap on this host.
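One way to reproduce that observation is to poll the kernel's own count from a
watcher thread while the solve is in flight — a minimal sketch, assuming a
Linux host (``/proc/self/status`` is Linux-specific); the bench's own sampling
may differ in detail:

.. code-block:: python

   import re
   import threading
   import time


   def watch_thread_count(stop, interval=5.0):
       """Print the kernel's view of this process's thread count."""
       while not stop.is_set():
           with open("/proc/self/status") as f:
               status = f.read()
           # The Threads: field counts every thread the process owns --
           # OpenMP, MKL, MUMPS and BLAS worker pools included.
           n = int(re.search(r"^Threads:\s+(\d+)", status, re.M).group(1))
           print(f"Threads: {n}")
           time.sleep(interval)


   stop = threading.Event()
   # The watcher itself adds one thread to the count it reports.
   threading.Thread(target=watch_thread_count, args=(stop,), daemon=True).start()
   # ... run the cyclic sweep here ...
   stop.set()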
* If your workload is the only thing on the box → the 447 s number is what
  you'll see.
* If you share the host (CI, multi-tenant, batch system) → ``taskset -c 0-N``
  and budget ~40 % overhead vs the unconstrained number.

MAPDL is fairer about its ``-np 8`` flag — its summed-thread CP time (5045 s)
÷ wall (787 s) gives 6.4× effective parallelism, consistent with 8 threads
bound to 8 cores with normal serial-section overhead.

Results
-------

.. list-table::
   :header-rows: 1
   :widths: 22 16 16 14 14 18

   * - Backend
     - Eigen × Linear
     - Wall (s)
     - Peak RSS (MiB)
     - Speedup vs ARPACK
     - Notes
   * - femorph-solver
     - ARPACK + ``mumps``
     - 1120.78
     - 33,012
     - 1.00× (baseline)
     - reference
   * - femorph-solver
     - ARPACK + ``pardiso``
     - 1144.59
     - 33,012
     - 0.98×
     - same factor cost; minor wrapper overhead
   * - femorph-solver
     - ARPACK + ``mkl_direct``
     - 1148.60
     - 33,012
     - 0.98×
     - same factor cost; equivalent to pardiso here
   * - femorph-solver
     - **PRIMME + ``mumps``**
     - **514.35**
     - 34,198
     - **2.18×**
     - JDQMR amortises factor across harmonics
   * - femorph-solver
     - **PRIMME + ``pardiso``**
     - 447.81
     - 34,198
     - 2.50×
     - second-fastest
   * - femorph-solver
     - **PRIMME + ``mkl_direct``**
     - **447.00**
     - 34,198
     - **2.51×**
     - **fastest combination**

f₁ at every harmonic agrees to ≥ 6 digits across all six combinations.
``f₁(k=0) = 0`` is the rigid-body harmonic — expected for a free-floating
cyclic rotor at :math:`k = 0`.

Headlines
~~~~~~~~~

* **PRIMME with OPinv is the right choice for cyclic-modal** sweeps. The 2.5×
  speedup is the JDQMR algorithm correctly amortising the factor across the
  11-harmonic ladder via v0-warm-start chaining (``solve_cyclic`` passes each
  harmonic's eigenvectors as the next harmonic's seed).
* **The linear backend is nearly a wash on this workload.** The three direct
  solvers (``mumps``, ``pardiso``, ``mkl_direct``) land within 3 % of each
  other on the ARPACK path; on the PRIMME path ``pardiso`` and ``mkl_direct``
  are within 0.2 %, with ``mumps`` trailing by ~15 %. The per-harmonic cost is
  dominated by the same symbolic + numeric factorisation, and the wrapper
  overhead is in the noise.
* **PRIMME memory cost: +1.2 GiB (3.6 %)** over ARPACK. The block-Davidson
  workspace + soft-locking buffers add a few hundred MB per harmonic; well
  worth the 2–2.5× wall reduction.

Why PRIMME wins here but loses on the simple modal path
--------------------------------------------------------

The simple modal entry point's autotune deliberately picks ARPACK now (see
:doc:`pbs-rotor-comparison` for the full rationale + bench). PRIMME without an
OPinv falls over with matvec error -41 on the same 500k-DOF K. The cyclic path
sidesteps the issue by always providing OPinv built from the chosen linear
backend (``femorph_solver.solvers.cyclic.solve_cyclic`` → line 1482 of
``cyclic.py``). PRIMME never has to fall back to its own iterative inner
solve, so the matvec error never arises.
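In outline, that OPinv wiring looks like the sketch below. SciPy's ``splu``
stands in for the Pardiso / MUMPS / MKL Direct factor that ``solve_cyclic``
actually builds, and the function name, defaults and ``v0`` plumbing are
illustrative assumptions rather than the production API:

.. code-block:: python

   import scipy.sparse.linalg as spla
   import primme


   def harmonic_eigsh(K_red, M_red, sigma=0.0, n_modes=4, v0=None):
       """Sketch of one per-harmonic eigensolve with an explicit OPinv."""
       # Factor the shifted operator once; the production path hands this to
       # the chosen linear backend (Pardiso / MUMPS / MKL Direct) -- splu is
       # only a stand-in here.
       lu = spla.splu((K_red - sigma * M_red).tocsc())
       n = K_red.shape[0]
       op_inv = spla.LinearOperator((n, n), matvec=lu.solve, dtype=K_red.dtype)

       # OPinv gives PRIMME the exact inverse of (K_red - sigma*M_red), so it
       # never falls back to its own iterative inner solve (the failure mode
       # described above). v0 is where a previous harmonic's eigenvectors
       # could be fed back in as the next harmonic's seed.
       vals, vecs = primme.eigsh(K_red, k=n_modes, M=M_red, sigma=sigma,
                                 OPinv=op_inv, v0=v0)
       return vals, vecs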
Recommendation: install PRIMME (``pip install primme``, or use the
femorph-built wheels from ``femorph/pprime`` documented in
:doc:`solver-defaults-and-threading`) **when you run cyclic-modal analyses on
rotor decks**. The 2.5× speedup at the cost of +3.6 % memory is the best ROI
in the solver stack for that workload.

References
----------

* Bench script: :file:`perf/bench_pbs_cyclic.py`
* Cyclic-symmetry modal API: :func:`femorph_solver.solvers.cyclic.solve_cyclic`
* PRIMME paper (JDQMR + GD+k restart): Stathopoulos & McCombs (2010),
  *ACM Trans. Math. Softw.*, vol. 37 (2), 21:1–21:30.