PBS rotor — cyclic-modal sweep at 8 threads#

Companion to PBS rotor — solver head-to-head at 8 threads — the same PBS sector, but solved as a cyclic-symmetry modal sweep (full harmonic ladder \(k = 0, 1, \ldots, N/2\)) instead of a single non-cyclic modal. This is the workload that actually motivates installing PRIMME.

The 20-sector PBS rotor closes at \(N = 20\), so the sweep covers 11 harmonics (\(k = 0, 1, \ldots, 10\)). Each harmonic solves a complex Hermitian generalised eigenproblem of the same ~half-million-DOF size as on the simple-modal page; the linear backend’s factor is reused across harmonics, so the eigensolver’s restart efficiency dominates total wall time.
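
For reference, the standard cyclic-symmetry reduction behind each harmonic (femorph’s exact interface elimination may differ in detail): the high and low cyclic interfaces of the sector are tied by the inter-sector phase

\[
u_{\text{high}} = e^{\,i\theta_k}\,u_{\text{low}},\qquad
\theta_k = \frac{2\pi k}{N},\qquad
k = 0, 1, \ldots, N/2,
\]

and eliminating the duplicated interface DOFs leaves the per-harmonic complex Hermitian pencil

\[
K_k\,\varphi = \omega^2\,M_k\,\varphi .
\]

With \(N = 20\) that gives \(N/2 + 1 = 11\) pencils.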

Workload#

| field | value |
|---|---|
| deck | pbs-sector-hd.cdb (single sector) |
| sectors N | 20 (auto-detected) |
| harmonics solved | 11 — full sweep k = 0..10 |
| modes per k | 4 |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| K nnz | 75,362,094 |
| paired interfaces | 2,005 nodes |
| material | steel — E = 200 GPa, ν = 0.3, ρ = 7850 kg/m³ |

The mesh image is the same as in PBS rotor — solver head-to-head at 8 threads.

Bench harness#

perf/bench_pbs_cyclic.py drives solve_cyclic over all viable backend × eigen combinations at 8 threads. Same env-var seed pattern as the simple-modal bench (OMP_NUM_THREADS = 8 before numpy import); peak RSS via getrusage; per-backend wall budget 1800 s.
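
A minimal sketch of that seed-and-measure pattern (the names run_case and solve_fn are illustrative, not the harness’s real API; the actual driver is perf/bench_pbs_cyclic.py):

```python
import os
import resource
import time

os.environ["OMP_NUM_THREADS"] = "8"   # must be set before numpy/scipy are imported
import numpy as np                     # noqa: E402 -- import deliberately after the env seed

WALL_BUDGET_S = 1800                   # per-backend wall budget from the bench

def run_case(solve_fn, *args, **kwargs):
    """Time one backend x eigen combination and report peak RSS.

    On Linux, getrusage reports ru_maxrss in KiB, hence /1024 to MiB.
    """
    t0 = time.perf_counter()
    result = solve_fn(*args, **kwargs)
    wall = time.perf_counter() - t0
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    if wall > WALL_BUDGET_S:
        print(f"over budget: {wall:.1f} s > {WALL_BUDGET_S} s")
    return result, wall, peak_mib
```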

Eigen backends in cyclic:

  • scipy — routes to ARPACK (scipy.sparse.linalg.eigsh) for the complex-Hermitian per-harmonic eigensolve.

  • primme — primme.eigsh with OPinv built from the chosen linear backend (i.e. PRIMME calls (K_red - σ M_red)^{-1} through Pardiso/MUMPS/etc.; see the sketch after this list). This is the path where PRIMME’s JDQMR algorithm pays off and where the simple-modal wrapper’s matvec-error issue (PBS rotor — solver head-to-head at 8 threads) doesn’t show up.
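
A condensed sketch of the two per-harmonic eigensolve paths. scipy.sparse.linalg.splu stands in for the Pardiso/MUMPS/MKL factor the real linear backends provide, sigma=-1.0 is an illustrative shift rather than the value solve_cyclic uses, and the primme call assumes primme.eigsh mirrors scipy’s sigma/which/OPinv keywords as its scipy-compatible interface documents:

```python
import scipy.sparse.linalg as spla
import primme  # pip install primme

def eig_one_harmonic(K_k, M_k, n_modes=4, sigma=-1.0, eigen="primme"):
    """Solve one harmonic's reduced pencil K_k x = lam M_k x near sigma."""
    # Factor (K_k - sigma*M_k) once; splu is a stand-in for the backend factor.
    lu = spla.splu((K_k - sigma * M_k).tocsc())
    OPinv = spla.LinearOperator(K_k.shape, matvec=lu.solve, dtype=K_k.dtype)

    if eigen == "scipy":
        # ARPACK shift-invert path (eigen="scipy")
        return spla.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                          which="LM", OPinv=OPinv)

    # PRIMME path (eigen="primme"): the same exact factor is handed in as OPinv,
    # so PRIMME never falls back to its own iterative inner solve.
    return primme.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                        which="LM", OPinv=OPinv)
```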

MAPDL parity at strict 8-thread cap#

For a fair head-to-head we ran MAPDL Block Lanczos on the same sector with the same cyclic boundary conditions, the same 11 harmonics, 4 modes per harmonic, -np 8 SMP under taskset -c 0-7 (the default in ansys-docker/start-ansys.sh on this host). Then we re-ran the femorph PRIMME + ``mkl_direct`` combo under the same taskset -c 0-7 so neither side can oversubscribe past 8 cores.

| Solver | Wall (s) | Peak RSS (MiB) | Ratio vs MAPDL | Notes |
|---|---|---|---|---|
| MAPDL 22.2 LANB ``-np 8`` | 787.5 | 24,506 | 1.00× (baseline) | SMP mode (no MPI in this image); CP time 5045 s ⇒ ~6.4× SMP scaling |
| femorph PRIMME + ``mkl_direct`` (taskset -c 0-7) | 631.7 | 24,642 | 0.80× (1.25× faster than MAPDL) | 23 threads pile up onto 8 cores; MUMPS / MKL / OpenMP runtimes don’t all honour OMP_NUM_THREADS=8, so context-switch overhead is real |

So femorph beats MAPDL by 1.25× on the cyclic-modal sweep at the same strict 8-thread cap, with essentially the same peak memory (~24.5 GB on both sides). The PRIMME JDQMR algorithm + a sparse direct factor (Pardiso / MUMPS / MKL Direct) really is the right tool for this workload.

Thread-cap honesty note#

The unconstrained femorph bench (env vars only, no taskset) finishes in 447 s on the same host — but that’s because MUMPS’ libmumps internal pool, MKL Pardiso’s worker pool, and SciPy’s BLAS pool each spawn their own threads on top of the OMP_NUM_THREADS=8 env cap. Sampling /proc/self/status mid-solve shows Threads: 23 even with all the standard env vars set to 8. taskset -c 0-7 is the only mechanically enforced 8-core cap on this host.

  • If your workload is the only thing on the box → the 447 s number is what you’ll see.

  • If you share the host (CI, multi-tenant, batch system) → taskset -c 0-N and budget ~40 % overhead vs the unconstrained number.

MAPDL is fairer about its -np 8 flag — its summed-thread CP time (5045 s) ÷ wall (787 s) gives 6.4× effective parallelism, consistent with 8 threads bounded to 8 cores with normal serial-section overhead.
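
If you want the same cap and the same thread-count check from inside Python rather than via taskset, something like the following works on Linux (a sketch, not part of the bench harness):

```python
import os

# Pin the interpreter (and every thread spawned afterwards) to cores 0-7 --
# the in-process equivalent of `taskset -c 0-7`.  Call this before importing
# numpy/scipy/femorph so the MUMPS/MKL/BLAS worker pools inherit the mask.
os.sched_setaffinity(0, set(range(8)))

def kernel_thread_count() -> int:
    """Read the kernel's live thread count for this process -- the same
    figure as the 'Threads: 23' line observed mid-solve."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 1
```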

Results#

| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | Speedup vs ARPACK | Notes |
|---|---|---|---|---|---|
| femorph-solver | ARPACK + ``mumps`` | 1120.78 | 33,012 | 1.00× (baseline) | reference |
| femorph-solver | ARPACK + ``pardiso`` | 1144.59 | 33,012 | 0.98× | same factor cost; minor wrapper overhead |
| femorph-solver | ARPACK + ``mkl_direct`` | 1148.60 | 33,012 | 0.98× | same factor cost; equivalent to pardiso here |
| femorph-solver | PRIMME + ``mumps`` | 514.35 | 34,198 | 2.18× | JDQMR amortises factor across harmonics |
| femorph-solver | PRIMME + ``pardiso`` | 447.81 | 34,198 | 2.50× | second-fastest |
| femorph-solver | PRIMME + ``mkl_direct`` | 447.00 | 34,198 | 2.51× | fastest combination |

f₁ at every harmonic agrees to ≥ 6 digits across all six eigen × linear combinations. f₁(k=0) = 0 is the rigid-body harmonic — expected for a free-floating cyclic rotor at \(k = 0\).

Headlines#

  • PRIMME with OPinv is the right choice for cyclic-modal sweeps. The 2.5× speedup is the JDQMR algorithm correctly amortising the factor across the 11-harmonic ladder via v0-warm-start chaining (solve_cyclic passes each harmonic’s eigenvectors as the next harmonic’s seed; sketched after this list).

  • The linear backend is a wash on this workload. All three direct solvers (mumps, pardiso, mkl_direct) land within 3 % of each other on both ARPACK and PRIMME paths — the per-harmonic factor cost is dominated by the same MUMPS / MKL Pardiso symbolic+numeric, and the wrapper overhead is in the noise.

  • PRIMME memory cost: +1.2 GB (3.6 %) over ARPACK. The block-Davidson workspace + soft-locking buffers add a few hundred MB per harmonic; well worth the 2.5× wall reduction.
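
A hypothetical view of the warm-start chaining mentioned above. The loop, reduced_pencils, the sigma value, and the splu stand-in for the backend factor are all illustrative assumptions (the real bookkeeping lives in femorph_solver/solvers/cyclic.py), and it assumes every harmonic’s reduced pencil has the same size and dtype so the previous eigenvectors can seed the next solve via v0:

```python
import scipy.sparse.linalg as spla
import primme

def sweep_harmonics(reduced_pencils, n_modes=4, sigma=-1.0):
    """Walk the harmonic ladder k = 0..N/2, seeding each PRIMME solve with
    the previous harmonic's eigenvectors so JDQMR restarts from a good
    subspace instead of from scratch."""
    freqs, v_seed = {}, None
    for k, (K_k, M_k) in sorted(reduced_pencils.items()):
        lu = spla.splu((K_k - sigma * M_k).tocsc())     # backend-factor stand-in
        OPinv = spla.LinearOperator(K_k.shape, matvec=lu.solve, dtype=K_k.dtype)
        w, v = primme.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                            which="LM", OPinv=OPinv, v0=v_seed)
        freqs[k] = w
        v_seed = v                                      # warm-start the next harmonic
    return freqs
```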

Why PRIMME wins here but loses on the simple modal path#

The simple modal entry point’s autotune deliberately picks ARPACK now (see PBS rotor — solver head-to-head at 8 threads for the full rationale + bench). PRIMME without an OPinv falls over with matvec error -41 on the same 500k-DOF K.

The cyclic path sidesteps the issue by always providing OPinv built from the chosen linear backend (femorph_solver.solvers.cyclic.solve_cyclic → line 1482 of cyclic.py). PRIMME never has to fall back to its own iterative inner solve, so the matvec error never arises.

Recommendation: install PRIMME (pip install primme, or use the femorph-built wheels from femorph/pprime documented in Solver defaults and threading) when you run cyclic-modal analyses on rotor decks. The 2.5× speedup at the cost of +3.6 % memory is the best ROI in the solver stack for that workload.

References#

  • Bench script: perf/bench_pbs_cyclic.py

  • Cyclic-symmetry modal API: femorph_solver.solvers.cyclic.solve_cyclic()

  • PRIMME paper (JDQMR + GD+k restart): Stathopoulos & McCombs (2010), ACM Trans. Math. Softw., vol 37 (2), 21:1–21:30.