PBS rotor — cyclic-modal sweep at 8 threads#

Companion to PBS rotor — solver head-to-head at 8 threads — the same PBS sector, but solved as a cyclic-symmetry modal sweep (full harmonic ladder \(k = 0, 1, \ldots, N/2\)) instead of a single non-cyclic modal. This is the workload that actually motivates installing PRIMME.

The 20-sector PBS rotor closes at \(N = 20\), so the sweep covers 11 harmonics (\(k = 0, 1, \ldots, 10\)). Each harmonic solves a complex Hermitian generalised eigenproblem of the same ~half-million-DOF size as on the simple-modal page; the linear backend’s factor is reused across harmonics, so the eigensolver’s restart efficiency dominates total wall time.
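
For reference, the standard cyclic-symmetry reduction behind each harmonic (femorph’s exact interface elimination may differ in detail): the high and low cyclic interfaces of the sector are tied by the inter-sector phase

\[
u_{\text{high}} = e^{\,i\theta_k}\,u_{\text{low}},\qquad
\theta_k = \frac{2\pi k}{N},\qquad
k = 0, 1, \ldots, N/2,
\]

and eliminating the duplicated interface DOFs leaves the per-harmonic complex Hermitian pencil

\[
K_k\,\varphi = \omega^2\,M_k\,\varphi .
\]

With \(N = 20\) that gives \(N/2 + 1 = 11\) pencils.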

Workload#

| field | value |
|---|---|
| deck | pbs-sector-hd.cdb (single sector) |
| sectors N | 20 (auto-detected) |
| harmonics solved | 11 — full sweep k = 0..10 |
| modes per k | 4 |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| K nnz | 75,362,094 |
| paired interfaces | 2,005 nodes |
| material | steel — E = 200 GPa, ν = 0.3, ρ = 7850 kg/m³ |

The mesh image is the same as in PBS rotor — solver head-to-head at 8 threads.

Bench harness#

perf/bench_pbs_cyclic.py drives solve_cyclic over all viable backend × eigen combinations at 8 threads. Same env-var seed pattern as the simple-modal bench (OMP_NUM_THREADS = 8 before numpy import); peak RSS via getrusage; per-backend wall budget 1800 s.
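
A minimal sketch of that seed-and-measure pattern (the names run_case and solve_fn are illustrative, not the harness’s real API; the actual driver is perf/bench_pbs_cyclic.py):

```python
import os
import resource
import time

os.environ["OMP_NUM_THREADS"] = "8"   # must be set before numpy/scipy are imported
import numpy as np                     # noqa: E402 -- import deliberately after the env seed

WALL_BUDGET_S = 1800                   # per-backend wall budget from the bench

def run_case(solve_fn, *args, **kwargs):
    """Time one backend x eigen combination and report peak RSS.

    On Linux, getrusage reports ru_maxrss in KiB, hence /1024 to MiB.
    """
    t0 = time.perf_counter()
    result = solve_fn(*args, **kwargs)
    wall = time.perf_counter() - t0
    peak_mib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024
    if wall > WALL_BUDGET_S:
        print(f"over budget: {wall:.1f} s > {WALL_BUDGET_S} s")
    return result, wall, peak_mib
```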

Eigen backends in cyclic:

  • scipy — routes to ARPACK (scipy.sparse.linalg.eigsh) for the complex-Hermitian per-harmonic eigensolve.

  • primme — primme.eigsh with OPinv built from the chosen linear backend (i.e. PRIMME calls (K_red - σ M_red)^{-1} through Pardiso/MUMPS/etc.; see the sketch after this list). This is the path where PRIMME’s JDQMR algorithm pays off and where the simple-modal wrapper’s matvec-error issue (PBS rotor — solver head-to-head at 8 threads) doesn’t show up.
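
A condensed sketch of the two per-harmonic eigensolve paths. scipy.sparse.linalg.splu stands in for the Pardiso/MUMPS/MKL factor the real linear backends provide, sigma=-1.0 is an illustrative shift rather than the value solve_cyclic uses, and the primme call assumes primme.eigsh mirrors scipy’s sigma/which/OPinv keywords as its scipy-compatible interface documents:

```python
import scipy.sparse.linalg as spla
import primme  # pip install primme

def eig_one_harmonic(K_k, M_k, n_modes=4, sigma=-1.0, eigen="primme"):
    """Solve one harmonic's reduced pencil K_k x = lam M_k x near sigma."""
    # Factor (K_k - sigma*M_k) once; splu is a stand-in for the backend factor.
    lu = spla.splu((K_k - sigma * M_k).tocsc())
    OPinv = spla.LinearOperator(K_k.shape, matvec=lu.solve, dtype=K_k.dtype)

    if eigen == "scipy":
        # ARPACK shift-invert path (eigen="scipy")
        return spla.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                          which="LM", OPinv=OPinv)

    # PRIMME path (eigen="primme"): the same exact factor is handed in as OPinv,
    # so PRIMME never falls back to its own iterative inner solve.
    return primme.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                        which="LM", OPinv=OPinv)
```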

MAPDL parity at strict 8-thread cap#

For a fair head-to-head we ran MAPDL Block Lanczos on the same sector with the same cyclic boundary conditions, the same 11 harmonics, 4 modes per harmonic, -np 8 SMP under taskset -c 0-7 (the default in ansys-docker/start-ansys.sh on this host). Then we re-ran the femorph PRIMME + ``mkl_direct`` combo under the same taskset -c 0-7 so neither side can oversubscribe past 8 cores.

| Solver | Wall (s) | Peak RSS (MiB) | Ratio vs MAPDL | Notes |
|---|---|---|---|---|
| MAPDL 22.2 LANB ``-np 8`` | 787.5 | 24,506 | 1.00× (baseline) | SMP mode (no MPI in this image); CP time 5045 s ⇒ ~6.4× SMP scaling |
| femorph PRIMME + ``mkl_direct`` (taskset -c 0-7) | 631.7 | 24,642 | 0.80× (1.25× faster than MAPDL) | 23 threads pile up onto 8 cores; MUMPS / MKL / OpenMP runtimes don’t all honour OMP_NUM_THREADS=8, so context-switch overhead is real |

So femorph beats MAPDL by 1.25× on the cyclic-modal sweep at the same strict 8-thread cap, with essentially the same peak memory (~24.5 GB on both sides). The PRIMME JDQMR algorithm + a sparse direct factor (Pardiso / MUMPS / MKL Direct) really is the right tool for this workload.

Thread-cap honesty note#

The unconstrained femorph bench (env vars only, no taskset) finishes in 447 s on the same host — but that’s because MUMPS’ libmumps internal pool, MKL Pardiso’s worker pool, and SciPy’s BLAS pool each spawn their own threads on top of the OMP_NUM_THREADS=8 env cap. Sampling /proc/self/status mid-solve shows Threads: 23 even with all the standard env vars set to 8. taskset -c 0-7 is the only mechanically enforced 8-core cap on this host.

  • If your workload is the only thing on the box → the 447 s number is what you’ll see.

  • If you share the host (CI, multi-tenant, batch system) → taskset -c 0-N and budget ~40 % overhead vs the unconstrained number.

MAPDL is fairer about its -np 8 flag — its summed-thread CP time (5045 s) ÷ wall (787 s) gives 6.4× effective parallelism, consistent with 8 threads bounded to 8 cores with normal serial-section overhead.
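
If you want the same cap and the same thread-count check from inside Python rather than via taskset, something like the following works on Linux (a sketch, not part of the bench harness):

```python
import os

# Pin the interpreter (and every thread spawned afterwards) to cores 0-7 --
# the in-process equivalent of `taskset -c 0-7`.  Call this before importing
# numpy/scipy/femorph so the MUMPS/MKL/BLAS worker pools inherit the mask.
os.sched_setaffinity(0, set(range(8)))

def kernel_thread_count() -> int:
    """Read the kernel's live thread count for this process -- the same
    figure as the 'Threads: 23' line observed mid-solve."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return 1
```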

Results#

| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | Speedup vs ARPACK | Notes |
|---|---|---|---|---|---|
| femorph-solver | ARPACK + ``mumps`` | 1120.78 | 33,012 | 1.00× (baseline) | reference |
| femorph-solver | ARPACK + ``pardiso`` | 1144.59 | 33,012 | 0.98× | same factor cost; minor wrapper overhead |
| femorph-solver | ARPACK + ``mkl_direct`` | 1148.60 | 33,012 | 0.98× | same factor cost; equivalent to pardiso here |
| femorph-solver | PRIMME + ``mumps`` | 514.35 | 34,198 | 2.18× | JDQMR amortises factor across harmonics |
| femorph-solver | PRIMME + ``pardiso`` | 447.81 | 34,198 | 2.50× | second-fastest |
| femorph-solver | PRIMME + ``mkl_direct`` | 447.00 | 34,198 | 2.51× | fastest combination |

f₁ at every harmonic agrees to ≥ 6 digits across all six eigen × linear combinations. f₁(k=0) = 0 is the rigid-body harmonic — expected for a free-floating cyclic rotor at \(k = 0\).

Headlines#

  • PRIMME with OPinv is the right choice for cyclic-modal sweeps. The 2.5× speedup is the JDQMR algorithm correctly amortising the factor across the 11-harmonic ladder via v0-warm-start chaining (solve_cyclic passes each harmonic’s eigenvectors as the next harmonic’s seed; sketched after this list).

  • The linear backend is a wash on this workload. All three direct solvers (mumps, pardiso, mkl_direct) land within 3 % of each other on both ARPACK and PRIMME paths — the per-harmonic factor cost is dominated by the same MUMPS / MKL Pardiso symbolic+numeric, and the wrapper overhead is in the noise.

  • PRIMME memory cost: +1.2 GB (3.6 %) over ARPACK. The block-Davidson workspace + soft-locking buffers add a few hundred MB per harmonic; well worth the 2.5× wall reduction.
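
A hypothetical view of the warm-start chaining mentioned above. The loop, reduced_pencils, the sigma value, and the splu stand-in for the backend factor are all illustrative assumptions (the real bookkeeping lives in femorph_solver/solvers/cyclic.py), and it assumes every harmonic’s reduced pencil has the same size and dtype so the previous eigenvectors can seed the next solve via v0:

```python
import scipy.sparse.linalg as spla
import primme

def sweep_harmonics(reduced_pencils, n_modes=4, sigma=-1.0):
    """Walk the harmonic ladder k = 0..N/2, seeding each PRIMME solve with
    the previous harmonic's eigenvectors so JDQMR restarts from a good
    subspace instead of from scratch."""
    freqs, v_seed = {}, None
    for k, (K_k, M_k) in sorted(reduced_pencils.items()):
        lu = spla.splu((K_k - sigma * M_k).tocsc())     # backend-factor stand-in
        OPinv = spla.LinearOperator(K_k.shape, matvec=lu.solve, dtype=K_k.dtype)
        w, v = primme.eigsh(K_k, k=n_modes, M=M_k, sigma=sigma,
                            which="LM", OPinv=OPinv, v0=v_seed)
        freqs[k] = w
        v_seed = v                                      # warm-start the next harmonic
    return freqs
```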

Why PRIMME wins here but loses on the simple modal path#

The simple modal entry point’s autotune deliberately picks ARPACK now (see PBS rotor — solver head-to-head at 8 threads for the full rationale + bench). PRIMME without an OPinv falls over with matvec error -41 on the same 500k-DOF K.

The cyclic path sidesteps the issue by always providing OPinv built from the chosen linear backend (femorph_solver.solvers.cyclic.solve_cyclic → line 1482 of cyclic.py). PRIMME never has to fall back to its own iterative inner solve, so the matvec error never arises.

Recommendation: install PRIMME (pip install primme, or use the femorph-built wheels from femorph/pprime documented in Solver defaults and threading) when you run cyclic-modal analyses on rotor decks. The 2.5× speedup at the cost of +3.6 % memory is the best ROI in the solver stack for that workload.

References#

  • Bench script: perf/bench_pbs_cyclic.py

  • Cyclic-symmetry modal API: femorph_solver.solvers.cyclic.solve_cyclic()

  • PRIMME paper (JDQMR + GD+k restart): Stathopoulos & McCombs (2010), ACM Trans. Math. Softw., vol 37 (2), 21:1–21:30.