# PBS rotor — cyclic-modal sweep at 8 threads

Companion to PBS rotor — solver head-to-head at 8 threads: the same PBS sector, but solved as a cyclic-symmetry modal sweep (the full harmonic ladder \(k = 0, 1, \ldots, N/2\)) instead of a single non-cyclic modal solve. This is the workload that actually motivates installing PRIMME.
The 20-sector PBS rotor closes at \(N = 20\), so the sweep covers 11 harmonics (\(k = 0, 1, \ldots, 10\)). Each harmonic solves a complex Hermitian generalised eigenproblem of the same ~half-million-DOF size as the simple-modal page; the linear backend’s factor is reused across harmonics, so the eigensolver’s restart efficiency dominates total wall time.
## Workload

| field | value |
|---|---|
| deck | |
| sectors N | 20 (auto-detected) |
| harmonics solved | 11 — full sweep k = 0..10 |
| modes per k | 4 |
| nodes | 164,922 |
| elements | 41,084 |
| total DOFs | 494,766 |
| K nnz | 75,362,094 |
| paired interfaces | 2,005 nodes |
| material | steel — E=200 GPa, ν=0.3, ρ=7850 kg/m³ |
The mesh image is the same as in PBS rotor — solver head-to-head at 8 threads.
## Bench harness

`perf/bench_pbs_cyclic.py` drives `solve_cyclic` over all viable backend × eigen combinations at 8 threads. Same env-var seed pattern as the simple-modal bench (`OMP_NUM_THREADS=8` set before the numpy import); peak RSS via `getrusage`; per-backend wall budget of 1800 s.
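For readers reproducing the setup, a minimal sketch of that seed pattern (not the actual contents of `perf/bench_pbs_cyclic.py`; the `MKL_`/`OPENBLAS_` variables are an assumption beyond the `OMP_NUM_THREADS` cap named above):

```python
import os

# Thread caps must be exported before numpy/scipy are imported, or the
# OpenMP/BLAS pools size themselves to the whole machine.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ.setdefault(var, "8")

import resource

import numpy as np  # imported only after the caps are set


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (ru_maxrss is KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0
```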
Eigen backends in cyclic:
- `scipy` — routes to ARPACK (`scipy.sparse.linalg.eigsh`) on the dense-Hermitian per-harmonic eigensolve.
- `primme` — `primme.eigsh` with OPinv built from the chosen linear backend (i.e. PRIMME calls \((K_{\mathrm{red}} - \sigma M_{\mathrm{red}})^{-1}\) through Pardiso/MUMPS/etc.). This is the path where PRIMME's JDQMR algorithm pays off and where the simple-modal wrapper's matvec-error issue (PBS rotor — solver head-to-head at 8 threads) doesn't show up.
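As a rough sketch of what that OPinv amounts to, here is the shift-invert pattern in SciPy terms. The `splu` factor and the `K_red`/`M_red` names are stand-ins (the real operator wraps the chosen Pardiso/MUMPS/MKL factor), and per the above `primme.eigsh` accepts the same `OPinv` argument as `scipy.sparse.linalg.eigsh`:

```python
import scipy.sparse as sp
import scipy.sparse.linalg as spla


def shift_invert_opinv(K_red, M_red, sigma):
    """LinearOperator applying (K_red - sigma*M_red)^{-1} via a sparse LU factor.

    Stand-in for the real backends: femorph builds the equivalent operator
    from Pardiso/MUMPS/MKL Direct instead of SciPy's splu.
    """
    factor = spla.splu(sp.csc_matrix(K_red - sigma * M_red))
    return spla.LinearOperator(K_red.shape, matvec=factor.solve, dtype=K_red.dtype)


# Per-harmonic eigensolve: K_red, M_red are the reduced Hermitian matrices for
# harmonic k, and a small negative shift keeps the factor non-singular at the
# rigid-body harmonic. Swap spla.eigsh for primme.eigsh to exercise PRIMME.
# opinv = shift_invert_opinv(K_red, M_red, sigma=-1.0)
# vals, vecs = spla.eigsh(K_red, k=4, M=M_red, sigma=-1.0, OPinv=opinv, which="LM")
```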
## MAPDL parity at strict 8-thread cap

For a fair head-to-head we ran MAPDL Block Lanczos on the same sector with the same cyclic boundary conditions, the same 11 harmonics, and 4 modes per harmonic, using `-np 8` SMP under `taskset -c 0-7` (the default in `ansys-docker/start-ansys.sh` on this host). We then re-ran the femorph PRIMME + `mkl_direct` combo under the same `taskset -c 0-7` so neither side can oversubscribe past 8 cores.
| Solver | Wall (s) | Peak RSS (MiB) | Ratio vs MAPDL | Notes |
|---|---|---|---|---|
| MAPDL 22.2 LANB `-np 8` | 787.5 | 24,506 | 1.00× (baseline) | SMP mode (no MPI in this image); CP time 5045 s ⇒ ~6.4× SMP scaling |
| femorph PRIMME + `mkl_direct` (`taskset -c 0-7`) | 631.7 | 24,642 | 0.80× (1.25× faster than MAPDL) | 23 threads pile up onto 8 cores; MUMPS / MKL / OpenMP runtimes don't all honour the env-var caps |
So femorph beats MAPDL by 1.25× on the cyclic-modal sweep at the same strict 8-thread cap, with essentially the same peak memory (~24.5 GB on both sides). The PRIMME JDQMR algorithm + a sparse direct factor (Pardiso / MUMPS / MKL Direct) really is the right tool for this workload.
## Thread-cap honesty note

The unconstrained femorph bench (env vars only, no taskset) finishes in 447 s on the same host — but that's because MUMPS' internal `libmumps` pool, MKL Pardiso's worker pool, and SciPy's BLAS pool each spawn their own threads on top of the `OMP_NUM_THREADS=8` env cap. Sampling `/proc/self/status` mid-solve shows `Threads: 23` even with all the standard env vars set to 8. `taskset -c 0-7` is the only mechanically-enforced 8-core cap on this host.
- If your workload is the only thing on the box → the 447 s number is what you'll see.
- If you share the host (CI, multi-tenant, batch system) → use `taskset -c 0-N` and budget ~40 % overhead vs the unconstrained number.
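For completeness, two small Linux-only helpers that reproduce the observation above: reading the `Threads:` count from `/proc/self/status`, and `os.sched_setaffinity` as an in-process analogue of the `taskset -c 0-7` pin (the published numbers were produced with `taskset` itself; this is just a sketch):

```python
import os


def live_thread_count() -> int:
    """Current thread count of this process, read from /proc/self/status (Linux)."""
    with open("/proc/self/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("no Threads: field in /proc/self/status")


def pin_to_first_n_cores(n: int = 8) -> None:
    """In-process analogue of `taskset -c 0-(n-1)`: restrict this process to n cores."""
    os.sched_setaffinity(0, range(n))
```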
MAPDL is fairer about its `-np 8` flag — its summed-thread CP time (5045 s) ÷ wall (787 s) gives 6.4× effective parallelism, consistent with 8 threads bounded to 8 cores with normal serial-section overhead.
## Results

| Backend | Eigen × Linear | Wall (s) | Peak RSS (MiB) | Speedup vs ARPACK | Notes |
|---|---|---|---|---|---|
| femorph-solver | ARPACK + | 1120.78 | 33,012 | 1.00× (baseline) | reference |
| femorph-solver | ARPACK + | 1144.59 | 33,012 | 0.98× | same factor cost; minor wrapper overhead |
| femorph-solver | ARPACK + | 1148.60 | 33,012 | 0.98× | same factor cost; equivalent to pardiso here |
| femorph-solver | PRIMME + `mumps` | 514.35 | 34,198 | 2.18× | JDQMR amortises factor across harmonics |
| femorph-solver | PRIMME + `pardiso` | 447.81 | 34,198 | 2.50× | second-fastest |
| femorph-solver | PRIMME + `mkl_direct` | 447.00 | 34,198 | 2.51× | fastest combination |
\(f_1\) at every harmonic agrees to ≥ 6 digits across all six eigen × linear combinations. \(f_1(k=0) = 0\) is the rigid-body harmonic — expected for a free-floating cyclic rotor at \(k = 0\).
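One reasonable reading of "agrees to ≥ 6 digits" is a relative match at roughly \(10^{-6}\); a minimal check in that spirit (illustrative only, not the bench harness's actual acceptance logic):

```python
import numpy as np


def frequencies_agree(f_ref, f_other, digits=6):
    """True if two frequency ladders match to `digits` significant digits.

    The small absolute tolerance keeps the rigid-body entry f1(k=0) == 0 from
    failing a purely relative comparison.
    """
    return np.allclose(f_other, f_ref, rtol=10.0 ** (-digits), atol=1e-9)
```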
## Headlines

- PRIMME with OPinv is the right choice for cyclic-modal sweeps. The 2.5× speedup is the JDQMR algorithm correctly amortising the factor across the 11-harmonic ladder via v0-warm-start chaining (`solve_cyclic` passes each harmonic's eigenvectors as the next harmonic's seed; see the sketch after this list).
- The linear backend is a wash on this workload. All three direct solvers (`mumps`, `pardiso`, `mkl_direct`) land within 3 % of each other on both the ARPACK and PRIMME paths — the per-harmonic factor cost is dominated by the same MUMPS / MKL Pardiso symbolic+numeric phases, and the wrapper overhead is in the noise.
- PRIMME memory cost: +1.2 GB (3.6 %) over ARPACK. The block-Davidson workspace + soft-locking buffers add a few hundred MB per harmonic; well worth the 2–2.5× wall reduction.
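A sketch of that warm-start chaining (illustrative only; `solve_harmonic` is a hypothetical per-harmonic eigensolve wrapper, not the real `solve_cyclic` internals):

```python
def sweep_harmonics(solve_harmonic, n_sectors=20, n_modes=4):
    """Walk the harmonic ladder k = 0 .. N/2, seeding each solve with the previous one."""
    results = {}
    v0 = None
    for k in range(n_sectors // 2 + 1):
        vals, vecs = solve_harmonic(k, n_modes, v0=v0)  # v0 is None on the first harmonic
        results[k] = (vals, vecs)
        v0 = vecs  # previous harmonic's eigenvectors seed the next one
    return results
```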
## Why PRIMME wins here but loses on the simple modal path

The simple modal entry point's autotune deliberately picks ARPACK for now (see PBS rotor — solver head-to-head at 8 threads for the full rationale and bench). PRIMME without an OPinv falls over with matvec error -41 on the same 500k-DOF `K`.

The cyclic path sidesteps the issue by always providing an OPinv built from the chosen linear backend (`femorph_solver.solvers.cyclic.solve_cyclic` → line 1482 of `cyclic.py`). PRIMME never has to fall back to its own iterative inner solve, so the matvec error never arises.
Recommendation: install PRIMME (`pip install primme`, or use the femorph-built wheels from `femorph/pprime` documented in Solver defaults and threading) when you run cyclic-modal analyses on rotor decks. The 2.5× speedup at the cost of +3.6 % memory is the best ROI in the solver stack for that workload.
## References

- Bench script: `perf/bench_pbs_cyclic.py`
- Cyclic-symmetry modal API: `femorph_solver.solvers.cyclic.solve_cyclic()`
- PRIMME paper (JDQMR + GD+k restart): Stathopoulos & McCombs (2010), *ACM Trans. Math. Softw.* 37(2), 21:1–21:30.