Performance#

This page summarises where wall-time and memory go in femorph-solver’s modal + static pipelines and how each landing has pushed them down. femorph-solver-only numbers are reproducible from the benchmarks in the perf/ directory; the full changelog (including older rows and a historical snapshot of femorph-solver vs Ansys MAPDL v22.2 captured by the maintainer on a licensed install) lives in PERFORMANCE.md at the repository root. MAPDL is used there as a gold-standard reference only — this repository does not contain code that runs MAPDL.

Hardware and reproducers#

  • CPU: Intel Core i9-14900KF

  • OS: Linux 6.12.66 (NixOS)

  • Python 3.13.9, NumPy + SciPy on OpenBLAS (upstream wheels)

  • OMP_NUM_THREADS=4 for end-to-end runs

For end-to-end rotor wallclock on a production-style mixed-topology deck, see End-to-end PTR modal wallclock below.

Reproducers:

# micro-benchmarks (per-element kernels + plate assembly)
pytest tests/benchmarks --benchmark-only -o python_files=bench_*.py

# end-to-end modal pipeline sweep (femorph-solver only)
PYTHONPATH=. .venv/bin/python perf/bench_pipeline.py

# linear + eigen backend comparison
PYTHONPATH=. .venv/bin/python perf/bench_solvers.py --size 60x60x2

# strain post-processing + OpenMP scaling
OMP_NUM_THREADS=N PYTHONPATH=. .venv/bin/python perf/bench_strain.py

Progression at a glance#

Every row is a landing that moved the modal-pipeline numbers. Micro columns are pytest tests/benchmarks medians on the 40×40×2 plate (15 129 DOF); end-to-end columns come from perf/bench_pipeline.py on the 100×100×2 plate (91 809 DOF).

| # | Landing (date) | Headline change | assemble_k | assemble_m | modal (10 modes) | 40² peak RSS | 100² wall |
|---|---|---|---|---|---|---|---|
| 0 | Baseline (2026-04-20) | pure Python BᵀCB | 619 ms | 420 ms | 1.89 s | — | — |
| 1 | SOLID185 C++ (2026-04-20) | nanobind per-element Ke/Me | 94 ms | 85 ms | 1.34 s | — | — |
| 2 | SOLID186 + 187 C++ (2026-04-21) | all 3D solids in C++ | 88.7 ms | 84.7 ms | 1.29 s | — | — |
| 3 | Batch + vectorised (2026-04-21) | group-batched kernel + COO broadcast | 58.0 ms | 54.0 ms | 918 ms | 411 MB | 13.21 s |
| 4 | CSR-direct C++ (2026-04-21) | build_csr_pattern + scatter_into_csr | 16.2 ms | 12.5 ms | 621 ms | 337 MB | — |
| 5 | Auto-CHOLMOD (2026-04-21) | linear_solver="auto" → supernodal Cholesky | ≈16 ms | ≈13 ms | 247 ms | 289 MB | 2.29 s |
| 6 | Parallel assembly (2026-04-24) | group-by-material-value + OMP scatter + unified C++ pattern | ≈18 ms | ≈20 ms | ≈240 ms | 289 MB | ≈2.2 s |
| | Δ row 5 vs row 0 | | ×39 | ×34 | ×7.7 | | |

After row 5 the modal pipeline is no longer assembly-bound or solver-bound in Python — ~80 % of wall lives inside CHOLMOD factorisation + ARPACK orthogonalisation (BLAS/LAPACK).
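For readers mapping this to code: the post-row-5 hot path is ordinary shift-invert Lanczos. A minimal SciPy sketch of the same solver structure on a toy stiffness/mass pair (generic eigsh, not femorph-solver's solve_modal, which swaps CHOLMOD in as the inner factorisation):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Toy 1D stiffness/mass pair standing in for the assembled K/M.
n = 2000
K = sp.diags(
    [-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
    [-1, 0, 1], format="csc",
)
M = sp.identity(n, format="csc")

# Shift-invert about sigma=0: every ARPACK iteration solves
# (K - sigma * M) x = b, which is where the factorisation time goes.
vals, vecs = eigsh(K, k=10, M=M, sigma=0, which="LM")
freqs_hz = np.sqrt(vals) / (2.0 * np.pi)  # eigenvalues are omega^2
```

The factorisation of (K − σM) happens once; each of ARPACK's iterations is then a triangular solve plus re-orthogonalisation, which is why those two BLAS-heavy stages dominate after row 5.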

Row 6’s headline gains do not show up on the single-material 40×40×2 plate — that mesh is already one element group (all cells share a single mat_id + real_id), so the mat_id fragmentation fix in row 6 is a no-op and the pattern is cached between K and M. The win appears on mixed-topology decks with per-cell mat_id, like the Purdue Transonic Rotor (ptr.cdb — 57 k DOFs, 5166 cells across SOLID185/186/186W/186P/187, one mat_id per cell), measured in Thread scaling on a mixed-topology deck (PTR rotor) below.
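The K/M pattern reuse mentioned above can be sketched with SciPy on a toy 1D mesh (names are illustrative, not femorph-solver's API): because K and M are assembled from the same connectivity, their CSR indptr/indices are identical, so a CSR-direct assembler builds the pattern once and only rewrites the value array.

```python
import numpy as np
import scipy.sparse as sp

# Toy 1D mesh: two 2-DOF elements sharing the middle DOF.
elems = np.array([[0, 1], [1, 2]])           # element -> global DOFs
rows = np.repeat(elems, 2, axis=1).ravel()   # COO row of each Ke entry
cols = np.tile(elems, (1, 2)).ravel()        # COO col of each Ke entry

def assemble(elem_mats):
    # COO -> CSR; SciPy sums duplicate (row, col) entries on conversion,
    # which is exactly the finite-element scatter-add.
    data = np.asarray(elem_mats).ravel()
    return sp.coo_array((data, (rows, cols)), shape=(3, 3)).tocsr()

ke = np.array([[1.0, -1.0], [-1.0, 1.0]])        # element stiffness
me = np.array([[2.0, 1.0], [1.0, 2.0]]) / 6.0    # element mass
K = assemble([ke, ke])
M = assemble([me, me])

# Same connectivity -> identical sparsity, so a CSR-direct assembler
# can build indptr/indices once and reuse them for both matrices.
assert np.array_equal(K.indptr, M.indptr)
assert np.array_equal(K.indices, M.indices)
```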

Head-to-head vs MAPDL (2026-04-21)#

Full per-stage comparison on the SOLID185 plate, 10-mode modal solve, swept from 1 k to ~1.36 M DOF. MAPDL v22.2 runs the same mesh (emitted as a CDB via mapdl_archive) in Docker with -smp on 4 threads.

| Mesh | DOF | femorph-solver wall | MAPDL wall | Speedup | femorph-solver peak | MAPDL peak |
|---|---|---|---|---|---|---|
| 10×10×2 | 1 089 | 0.015 s | 3.68 s | ×250 | 178 MB | 2 112 MB |
| 40×40×2 | 15 129 | 0.249 s | 4.04 s | ×16 | 290 MB | 2 112 MB |
| 100×100×2 | 91 809 | 2.293 s | 6.45 s | ×2.8 | 938 MB | 2 624 MB |
| 150×150×4 | 342 015 | 11.59 s | 19.45 s | ×1.7 | 3 882 MB | 6 011 MB |
| 200×200×4 | 606 015 | 21.52 s | 33.87 s | ×1.6 | 7 002 MB | 10 019 MB |
| 300×300×4 | 1 359 015 | 52.64 s | 85.85 s | ×1.6 | 16 065 MB | 19 851 MB |

Observations#

  • Wall-time: femorph-solver finishes below MAPDL across the sweep on this benchmark. The ratio settles at ~×1.6 from 200 k DOF upward — both codes are then bound by the BLAS cost of sparse Cholesky factorisation, so further narrowing stops.

  • Assembly: femorph-solver’s K+M takes 3.2×–6.0× less wall-time than MAPDL’s element-matrix CP block here. At 100×100×2 femorph-solver is 0.32 s vs MAPDL 1.91 s; at 300×300×4 it’s 5.33 s vs 17.05 s.

  • Eigensolve: femorph-solver’s shift-invert runs 1.37×–1.72× shorter, and the gap widens with N. CHOLMOD’s AMD ordering produces progressively less fill than MAPDL’s LANB block-Lanczos factor on these regular meshes.

  • Peak memory: femorph-solver’s RSS is 0.08× MAPDL’s at 10×10×2 rising to 0.81× at 300×300×4. MAPDL carries a ~2.1 GB floor of COMMON blocks + scratch; as N grows both codes converge on “factor fill-in”.

  • First-frequency agreement drifts with mesh refinement up to 100×100×2, then converges again at nz=4. Δf₁ stays < 2 % for well-meshed cases; the elevated 6.8 % at 100×100×2 is a single-element-thick bending-dominated plate physics artefact (different ESF / shear defaults between the two SOLID185 formulations), not a solver bug.

Thread scaling on a mixed-topology deck (PTR rotor)#

The Purdue Transonic Rotor (ptr.cdb — 18 954 nodes, 5 166 cells, 56 862 DOFs) exercises every assembly path the flat plate does not: four element types in one mesh (3069 SOLID186 hex, 497 SOLID186 pyramid, 19 SOLID186 wedge, 1581 SOLID187 tet), per-cell mat_id (1..5166) all pointing at the same 304-stainless material, and a real-world factor that hits the scipy multi-group pattern fallback.

Measurements below are cold stiffness_matrix / mass_matrix calls on an Intel Core i9-14900KF, pattern cache invalidated before each run. The “baseline (main)” column is main immediately before the T4-P11 landing — after the #102 / #103 / #104 perf series already in place — so the delta reflects only the T4-P11 work, not the earlier memory-reduction chain.

| Stage / thread count | T=1 | T=4 | T=8 | baseline (main, T=1/4/8) | Δ @ T=8 |
|---|---|---|---|---|---|
| stiffness_matrix (cold) | 322 ms | 182 ms | 127 ms | 648 / 440 / 364 ms | ×2.9 |
| mass_matrix (cold) | 228 ms | 152 ms | 83 ms | 342 / 305 / 227 ms | ×2.7 |

Scaling from T=1 to T=8 on the branch is ×2.5 on K and ×2.7 on M — bounded by the pattern build (which is partially serial in Step A: the row-to-element inverse index) and the memory-bandwidth ceiling on the CSR scatter. Further wins would require either a parallel Step A, larger meshes that amortise OMP region overhead better, or a cache-tiled scatter.
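As a rough sanity check on "partially serial", here is an Amdahl's-law fit to the measured K-assembly numbers. This is illustrative arithmetic only; it ignores the memory-bandwidth ceiling, so the fitted parallel fraction is a lower bound on the truly parallelisable work.

```python
# Amdahl fit: speedup(T) = 1 / ((1 - p) + p / T).  Solve for the
# parallel fraction p from the measured stiffness_matrix scaling.
t1_ms, t8_ms, T = 322.0, 127.0, 8
S = t1_ms / t8_ms                      # observed speedup, ~x2.5
p = (1 - 1 / S) / (1 - 1 / T)          # fitted parallel fraction
ceiling = 1 / (1 - p)                  # implied infinite-thread limit
print(f"p ~ {p:.2f}, ceiling ~ x{ceiling:.1f}")   # p ~ 0.69, ceiling ~ x3.2
```

A ~69 % parallel fraction caps the achievable speedup near ×3.2 regardless of thread count, which is consistent with the serial Step A plus scatter-bandwidth explanation above.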

Three independent fixes compose:

  1. Group by material value, not mat_id. Keying Model._assemble’s per-group loop on the hashed material tuple collapses 5166 singleton groups (one per mat_id) back down to the 4 real element-type groups, restoring ke_batch / me_batch batching on decks that assign a unique mat_id per cell. ~2× on its own, single-threaded.

  2. Unified C++ pattern builder for mixed-topology meshes. _build_csr_pattern pads each group’s DOF array to the max ndof_per_elem with -1 sentinels and dispatches to the single-group C++ builder, which already ignores -1 entries. Eliminates the scipy coo_array tocsr fallback that had been materialising ~1.5 M COO entries and deduplicating them Python-side.

  3. OpenMP on the hot loops. scatter_into_csr is element-parallel with #pragma omp atomic on shared CSR slots. build_csr_pattern Steps B and C run row-parallel with a per-thread marker buffer (classic FE row-colouring). Numerically bit-exact against the serial build on PTR — atomic adds preserve the output despite non-deterministic ordering.
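A toy sketch of fix 1's grouping idea (cell tuples and material values here are illustrative, not femorph-solver's internal API):

```python
from collections import defaultdict

# Illustrative cells: per-cell mat_ids that all resolve to the same
# steel, the pathology the PTR deck exhibits.
cells = [
    (0, "SOLID186_hex", 1),
    (1, "SOLID186_hex", 2),   # distinct mat_id ...
    (2, "SOLID187_tet", 3),
]
materials = {                 # ... but identical (E, nu, rho)
    1: (200e9, 0.3, 7850.0),
    2: (200e9, 0.3, 7850.0),
    3: (200e9, 0.3, 7850.0),
}

groups = defaultdict(list)
for cell, etype, mat_id in cells:
    # Key on the hashable material *value*, not the mat_id: cells with
    # identical properties merge into one batchable group per etype.
    groups[(etype, materials[mat_id])].append(cell)

# 2 groups (one per element type) instead of 3 singleton mat_id groups,
# so the batched Ke/Me kernels see large contiguous batches again.
assert len(groups) == 2
```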

Reproducer:

pytest tests/benchmarks/bench_assembly.py \
    --benchmark-only -o python_files=bench_*.py \
    -k 'ptr_thread_scaling'

Strain post-processing (Model.eel)#

The elastic-strain feature (MAPDL S,EPEL equivalent) evaluates ε(ξ_node) = B(ξ_node) · u_e per element and optionally averages to the nodes (PLNSOL style). The batch kernel is OpenMP-parallel across elements.
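The element-batched evaluation can be sketched in NumPy. The B matrix below is a random placeholder with SOLID185's 6 × 24 shape, not the real strain-displacement matrix; the point is the per-element contraction that the C++ kernel distributes across OpenMP threads.

```python
import numpy as np

# Batched eps = B @ u_e over all elements at once.
rng = np.random.default_rng(0)
n_elem, ndof = 20_000, 24              # hex: 8 nodes x 3 DOF each
B = rng.normal(size=(6, ndof))         # placeholder B(xi_node)
U = rng.normal(size=(n_elem, ndof))    # gathered element displacements

eps = np.einsum("ij,ej->ei", B, U)     # (n_elem, 6) strain components
assert eps.shape == (n_elem, 6)
```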

Raw C++ strain kernel scaling (SOLID185 hex plate, median of 3 runs after warm-up):

| Mesh | n_elem | T=1 | T=2 | T=4 | T=8 | T=16 | T=16 vs T=1 |
|---|---|---|---|---|---|---|---|
| 50×50×2 | 5 000 | 1.5 ms | 0.9 ms | 0.5 ms | 0.3 ms | 0.1 ms | ×15 |
| 100×100×2 | 20 000 | 5.2 ms | 2.7 ms | 2.0 ms | 1.0 ms | 0.5 ms | ×10 |
| 150×150×4 | 90 000 | 32.4 ms | 16.9 ms | 9.6 ms | 6.4 ms | 4.3 ms | ×7.5 |
| 200×200×4 | 160 000 | 58.0 ms | 31.0 ms | 18.2 ms | 14.1 ms | 7.8 ms | ×7.4 |

Full Model.eel wall (includes Python-side nodal averaging via np.searchsorted + np.add.at):

| Mesh | n_dof | T=1 avg | T=2 avg | T=4 avg | T=8 avg | T=16 avg |
|---|---|---|---|---|---|---|
| 50×50×2 | 23 409 | 9.1 ms | 8.4 ms | 8.2 ms | 8.0 ms | 8.0 ms |
| 100×100×2 | 91 809 | 36.7 ms | 34.9 ms | 33.2 ms | 32.4 ms | 33.4 ms |
| 150×150×4 | 342 015 | 169.9 ms | 150.9 ms | 145.0 ms | 142.6 ms | 143.0 ms |
| 200×200×4 | 606 015 | 308.9 ms | 287.2 ms | 280.5 ms | 285.9 ms | 292.5 ms |

Threads help the strain kernel itself (~×7 at T=16 on 200×200×4) but don’t yet move the needle on Model.eel end-to-end — at T=1 on 200×200×4 the C++ batch is 58 ms out of 309 ms total, so the remaining ~250 ms lives in the unparallelised Python scatter/average. Vectorising or moving the nodal-average reduction to C++ is the next opportunity.
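The Python-side scatter/average referred to above can be sketched as follows (toy connectivity; the real Model.eel additionally maps unsorted node IDs via np.searchsorted):

```python
import numpy as np

# Scatter per-element nodal values into per-node accumulators with
# np.add.at, then divide by the number of touching elements.
conn = np.array([[0, 1, 2], [1, 2, 3]])   # 2 elements, 3 nodes each
eel = np.array([[1.0, 1.0, 1.0],          # one strain component
                [3.0, 3.0, 3.0]])         # per element-node
n_nodes = 4

acc = np.zeros(n_nodes)
count = np.zeros(n_nodes)
np.add.at(acc, conn.ravel(), eel.ravel())   # unbuffered scatter-add
np.add.at(count, conn.ravel(), 1.0)
nodal = acc / count
# Nodes 1 and 2 are shared by both elements: (1 + 3) / 2 = 2.0
print(nodal)   # [1. 2. 2. 3.]
```

np.add.at is unbuffered, so repeated node indices accumulate correctly, but that same property makes it slow relative to a compiled reduction, which is why this step dominates the end-to-end Model.eel wall.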

End-to-end PTR modal wallclock#

The Purdue Transonic Rotor represents the production-style modal workload: 57 k DOFs, 5166 cells, mixed SOLID185/186/186W/186P/187 topology, per-cell mat_id (5166 distinct IDs all pointing at the same 304 stainless material). Running solve_modal(..., n_modes=12, eigen_solver="arpack", linear_solver="auto") at OMP_NUM_THREADS=4 on the hardware above:

| Stack tip | Modal wallclock | Landings in play |
|---|---|---|
| 2026-04-23 morning | ~14.6 s | Pre-Pardiso-autoresolver; SuperLU fallback. |
| 2026-04-23 evening | ~4.6 s | Pardiso autoresolved; pre-T4-P11. |
| 2026-04-24 (current) | ~3.0 s | #110 (parallel assembly) + #112 (Pardiso ia/ja cache) + #87/#88/#95/#97/#98/#109 memory chain. |

The per-stage breakdown at 2026-04-24 is ~0.5 s on the assembly + BC-reduce side (down from ~1.5 s pre-#110), ~1 s on the Pardiso factor, and ~1.5 s across ARPACK’s ~40 shift-invert iterations. The remaining lever is the per-iteration cost; see the open “perf” PRs at any given time for incremental work against it.

Reproducer:

pytest tests/benchmarks/bench_solve.py::test_modal_solve_ptr_12_modes \
    --benchmark-only -o python_files=bench_*.py

PR #101 → #116 progression, keyed by commit hash#

For developers auditing where each landing moved the needle.

The table below is generated by walking every merge commit on main between PR #101 (the post-#94/#97/#98 memory-push snapshot) and PR #116 (pardiso _check_A / _check_b skip), rebuilding the native extension at each step via uv pip install --no-build-isolation -e . and re-running the pipeline bench (3 mesh sizes, modal solve, OMP_NUM_THREADS=4, P-core-pinned). The walker lives at perf/walk_history.py; each per-commit JSON sits at perf/snapshots/walk_<hash>.json so the progression can be regenerated without re-walking history.

What to read. Wall-time and peak RSS are per-size. Δrel vs #101 is the relative deviation of the first natural frequency against the earliest-commit value — the column that matters for the recurring “did accuracy get worse?” question.
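Concretely, assuming the walker computes the column as a plain relative deviation (a reasonable reading of the tables, not a verified excerpt of perf/walk_history.py):

```python
def delta_rel(f1: float, f1_baseline: float) -> float:
    # Relative deviation of the first natural frequency against the
    # earliest-commit (#101) value.
    return abs(f1 - f1_baseline) / abs(f1_baseline)

# The #102 row of the 80x80x2 table reproduces from its f1 values:
d = delta_rel(10.14569037717, 10.14569036801)
print(f"{d:.1e}")   # 9.0e-10
```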

80×80×2 (215 k DOFs) — where the progression is legible

| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 3.87 | 521.7 | 10.14569036801 | 0 |
| #102 | df08aba | 2.72 | 512.1 | 10.14569037717 | 9.0e-10 |
| #103 | 4c32785 | 2.75 | 491.4 | 10.14569037486 | 6.8e-10 |
| #104 | 5d0b452 | 2.61 | 489.7 | 10.14569036680 | 1.2e-10 |
| #109 | 1787e1e | 2.61 | 489.2 | 10.14569037048 | 2.4e-10 |
| #110 | ae29ac6 | 3.29 | 489.9 | 10.14569038142 | 1.3e-9 |
| #112 | 87e6f82 | 2.26 | 483.6 | 10.14569037848 | 1.0e-9 |
| #115 | 304cb89 | 2.52 | 483.5 | 10.14569037469 | 6.6e-10 |
| #116 | 1b3e144 | 2.48 | 483.0 | 10.14569036991 | 1.9e-10 |
| #113 | 8652ace | 2.57 | 483.0 | 10.14569036848 | 4.6e-11 |

48×48×2 (78 k DOFs)

| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 0.76 | 304.2 | 13.19744738477 | 0 |
| #102 | df08aba | 0.92 | 300.8 | 13.19744738477 | 6.1e-15 |
| #103 | 4c32785 | 0.88 | 300.7 | 13.19744738477 | 7.0e-15 |
| #104 | 5d0b452 | 0.66 | 297.7 | 13.19744738477 | 6.2e-15 |
| #109 | 1787e1e | 0.69 | 297.4 | 13.19744738477 | 4.0e-15 |
| #110 | ae29ac6 | 0.66 | 296.9 | 13.19744737985 | 3.7e-10 |
| #112 | 87e6f82 | 0.80 | 298.3 | 13.19744738396 | 6.2e-11 |
| #115 | 304cb89 | 0.84 | 297.9 | 13.19744738829 | 2.7e-10 |
| #116 | 1b3e144 | 0.81 | 297.3 | 13.19744738594 | 8.9e-11 |
| #113 | 8652ace | 0.73 | 297.8 | 13.19744738653 | 1.3e-10 |

32×32×2 (35 k DOFs) — Δrel dominated by machine precision

At this size the first mode converges fully; Δrel stays below 1e-9 and the jumps align cleanly with ARPACK’s stochastic convergence floor. See perf/snapshots/walk_*.json for the full table.

Ordered by magnitude of wall-time or RSS delta at 80×80×2:

  • #102 (refactor(model): grid as sole source of truth) — wall 3.87 → 2.72 s (−30 %), RSS 521 → 512 MB. Eliminates parallel _nodes / _elements dicts that mirrored the pyvista grid. The dict→grid refactor removed one O(n_nodes) Python walk per assembly call.

  • #103 (pardiso size_limit_storage=0) — RSS 512 → 491 MB (−4 %). Stops pypardiso’s internal full-A.copy() cache.

  • #104 (malloc_trim(0) at BC boundary) — wall 2.75 → 2.61 s (−5 %). Returns glibc arena scratch between assembly and factor. Bigger wins at larger scales (>100 k DOFs) where the element-batch transient is larger.

  • #109 (drop DOF layout caches) — no measurable change at 80×80×2. Scales with node count; modeled saving was ~0.6 KB per node, dominated by the Pardiso factor ceiling at this mesh.

  • #110 (parallel C++ K/M assembly) — wall 2.61 → 3.29 s at this size. Caveat: scikit-build-core does not always re-compile the C extension between walker checkouts (build cached at ~0.9 s); the number above may reflect the last-built .so rather than #110’s actual parallel path. On PTR rotor-scale meshes (mixed-topology) the PR’s own numbers show 6× speedup at 8 threads. Re-run the walker after rm -rf build/ for a clean per-commit measurement.

  • #112 (pardiso 1-based ia/ja cached) — wall 3.29 → 2.26 s (−31 % vs the just-prior sample). Eliminates ~65 MB of alloc/free per solve × ~30 solves per modal_solve.

  • #115 (vectorize _assemble layout lookup) — measured as slightly slower at 80×80×2 here; the win shows cleanly in the assembly-only bench (_bc_reduce_and_release() is 25-40 % faster at mid-ladder per the PR #115 measurements) but the 3-second end-to-end pipeline is dominated by Pardiso and eigsh.

  • #116 (pardiso skip _check_A/_check_b) — modest; removes per-solve O(n) indptr scan. Wins scale with solve count.

  • #113 (single-pass symmetric matvec from triu CSR) — wall at this size is ~flat vs #116; on PTR rotor-scale where M @ x dominates eigsh iteration cost the PR reports clearer wins.

Did accuracy get worse? No. First-frequency Δrel vs the #101 baseline stays in the 1e-15 to 1e-9 range across the whole walk, well below the roughly 1e-7 of stochastic first-mode drift that ARPACK’s tol=1e-12 convergence floor produces between independent solver invocations on a plate modal problem. The jumps visible at individual commits (for example #110 at 32×32×2 moving from 8e-16 to 4.5e-11) are ARPACK’s iteration count shifting by one or two, not a numerical regression; the Δrel value oscillates rather than drifting monotonically toward larger error.

If an MAPDL-comparison snapshot (see PERFORMANCE.md) shows a larger Δrel at the top of its size ladder — for example 2e-7 at 224×224×2 — that is the ARPACK-vs-Lanczos gap against the MAPDL reference, not a progression regression: the absolute value sits at the ARPACK tolerance floor for every post-#101 commit on the walker above.

# Walk the default commit set (above) and regenerate snapshots.
python perf/walk_history.py

# Narrower walk, larger mesh ladder.
python perf/walk_history.py \
    --commits 5d0b452 ae29ac6 87e6f82 1b3e144 \
    --sizes 64x64x2 96x96x2 128x128x2

Each commit writes perf/snapshots/walk_<hash>.json (skipped if already present, so reruns are incremental). The walker auto-stashes local edits and restores the starting branch on exit, so it’s safe to run from any branch.
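Regenerating a progression table from those cached snapshots is then a matter of loading each JSON and recomputing Δrel against the earliest record. In the sketch below, two in-memory records stand in for json.loads of the real files, and the field names are assumptions; check an actual walk_<hash>.json for the walker's real schema.

```python
# Stand-in records in place of
# json.loads(Path("perf/snapshots/walk_<hash>.json").read_text());
# the "pr" / "commit" / "wall_s" / "f1_hz" keys are assumptions.
snapshots = [
    {"pr": "#101", "commit": "ee9645a", "wall_s": 3.87, "f1_hz": 10.14569036801},
    {"pr": "#112", "commit": "87e6f82", "wall_s": 2.26, "f1_hz": 10.14569037848},
]

base = snapshots[0]["f1_hz"]
lines = [
    f'{s["pr"]}  {s["commit"]}  {s["wall_s"]:5.2f} s  '
    f'drel={abs(s["f1_hz"] - base) / base:.1e}'
    for s in snapshots
]
print("\n".join(lines))
```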

The PERFORMANCE.md changelog#

This page is a curated user-facing view. For the full dated progression of every landing — including per-element micro numbers, older pipeline sweeps, and the exact commit SHAs — see PERFORMANCE.md in the repository root.