Performance#

This page summarises where wall-time and memory go in femorph-solver’s modal + static pipelines and how each landing has pushed them down. femorph-solver-only numbers are reproducible from the benchmarks in the perf/ directory; the full changelog (including older rows and a historical snapshot of femorph-solver vs Ansys MAPDL v22.2 captured by the maintainer on a licensed install) lives in PERFORMANCE.md at the repository root. MAPDL is used there as a gold-standard reference only — this repository does not contain code that runs MAPDL.

Hardware and reproducers#

  • CPU: Intel Core i9-14900KF

  • OS: Linux 6.12.66 (NixOS)

  • Python 3.13.9, NumPy + SciPy on OpenBLAS (upstream wheels)

  • OMP_NUM_THREADS=4 for end-to-end runs

For end-to-end rotor wallclock on a production-style mixed-topology deck, see End-to-end PTR modal wallclock below.

Reproducers:

# micro-benchmarks (per-element kernels + plate assembly)
pytest tests/benchmarks --benchmark-only -o python_files=bench_*.py

# end-to-end modal pipeline sweep (femorph-solver only)
PYTHONPATH=. .venv/bin/python perf/bench_pipeline.py

# linear + eigen backend comparison
PYTHONPATH=. .venv/bin/python perf/bench_solvers.py --size 60x60x2

# strain post-processing + OpenMP scaling
OMP_NUM_THREADS=N PYTHONPATH=. .venv/bin/python perf/bench_strain.py

Progression at a glance#

Every row is a landing that moved the modal-pipeline numbers. Micro columns are pytest tests/benchmarks medians on the 40×40×2 plate (15 129 DOF); end-to-end columns come from perf/bench_pipeline.py on the 100×100×2 plate (91 809 DOF).

| # | Landing (date) | Headline change | assemble_k | assemble_m | modal (10 modes) | 40² peak RSS | 100² wall |
|---|---|---|---|---|---|---|---|
| 0 | Baseline (2026-04-20) | pure Python BᵀCB | 619 ms | 420 ms | 1.89 s | — | — |
| 1 | SOLID185 C++ (2026-04-20) | nanobind per-element Ke/Me | 94 ms | 85 ms | 1.34 s | — | — |
| 2 | SOLID186 + 187 C++ (2026-04-21) | all 3D solids in C++ | 88.7 ms | 84.7 ms | 1.29 s | — | — |
| 3 | Batch + vectorised (2026-04-21) | group-batched kernel + COO broadcast | 58.0 ms | 54.0 ms | 918 ms | 411 MB | 13.21 s |
| 4 | CSR-direct C++ (2026-04-21) | build_csr_pattern + scatter_into_csr | 16.2 ms | 12.5 ms | 621 ms | 337 MB | — |
| 5 | Auto-CHOLMOD (2026-04-21) | linear_solver="auto" → supernodal Cholesky | ≈16 ms | ≈13 ms | 247 ms | 289 MB | 2.29 s |
| 6 | Parallel assembly (2026-04-24) | group-by-material-value + OMP scatter + unified C++ pattern | ≈18 ms | ≈20 ms | ≈240 ms | 289 MB | ≈2.2 s |
| | Δ row 5 vs row 0 | | ×39 | ×34 | ×7.7 | | |

After row 5 the modal pipeline is no longer assembly-bound or solver-bound in Python — ~80 % of wall lives inside CHOLMOD factorisation + ARPACK orthogonalisation (BLAS/LAPACK).
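For readers mapping this to code: the post-row-5 hot path is ordinary shift-invert Lanczos. A minimal SciPy sketch of the same solver structure on a toy stiffness/mass pair (generic eigsh, not femorph-solver's solve_modal, which swaps CHOLMOD in as the inner factorisation):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Toy 1D stiffness/mass pair standing in for the assembled K/M.
n = 2000
K = sp.diags(
    [-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
    [-1, 0, 1], format="csc",
)
M = sp.identity(n, format="csc")

# Shift-invert about sigma=0: every ARPACK iteration solves
# (K - sigma * M) x = b, which is where the factorisation time goes.
vals, vecs = eigsh(K, k=10, M=M, sigma=0, which="LM")
freqs_hz = np.sqrt(vals) / (2.0 * np.pi)  # eigenvalues are omega^2
```

The factorisation of (K − σM) happens once; each of ARPACK's iterations is then a triangular solve plus re-orthogonalisation, which is why those two BLAS-heavy stages dominate after row 5.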

Row 6’s headline gains do not show up on the single-material 40×40×2 plate — that mesh is already one element group (all cells share a single mat_id + real_id), so the mat_id fragmentation fix in row 6 is a no-op and the pattern is cached between K and M. The win appears on mixed-topology decks with per-cell mat_id, like the Purdue Transonic Rotor (ptr.cdb — 57 k DOFs, 5166 cells across SOLID185/186/186W/186P/187, one mat_id per cell), measured in Thread scaling on a mixed-topology deck (PTR rotor) below.
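The K/M pattern reuse mentioned above can be sketched with SciPy on a toy 1D mesh (names are illustrative, not femorph-solver's API): because K and M are assembled from the same connectivity, their CSR indptr/indices are identical, so a CSR-direct assembler builds the pattern once and only rewrites the value array.

```python
import numpy as np
import scipy.sparse as sp

# Toy 1D mesh: two 2-DOF elements sharing the middle DOF.
elems = np.array([[0, 1], [1, 2]])           # element -> global DOFs
rows = np.repeat(elems, 2, axis=1).ravel()   # COO row of each Ke entry
cols = np.tile(elems, (1, 2)).ravel()        # COO col of each Ke entry

def assemble(elem_mats):
    # COO -> CSR; SciPy sums duplicate (row, col) entries on conversion,
    # which is exactly the finite-element scatter-add.
    data = np.asarray(elem_mats).ravel()
    return sp.coo_array((data, (rows, cols)), shape=(3, 3)).tocsr()

ke = np.array([[1.0, -1.0], [-1.0, 1.0]])        # element stiffness
me = np.array([[2.0, 1.0], [1.0, 2.0]]) / 6.0    # element mass
K = assemble([ke, ke])
M = assemble([me, me])

# Same connectivity -> identical sparsity, so a CSR-direct assembler
# can build indptr/indices once and reuse them for both matrices.
assert np.array_equal(K.indptr, M.indptr)
assert np.array_equal(K.indices, M.indices)
```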

Head-to-head vs MAPDL (2026-04-21)#

Full per-stage comparison on the SOLID185 plate, 10-mode modal solve, swept from 1 k to ~1.36 M DOF. MAPDL v22.2 runs the same mesh (emitted as a CDB via mapdl_archive) in Docker with -smp on 4 threads.

| Mesh | DOF | femorph-solver wall | MAPDL wall | Speedup | femorph-solver peak | MAPDL peak |
|---|---|---|---|---|---|---|
| 10×10×2 | 1 089 | 0.015 s | 3.68 s | ×250 | 178 MB | 2 112 MB |
| 40×40×2 | 15 129 | 0.249 s | 4.04 s | ×16 | 290 MB | 2 112 MB |
| 100×100×2 | 91 809 | 2.293 s | 6.45 s | ×2.8 | 938 MB | 2 624 MB |
| 150×150×4 | 342 015 | 11.59 s | 19.45 s | ×1.7 | 3 882 MB | 6 011 MB |
| 200×200×4 | 606 015 | 21.52 s | 33.87 s | ×1.6 | 7 002 MB | 10 019 MB |
| 300×300×4 | 1 359 015 | 52.64 s | 85.85 s | ×1.6 | 16 065 MB | 19 851 MB |

Observations#

  • Wall-time: femorph-solver finishes below MAPDL across the sweep on this benchmark. The ratio settles at ~×1.6 from 200 k DOF upward — both codes are then bound by the BLAS cost of sparse Cholesky factorisation, so further narrowing stops.

  • Assembly: femorph-solver’s K+M takes 3.2×–6.0× less wall-time than MAPDL’s element-matrix CP block here. At 100×100×2 femorph-solver is 0.32 s vs MAPDL 1.91 s; at 300×300×4 it’s 5.33 s vs 17.05 s.

  • Eigensolve: femorph-solver’s shift-invert runs 1.37×–1.72× shorter, and the gap widens with N. CHOLMOD’s AMD ordering produces progressively less fill than MAPDL’s LANB block-Lanczos factor on these regular meshes.

  • Peak memory: femorph-solver’s RSS is 0.08× MAPDL’s at 10×10×2 rising to 0.81× at 300×300×4. MAPDL carries a ~2.1 GB floor of COMMON blocks + scratch; as N grows both codes converge on “factor fill-in”.

  • First-frequency agreement drifts with mesh refinement up to 100×100×2, then converges again at nz=4. Δf₁ stays < 2 % for well-meshed cases; the elevated 6.8 % at 100×100×2 is a single-element-thick bending-dominated plate physics artefact (different ESF / shear defaults between the two SOLID185 formulations), not a solver bug.

Thread scaling on a mixed-topology deck (PTR rotor)#

The Purdue Transonic Rotor (ptr.cdb — 18 954 nodes, 5 166 cells, 56 862 DOFs) exercises every assembly path the flat plate does not: four element types in one mesh (3069 SOLID186 hex, 497 SOLID186 pyramid, 19 SOLID186 wedge, 1581 SOLID187 tet), per-cell mat_id (1..5166) all pointing at the same 304-stainless material, and a real-world factor that hits the scipy multi-group pattern fallback.

Measurements below are cold stiffness_matrix / mass_matrix calls on an Intel Core i9-14900KF, pattern cache invalidated before each run. The “baseline (main)” column is main immediately before the T4-P11 landing — after the #102 / #103 / #104 perf series already in place — so the delta reflects only the T4-P11 work, not the earlier memory-reduction chain.

| Stage / thread count | T=1 | T=4 | T=8 | baseline (main, T=1/4/8) | Δ @ T=8 |
|---|---|---|---|---|---|
| stiffness_matrix (cold) | 322 ms | 182 ms | 127 ms | 648 / 440 / 364 ms | ×2.9 |
| mass_matrix (cold) | 228 ms | 152 ms | 83 ms | 342 / 305 / 227 ms | ×2.7 |

Scaling from T=1 to T=8 on the branch is ×2.5 on K and ×2.7 on M — bounded by the pattern build (which is partially serial in Step A: the row-to-element inverse index) and the memory-bandwidth ceiling on the CSR scatter. Further wins would require either a parallel Step A, larger meshes that amortise OMP region overhead better, or a cache-tiled scatter.
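As a rough sanity check on "partially serial", here is an Amdahl's-law fit to the measured K-assembly numbers. This is illustrative arithmetic only; it ignores the memory-bandwidth ceiling, so the fitted parallel fraction is a lower bound on the truly parallelisable work.

```python
# Amdahl fit: speedup(T) = 1 / ((1 - p) + p / T).  Solve for the
# parallel fraction p from the measured stiffness_matrix scaling.
t1_ms, t8_ms, T = 322.0, 127.0, 8
S = t1_ms / t8_ms                      # observed speedup, ~x2.5
p = (1 - 1 / S) / (1 - 1 / T)          # fitted parallel fraction
ceiling = 1 / (1 - p)                  # implied infinite-thread limit
print(f"p ~ {p:.2f}, ceiling ~ x{ceiling:.1f}")   # p ~ 0.69, ceiling ~ x3.2
```

A ~69 % parallel fraction caps the achievable speedup near ×3.2 regardless of thread count, which is consistent with the serial Step A plus scatter-bandwidth explanation above.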

Three independent fixes compose:

  1. Group by material value, not mat_id. Keying Model._assemble’s per-group loop on the hashed material tuple collapses 5166 singleton groups (one per mat_id) back down to the 4 real element-type groups, restoring ke_batch / me_batch batching on decks that assign a unique mat_id per cell. ~2× on its own, single-threaded.

  2. Unified C++ pattern builder for mixed-topology meshes. _build_csr_pattern pads each group’s DOF array to the max ndof_per_elem with -1 sentinels and dispatches to the single-group C++ builder, which already ignores -1 entries. Eliminates the scipy coo_array tocsr fallback that had been materialising ~1.5 M COO entries and deduplicating them Python-side.

  3. OpenMP on the hot loops. scatter_into_csr is element-parallel with #pragma omp atomic on shared CSR slots. build_csr_pattern Steps B and C run row-parallel with a per-thread marker buffer (classic FE row-colouring). Numerically bit-exact against the serial build on PTR — atomic adds preserve the output despite non-deterministic ordering.
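A toy sketch of fix 1's grouping idea (cell tuples and material values here are illustrative, not femorph-solver's internal API):

```python
from collections import defaultdict

# Illustrative cells: per-cell mat_ids that all resolve to the same
# steel, the pathology the PTR deck exhibits.
cells = [
    (0, "SOLID186_hex", 1),
    (1, "SOLID186_hex", 2),   # distinct mat_id ...
    (2, "SOLID187_tet", 3),
]
materials = {                 # ... but identical (E, nu, rho)
    1: (200e9, 0.3, 7850.0),
    2: (200e9, 0.3, 7850.0),
    3: (200e9, 0.3, 7850.0),
}

groups = defaultdict(list)
for cell, etype, mat_id in cells:
    # Key on the hashable material *value*, not the mat_id: cells with
    # identical properties merge into one batchable group per etype.
    groups[(etype, materials[mat_id])].append(cell)

# 2 groups (one per element type) instead of 3 singleton mat_id groups,
# so the batched Ke/Me kernels see large contiguous batches again.
assert len(groups) == 2
```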

Reproducer:

pytest tests/benchmarks/bench_assembly.py \
    --benchmark-only -o python_files=bench_*.py \
    -k 'ptr_thread_scaling'

Strain post-processing (Model.eel)#

The elastic-strain feature (MAPDL S,EPEL equivalent) evaluates ε(ξ_node) = B(ξ_node) · u_e per element and optionally averages to the nodes (PLNSOL style). The batch kernel is OpenMP-parallel across elements.
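The element-batched evaluation can be sketched in NumPy. The B matrix below is a random placeholder with SOLID185's 6 × 24 shape, not the real strain-displacement matrix; the point is the per-element contraction that the C++ kernel distributes across OpenMP threads.

```python
import numpy as np

# Batched eps = B @ u_e over all elements at once.
rng = np.random.default_rng(0)
n_elem, ndof = 20_000, 24              # hex: 8 nodes x 3 DOF each
B = rng.normal(size=(6, ndof))         # placeholder B(xi_node)
U = rng.normal(size=(n_elem, ndof))    # gathered element displacements

eps = np.einsum("ij,ej->ei", B, U)     # (n_elem, 6) strain components
assert eps.shape == (n_elem, 6)
```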

Raw C++ strain kernel scaling (SOLID185 hex plate, median of 3 runs after warm-up):

| Mesh | n_elem | T=1 | T=2 | T=4 | T=8 | T=16 | T=16 vs T=1 |
|---|---|---|---|---|---|---|---|
| 50×50×2 | 5 000 | 1.5 ms | 0.9 ms | 0.5 ms | 0.3 ms | 0.1 ms | ×15 |
| 100×100×2 | 20 000 | 5.2 ms | 2.7 ms | 2.0 ms | 1.0 ms | 0.5 ms | ×10 |
| 150×150×4 | 90 000 | 32.4 ms | 16.9 ms | 9.6 ms | 6.4 ms | 4.3 ms | ×7.5 |
| 200×200×4 | 160 000 | 58.0 ms | 31.0 ms | 18.2 ms | 14.1 ms | 7.8 ms | ×7.4 |

Full Model.eel wall (includes Python-side nodal averaging via np.searchsorted + np.add.at):

| Mesh | n_dof | T=1 avg | T=2 avg | T=4 avg | T=8 avg | T=16 avg |
|---|---|---|---|---|---|---|
| 50×50×2 | 23 409 | 9.1 ms | 8.4 ms | 8.2 ms | 8.0 ms | 8.0 ms |
| 100×100×2 | 91 809 | 36.7 ms | 34.9 ms | 33.2 ms | 32.4 ms | 33.4 ms |
| 150×150×4 | 342 015 | 169.9 ms | 150.9 ms | 145.0 ms | 142.6 ms | 143.0 ms |
| 200×200×4 | 606 015 | 308.9 ms | 287.2 ms | 280.5 ms | 285.9 ms | 292.5 ms |

Threads help the strain kernel itself (~×7 at T=16 on 200×200×4) but don’t yet move the needle on Model.eel end-to-end — at T=1 on 200×200×4 the C++ batch is 58 ms out of 309 ms total, so the remaining ~250 ms lives in the unparallelised Python scatter/average. Vectorising or moving the nodal-average reduction to C++ is the next opportunity.
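The Python-side scatter/average referred to above can be sketched as follows (toy connectivity; the real Model.eel additionally maps unsorted node IDs via np.searchsorted):

```python
import numpy as np

# Scatter per-element nodal values into per-node accumulators with
# np.add.at, then divide by the number of touching elements.
conn = np.array([[0, 1, 2], [1, 2, 3]])   # 2 elements, 3 nodes each
eel = np.array([[1.0, 1.0, 1.0],          # one strain component
                [3.0, 3.0, 3.0]])         # per element-node
n_nodes = 4

acc = np.zeros(n_nodes)
count = np.zeros(n_nodes)
np.add.at(acc, conn.ravel(), eel.ravel())   # unbuffered scatter-add
np.add.at(count, conn.ravel(), 1.0)
nodal = acc / count
# Nodes 1 and 2 are shared by both elements: (1 + 3) / 2 = 2.0
print(nodal)   # [1. 2. 2. 3.]
```

np.add.at is unbuffered, so repeated node indices accumulate correctly, but that same property makes it slow relative to a compiled reduction, which is why this step dominates the end-to-end Model.eel wall.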

End-to-end PTR modal wallclock#

The Purdue Transonic Rotor represents the production-style modal workload: 57 k DOFs, 5166 cells, mixed SOLID185/186/186W/186P/187 topology, per-cell mat_id (5166 distinct IDs all pointing at the same 304 stainless material). Running solve_modal(..., n_modes=12, eigen_solver="arpack", linear_solver="auto") at OMP_NUM_THREADS=4 on the hardware above:

| Stack tip | Modal wallclock | Landings in play |
|---|---|---|
| 2026-04-23 morning | ~14.6 s | Pre-Pardiso-autoresolver; SuperLU fallback. |
| 2026-04-23 evening | ~4.6 s | Pardiso autoresolved; pre-T4-P11. |
| 2026-04-24 (current) | ~3.0 s | #110 (parallel assembly) + #112 (Pardiso ia/ja cache) + #87/#88/#95/#97/#98/#109 memory chain. |

The per-stage breakdown at 2026-04-24 is ~0.5 s on the assembly + BC-reduce side (down from ~1.5 s pre-#110), ~1 s on the Pardiso factor, and ~1.5 s across ARPACK’s ~40 shift-invert iterations. The remaining lever is the per-iteration cost; see the open “perf” PRs at any given time for incremental work against it.

Reproducer:

pytest tests/benchmarks/bench_solve.py::test_modal_solve_ptr_12_modes \
    --benchmark-only -o python_files=bench_*.py

PR #101 → #116 progression, keyed by commit hash#

For developers auditing where each landing moved the needle.

The table below is generated by walking every merge commit on main between PR #101 (the post-#94/#97/#98 memory-push snapshot) and PR #116 (pardiso _check_A / _check_b skip), rebuilding the native extension at each step via uv pip install --no-build-isolation -e . and re-running the pipeline bench (3 mesh sizes, modal solve, OMP_NUM_THREADS=4, P-core-pinned). The walker lives at perf/walk_history.py; each per-commit JSON sits at perf/snapshots/walk_<hash>.json so the progression can be regenerated without re-walking history.

What to read. Wall-time and peak RSS are per-size. Δrel vs #101 is the relative deviation of the first natural frequency against the earliest-commit value — the column that matters for the recurring “did accuracy get worse?” question.
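Concretely, assuming the walker computes the column as a plain relative deviation (a reasonable reading of the tables, not a verified excerpt of perf/walk_history.py):

```python
def delta_rel(f1: float, f1_baseline: float) -> float:
    # Relative deviation of the first natural frequency against the
    # earliest-commit (#101) value.
    return abs(f1 - f1_baseline) / abs(f1_baseline)

# The #102 row of the 80x80x2 table reproduces from its f1 values:
d = delta_rel(10.14569037717, 10.14569036801)
print(f"{d:.1e}")   # 9.0e-10
```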

80×80×2 (215 k DOFs) — where the progression is legible

| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 3.87 | 521.7 | 10.14569036801 | 0 |
| #102 | df08aba | 2.72 | 512.1 | 10.14569037717 | 9.0e-10 |
| #103 | 4c32785 | 2.75 | 491.4 | 10.14569037486 | 6.8e-10 |
| #104 | 5d0b452 | 2.61 | 489.7 | 10.14569036680 | 1.2e-10 |
| #109 | 1787e1e | 2.61 | 489.2 | 10.14569037048 | 2.4e-10 |
| #110 | ae29ac6 | 3.29 | 489.9 | 10.14569038142 | 1.3e-9 |
| #112 | 87e6f82 | 2.26 | 483.6 | 10.14569037848 | 1.0e-9 |
| #115 | 304cb89 | 2.52 | 483.5 | 10.14569037469 | 6.6e-10 |
| #116 | 1b3e144 | 2.48 | 483.0 | 10.14569036991 | 1.9e-10 |
| #113 | 8652ace | 2.57 | 483.0 | 10.14569036848 | 4.6e-11 |

48×48×2 (78 k DOFs)

| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 0.76 | 304.2 | 13.19744738477 | 0 |
| #102 | df08aba | 0.92 | 300.8 | 13.19744738477 | 6.1e-15 |
| #103 | 4c32785 | 0.88 | 300.7 | 13.19744738477 | 7.0e-15 |
| #104 | 5d0b452 | 0.66 | 297.7 | 13.19744738477 | 6.2e-15 |
| #109 | 1787e1e | 0.69 | 297.4 | 13.19744738477 | 4.0e-15 |
| #110 | ae29ac6 | 0.66 | 296.9 | 13.19744737985 | 3.7e-10 |
| #112 | 87e6f82 | 0.80 | 298.3 | 13.19744738396 | 6.2e-11 |
| #115 | 304cb89 | 0.84 | 297.9 | 13.19744738829 | 2.7e-10 |
| #116 | 1b3e144 | 0.81 | 297.3 | 13.19744738594 | 8.9e-11 |
| #113 | 8652ace | 0.73 | 297.8 | 13.19744738653 | 1.3e-10 |

32×32×2 (35 k DOFs) — Δrel dominated by machine precision

At this size the first mode converges fully; Δrel stays below 1e-9 and the jumps align cleanly with ARPACK’s stochastic convergence floor. See perf/snapshots/walk_*.json for the full table.

Ordered by magnitude of wall-time or RSS delta at 80×80×2:

  • #102 (refactor(model): grid as sole source of truth) — wall 3.87 → 2.72 s (−30 %), RSS 521 → 512 MB. Eliminates parallel _nodes / _elements dicts that mirrored the pyvista grid. The dict→grid refactor removed one O(n_nodes) Python walk per assembly call.

  • #103 (pardiso size_limit_storage=0) — RSS 512 → 491 MB (−4 %). Stops pypardiso’s internal full-A.copy() cache.

  • #104 (malloc_trim(0) at BC boundary) — wall 2.75 → 2.61 s (−5 %). Returns glibc arena scratch between assembly and factor. Bigger wins at larger scales (>100 k DOFs) where the element-batch transient is larger.

  • #109 (drop DOF layout caches) — no measurable change at 80×80×2. Scales with node count; modeled saving was ~0.6 KB per node, dominated by the Pardiso factor ceiling at this mesh.

  • #110 (parallel C++ K/M assembly) — wall 2.61 → 3.29 s at this size. Caveat: scikit-build-core does not always re-compile the C extension between walker checkouts (build cached at ~0.9 s); the number above may reflect the last-built .so rather than #110’s actual parallel path. On PTR rotor-scale meshes (mixed-topology) the PR’s own numbers show 6× speedup at 8 threads. Re-run the walker after rm -rf build/ for a clean per-commit measurement.

  • #112 (pardiso 1-based ia/ja cached) — wall 3.29 → 2.26 s (−31 % vs the just-prior sample). Eliminates ~65 MB of alloc/free per solve × ~30 solves per modal_solve.

  • #115 (vectorize _assemble layout lookup) — measured as slightly slower at 80×80×2 here; the win shows cleanly in the assembly-only bench (_bc_reduce_and_release() is 25-40 % faster at mid-ladder per the PR #115 measurements) but the 3-second end-to-end pipeline is dominated by Pardiso and eigsh.

  • #116 (pardiso skip _check_A/_check_b) — modest; removes per-solve O(n) indptr scan. Wins scale with solve count.

  • #113 (single-pass symmetric matvec from triu CSR) — wall at this size is ~flat vs #116; on PTR rotor-scale where M @ x dominates eigsh iteration cost the PR reports clearer wins.

Did accuracy get worse? No. First-frequency Δrel vs the #101 baseline stays in the 1e-15 to 1e-9 range across the whole walk, well below the roughly 1e-7 of stochastic first-mode drift that ARPACK’s tol=1e-12 convergence floor produces between independent solver invocations on a plate modal problem. The jumps visible at individual commits (for example #110 at 32×32×2 moving from 8e-16 to 4.5e-11) are ARPACK’s iteration count shifting by one or two, not a numerical regression; the Δrel value oscillates rather than drifting monotonically toward larger error.

If an MAPDL-comparison snapshot (see PERFORMANCE.md) shows a larger Δrel at the top of its size ladder — for example 2e-7 at 224×224×2 — that is the ARPACK-vs-Lanczos gap against the MAPDL reference, not a progression regression: the absolute value sits at the ARPACK tolerance floor for every post-#101 commit on the walker above.

# Walk the default commit set (above) and regenerate snapshots.
python perf/walk_history.py

# Narrower walk, larger mesh ladder.
python perf/walk_history.py \
    --commits 5d0b452 ae29ac6 87e6f82 1b3e144 \
    --sizes 64x64x2 96x96x2 128x128x2

Each commit writes perf/snapshots/walk_<hash>.json (skipped if already present, so reruns are incremental). The walker auto-stashes local edits and restores the starting branch on exit, so it’s safe to run from any branch.
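Regenerating a progression table from those cached snapshots is then a matter of loading each JSON and recomputing Δrel against the earliest record. In the sketch below, two in-memory records stand in for json.loads of the real files, and the field names are assumptions; check an actual walk_<hash>.json for the walker's real schema.

```python
# Stand-in records in place of
# json.loads(Path("perf/snapshots/walk_<hash>.json").read_text());
# the "pr" / "commit" / "wall_s" / "f1_hz" keys are assumptions.
snapshots = [
    {"pr": "#101", "commit": "ee9645a", "wall_s": 3.87, "f1_hz": 10.14569036801},
    {"pr": "#112", "commit": "87e6f82", "wall_s": 2.26, "f1_hz": 10.14569037848},
]

base = snapshots[0]["f1_hz"]
lines = [
    f'{s["pr"]}  {s["commit"]}  {s["wall_s"]:5.2f} s  '
    f'drel={abs(s["f1_hz"] - base) / base:.1e}'
    for s in snapshots
]
print("\n".join(lines))
```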

The PERFORMANCE.md changelog#

This page is a curated user-facing view. For the full dated progression of every landing — including per-element micro numbers, older pipeline sweeps, and the exact commit SHAs — see PERFORMANCE.md in the repository root.