Performance#
This page summarises where wall-time and memory go in femorph-solver’s modal +
static pipelines and how each landing has pushed them down. femorph-solver-only
numbers are reproducible from the benchmarks in the perf/
directory; the full changelog (including older rows and a historical
snapshot of femorph-solver vs Ansys MAPDL v22.2 captured by the maintainer on a
licensed install) lives in PERFORMANCE.md at the repository
root. MAPDL is used there as a gold-standard reference only —
this repository does not contain code that runs MAPDL.
Hardware and reproducers#
CPU: Intel Core i9-14900KF
OS: Linux 6.12.66 (NixOS)
Python 3.13.9, NumPy + SciPy on OpenBLAS (upstream wheels)
OMP_NUM_THREADS=4 for end-to-end runs
Reproducers:
# micro-benchmarks (per-element kernels + plate assembly)
pytest tests/benchmarks --benchmark-only -o python_files=bench_*.py
# end-to-end modal pipeline sweep (femorph-solver only)
PYTHONPATH=. .venv/bin/python perf/bench_pipeline.py
# linear + eigen backend comparison
PYTHONPATH=. .venv/bin/python perf/bench_solvers.py --size 60x60x2
# strain post-processing + OpenMP scaling
OMP_NUM_THREADS=N PYTHONPATH=. .venv/bin/python perf/bench_strain.py
Progression at a glance#
Every row is a landing that moved the modal-pipeline numbers. Micro
columns are pytest tests/benchmarks medians on the 40×40×2 plate
(15 129 DOF); end-to-end columns come from perf/bench_pipeline.py
on the 100×100×2 plate (91 809 DOF).
| # | Landing (date) | Headline change | 40² K (micro) | 40² M (micro) | modal (10 modes) | 40² peak RSS | 100² wall |
|---|---|---|---|---|---|---|---|
| 0 | Baseline (2026-04-20) | pure Python | 619 ms | 420 ms | 1.89 s | — | — |
| 1 | SOLID185 C++ (2026-04-20) | nanobind per-element Ke/Me | 94 ms | 85 ms | 1.34 s | — | — |
| 2 | SOLID186 + 187 C++ (2026-04-21) | all 3D solids in C++ | 88.7 ms | 84.7 ms | 1.29 s | — | — |
| 3 | Batch + vectorised (2026-04-21) | group-batched kernel + COO broadcast | 58.0 ms | 54.0 ms | 918 ms | 411 MB | 13.21 s |
| 4 | CSR-direct C++ (2026-04-21) | | 16.2 ms | 12.5 ms | 621 ms | 337 MB | — |
| 5 | Auto-CHOLMOD (2026-04-21) | | ≈16 ms | ≈13 ms | 247 ms | 289 MB | 2.29 s |
| 6 | Parallel assembly (2026-04-24) | group-by-material-value + OMP scatter + unified C++ pattern | ≈18 ms | ≈20 ms | ≈240 ms | 289 MB | ≈2.2 s |
| Δ row 5 vs row 0 | | | ×39 | ×34 | ×7.7 | | |
After row 5 the modal pipeline is no longer assembly-bound or solver-bound in Python — ~80 % of wall lives inside CHOLMOD factorisation + ARPACK orthogonalisation (BLAS/LAPACK).
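The hot path after row 5 is the standard shift-invert generalised eigensolve. A minimal SciPy sketch of that call shape (illustrative only: SciPy's default SuperLU factorisation stands in for CHOLMOD, and a toy tridiagonal K with unit mass stands in for the assembled plate):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

# Toy SPD stiffness/mass pair standing in for the assembled K and M.
n = 200
K = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
M = diags([1.0], [0], shape=(n, n), format="csc")

# Shift-invert about sigma=0: every ARPACK iteration solves
# (K - sigma*M) x = b, which is where the factorisation backend
# (CHOLMOD in femorph-solver) spends its wall-time.
vals, vecs = eigsh(K, k=10, M=M, sigma=0, which="LM")
print(min(vals))  # lowest generalised eigenvalue of the toy pair
```

The remaining wall-time then splits between the factorisation inside the solve and ARPACK's re-orthogonalisation, both of which bottom out in BLAS/LAPACK.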
Row 6’s headline gains do not show up on the single-material
40×40×2 plate — that mesh is already one element group (all cells
share a single mat_id + real_id), so the mat_id
fragmentation fix in row 6 is a no-op and the pattern is cached
between K and M. The win appears on mixed-topology decks with
per-cell mat_id like the Purdue Transonic Rotor (ptr.cdb
— 57 k DOFs, 5166 cells across SOLID185/186/186W/186P/187, one
mat_id per cell), measured in
Thread scaling on a mixed-topology deck (PTR rotor) below.
Head-to-head vs MAPDL (2026-04-21)#
Full per-stage comparison on the SOLID185 plate, 10-mode modal solve,
swept from 1 k to ~1.36 M DOF. MAPDL v22.2 runs the same mesh (emitted
as a CDB via mapdl_archive) in Docker with -smp on 4 threads.
| Mesh | DOF | femorph-solver wall | MAPDL wall | Speedup | femorph-solver peak | MAPDL peak |
|---|---|---|---|---|---|---|
| 10×10×2 | 1 089 | 0.015 s | 3.68 s | ×250 | 178 MB | 2 112 MB |
| 40×40×2 | 15 129 | 0.249 s | 4.04 s | ×16 | 290 MB | 2 112 MB |
| 100×100×2 | 91 809 | 2.293 s | 6.45 s | ×2.8 | 938 MB | 2 624 MB |
| 150×150×4 | 342 015 | 11.59 s | 19.45 s | ×1.7 | 3 882 MB | 6 011 MB |
| 200×200×4 | 606 015 | 21.52 s | 33.87 s | ×1.6 | 7 002 MB | 10 019 MB |
| 300×300×4 | 1 359 015 | 52.64 s | 85.85 s | ×1.6 | 16 065 MB | 19 851 MB |
Observations#
Wall-time: femorph-solver finishes faster than MAPDL at every size in this sweep. The ratio settles at ~×1.6 from 200 k DOF upward — both codes are then bound by the BLAS cost of sparse Cholesky factorisation, so further narrowing stops.
Assembly: femorph-solver’s K+M takes 3.2×–6.0× less wall-time than MAPDL’s element-matrix CP block here. At 100×100×2 femorph-solver is 0.32 s vs MAPDL 1.91 s; at 300×300×4 it’s 5.33 s vs 17.05 s.
Eigensolve: femorph-solver’s shift-invert runs 1.37×–1.72× shorter, and the gap widens with N. CHOLMOD’s AMD ordering produces progressively less fill than MAPDL’s LANB block-Lanczos factor on these regular meshes.
Peak memory: femorph-solver’s RSS is 0.08× MAPDL’s at 10×10×2 rising to 0.81× at 300×300×4. MAPDL carries a ~2.1 GB floor of COMMON blocks + scratch; as N grows both codes converge on “factor fill-in”.
First-frequency agreement drifts with mesh refinement up to 100×100×2, then converges again at nz=4. Δf₁ stays < 2 % for well-meshed cases; the elevated 6.8 % at 100×100×2 is a single-element-thick bending-dominated plate physics artefact (different ESF / shear defaults between the two SOLID185 formulations), not a solver bug.
Thread scaling on a mixed-topology deck (PTR rotor)#
The Purdue Transonic Rotor (ptr.cdb — 18 954 nodes, 5 166 cells,
56 862 DOFs) exercises every assembly path the flat plate does not:
four element types in one mesh (3069 SOLID186 hex, 497 SOLID186
pyramid, 19 SOLID186 wedge, 1581 SOLID187 tet), per-cell mat_id
(1..5166) all pointing at the same 304-stainless material, and a
real-world factor that hits the scipy multi-group pattern fallback.
Measurements below are cold stiffness_matrix / mass_matrix
calls on an Intel Core i9-14900KF, pattern cache invalidated before
each run. The “baseline (main)” column is main immediately
before the T4-P11 landing — after the #102 / #103 / #104 perf
series already in place — so the delta reflects only the T4-P11
work, not the earlier memory-reduction chain.
| Stage / thread count | T=1 | T=4 | T=8 | baseline (main, T=1/4/8) | Δ @ T=8 |
|---|---|---|---|---|---|
| stiffness_matrix (K) | 322 ms | 182 ms | 127 ms | 648 / 440 / 364 ms | ×2.9 |
| mass_matrix (M) | 228 ms | 152 ms | 83 ms | 342 / 305 / 227 ms | ×2.7 |
Scaling from T=1 to T=8 on the branch is ×2.5 on K and ×2.7 on M — bounded by the pattern build (which is partially serial in Step A: the row-to-element inverse index) and the memory-bandwidth ceiling on the CSR scatter. Further wins would require either a parallel Step A, larger meshes that amortise OMP region overhead better, or a cache-tiled scatter.
Three independent fixes compose:

1. Group by material value, not mat_id. Keying Model._assemble’s per-group loop on the hashed material tuple collapses 5 166 singleton groups (one per mat_id) back down to the 4 real element-type groups, restoring ke_batch/me_batch batching on decks that assign a unique mat_id per cell. ~2× on its own, single-threaded.
2. Unified C++ pattern builder for mixed-topology meshes. _build_csr_pattern pads each group’s DOF array to the max ndof_per_elem with -1 sentinels and dispatches to the single-group C++ builder, which already ignores -1 entries. Eliminates the scipy coo_array → tocsr fallback that had been materialising ~1.5 M COO entries and deduplicating them Python-side.
3. OpenMP on the hot loops. scatter_into_csr is element-parallel with #pragma omp atomic on shared CSR slots. build_csr_pattern Steps B and C run row-parallel with a per-thread marker buffer (classic FE row-colouring). Numerically bit-exact against the serial build on PTR — atomic adds preserve the output despite non-deterministic ordering.
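A pure-Python sketch of the first two fixes (group_by_material, pad_group_dofs, and the tuple keys are illustrative names, not femorph-solver's actual internals):

```python
import numpy as np

def group_by_material(mat_ids, materials, etypes):
    """Collapse per-cell mat_ids into groups keyed by (etype, material value).

    Cells whose mat_id differs but whose material *values* match land in the
    same group, restoring batched kernels on one-mat_id-per-cell decks.
    """
    groups = {}
    for cell, mid in enumerate(mat_ids):
        key = (etypes[cell], materials[mid])  # hash the value, not the id
        groups.setdefault(key, []).append(cell)
    return groups

def pad_group_dofs(group_dofs, fill=-1):
    """Pad each group's (n_elem, ndof) DOF array to the max ndof with -1
    sentinels so one single-group builder (which skips -1) covers them all."""
    max_ndof = max(d.shape[1] for d in group_dofs)
    padded = [
        np.pad(d, ((0, 0), (0, max_ndof - d.shape[1])), constant_values=fill)
        for d in group_dofs
    ]
    return np.vstack(padded)

# 6 cells, each with a unique mat_id, but only one distinct material value
# and two element types -> 2 groups instead of 6 singletons.
steel = ("E=193e9", "nu=0.29")
materials = {i: steel for i in range(6)}
groups = group_by_material(
    mat_ids=[0, 1, 2, 3, 4, 5],
    materials=materials,
    etypes=["hex", "hex", "hex", "tet", "tet", "tet"],
)
print(len(groups))  # 2
```

The -1 sentinel convention matches what the text describes for the C++ builder: the padded rows keep a rectangular shape without affecting which CSR slots get touched.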
Reproducer:
pytest tests/benchmarks/bench_assembly.py \
--benchmark-only -o python_files=bench_*.py \
-k 'ptr_thread_scaling'
Strain post-processing (Model.eel)#
The elastic-strain feature (MAPDL S,EPEL equivalent) evaluates
ε(ξ_node) = B(ξ_node) · u_e per element and optionally averages
to the nodes (PLNSOL style). The batch kernel is OpenMP-parallel
across elements.
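The per-corner evaluation is a batched contraction. A NumPy sketch of the shape contract (shapes assume a SOLID185 hex; the real C++ kernel performs the same contraction element-parallel):

```python
import numpy as np

n_elem, n_corner, ndof = 1000, 8, 24      # SOLID185: 8 nodes x 3 DOF each
rng = np.random.default_rng(0)

# B evaluated at each corner node's natural coordinate: (elem, node, 6, ndof).
B = rng.standard_normal((n_elem, n_corner, 6, ndof))
u_e = rng.standard_normal((n_elem, ndof))  # gathered element displacements

# eps[e, n, :] = B[e, n] @ u_e[e] -> one 6-component strain per corner node.
eps = np.einsum("enij,ej->eni", B, u_e)
print(eps.shape)  # (1000, 8, 6)
```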
Raw C++ strain kernel scaling (SOLID185 hex plate, median of 3 runs after warm-up):
| Mesh | n_elem | T=1 | T=2 | T=4 | T=8 | T=16 | T=16 vs T=1 |
|---|---|---|---|---|---|---|---|
| 50×50×2 | 5 000 | 1.5 ms | 0.9 ms | 0.5 ms | 0.3 ms | 0.1 ms | ×15 |
| 100×100×2 | 20 000 | 5.2 ms | 2.7 ms | 2.0 ms | 1.0 ms | 0.5 ms | ×10 |
| 150×150×4 | 90 000 | 32.4 ms | 16.9 ms | 9.6 ms | 6.4 ms | 4.3 ms | ×7.5 |
| 200×200×4 | 160 000 | 58.0 ms | 31.0 ms | 18.2 ms | 14.1 ms | 7.8 ms | ×7.4 |
Full Model.eel wall (includes Python-side nodal averaging via
np.searchsorted + np.add.at):
| Mesh | n_dof | T=1 avg | T=2 avg | T=4 avg | T=8 avg | T=16 avg |
|---|---|---|---|---|---|---|
| 50×50×2 | 23 409 | 9.1 ms | 8.4 ms | 8.2 ms | 8.0 ms | 8.0 ms |
| 100×100×2 | 91 809 | 36.7 ms | 34.9 ms | 33.2 ms | 32.4 ms | 33.4 ms |
| 150×150×4 | 342 015 | 169.9 ms | 150.9 ms | 145.0 ms | 142.6 ms | 143.0 ms |
| 200×200×4 | 606 015 | 308.9 ms | 287.2 ms | 280.5 ms | 285.9 ms | 292.5 ms |
Threads help the strain kernel itself (~×7 at T=16 on 200×200×4) but
don’t yet move the needle on Model.eel end-to-end — at T=1 on
200×200×4 the C++ batch is 58 ms out of 309 ms total, so the
remaining ~250 ms lives in the unparallelised Python scatter/average.
Vectorising or moving the nodal-average reduction to C++ is the next
opportunity.
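That Python-side average is a scatter-add over the connectivity followed by a divide. A minimal sketch under assumed array shapes (nodal_average and its shapes are illustrative, not the Model.eel internals):

```python
import numpy as np

def nodal_average(eps_elem, conn, n_nodes):
    """Average per-element-corner strains to the nodes (PLNSOL-style).

    eps_elem: (n_elem, n_corner, 6) strains at element corner nodes
    conn:     (n_elem, n_corner) global node id per corner
    """
    acc = np.zeros((n_nodes, 6))
    count = np.zeros(n_nodes)
    # Unbuffered scatter-add: nodes shared by several elements accumulate,
    # which is why np.add.at (not fancy-index +=) is required here.
    np.add.at(acc, conn.ravel(), eps_elem.reshape(-1, 6))
    np.add.at(count, conn.ravel(), 1.0)
    return acc / np.maximum(count, 1.0)[:, None]

# Two quads sharing an edge: nodes 1 and 2 are averaged over both elements.
conn = np.array([[0, 1, 2, 3], [1, 4, 5, 2]])
eps_elem = np.stack([np.zeros((4, 6)), np.ones((4, 6))])
avg = nodal_average(eps_elem, conn, n_nodes=6)
print(avg[1, 0])  # 0.5: node 1 sees one 0-strain and one 1-strain element
```

np.add.at is unbuffered and therefore serial, which is consistent with the ~250 ms that stays flat across thread counts in the table above.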
PR #101 → #116 progression — by commit hash#
For developers auditing where each landing moved the needle.
The table below is generated by walking every merge commit on
main between PR #101 (the post-#94/#97/#98 memory-push snapshot)
and PR #116 (pardiso _check_A / _check_b skip), rebuilding
the native extension at each step via
uv pip install --no-build-isolation -e . and re-running the
pipeline bench (3 mesh sizes, modal solve, OMP_NUM_THREADS=4,
P-core-pinned). The walker lives at perf/walk_history.py;
each per-commit JSON sits at
perf/snapshots/walk_<hash>.json so the progression can be
regenerated without re-walking history.
What to read. Wall-time and peak RSS are per-size. Δrel vs
#101 is the relative deviation of the first natural frequency
against the earliest-commit value — the column that matters for
the recurring “did accuracy get worse?” question.
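Δrel is the plain relative deviation, sketched here with the #101 and #102 first-frequency values from the 80×80×2 table:

```python
# First-frequency values from the 80×80×2 walk table: #101 baseline vs #102.
f1_ref = 10.14569036801
f1 = 10.14569037717

delta_rel = abs(f1 - f1_ref) / f1_ref
print(f"{delta_rel:.1e}")  # 9.0e-10, the Δrel reported for #102
```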
80×80×2 (215 k DOFs) — where the progression is legible
| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 3.87 | 521.7 | 10.14569036801 | 0 |
| #102 | df08aba | 2.72 | 512.1 | 10.14569037717 | 9.0e-10 |
| #103 | 4c32785 | 2.75 | 491.4 | 10.14569037486 | 6.8e-10 |
| #104 | 5d0b452 | 2.61 | 489.7 | 10.14569036680 | 1.2e-10 |
| #109 | 1787e1e | 2.61 | 489.2 | 10.14569037048 | 2.4e-10 |
| #110 | ae29ac6 | 3.29 | 489.9 | 10.14569038142 | 1.3e-9 |
| #112 | 87e6f82 | 2.26 | 483.6 | 10.14569037848 | 1.0e-9 |
| #115 | 304cb89 | 2.52 | 483.5 | 10.14569037469 | 6.6e-10 |
| #116 | 1b3e144 | 2.48 | 483.0 | 10.14569036991 | 1.9e-10 |
| #113 | 8652ace | 2.57 | 483.0 | 10.14569036848 | 4.6e-11 |
48×48×2 (78 k DOFs)
| PR | commit | wall (s) | peak MB | f1 (Hz) | Δrel vs #101 |
|---|---|---|---|---|---|
| #101 | ee9645a | 0.76 | 304.2 | 13.19744738477 | 0 |
| #102 | df08aba | 0.92 | 300.8 | 13.19744738477 | 6.1e-15 |
| #103 | 4c32785 | 0.88 | 300.7 | 13.19744738477 | 7.0e-15 |
| #104 | 5d0b452 | 0.66 | 297.7 | 13.19744738477 | 6.2e-15 |
| #109 | 1787e1e | 0.69 | 297.4 | 13.19744738477 | 4.0e-15 |
| #110 | ae29ac6 | 0.66 | 296.9 | 13.19744737985 | 3.7e-10 |
| #112 | 87e6f82 | 0.80 | 298.3 | 13.19744738396 | 6.2e-11 |
| #115 | 304cb89 | 0.84 | 297.9 | 13.19744738829 | 2.7e-10 |
| #116 | 1b3e144 | 0.81 | 297.3 | 13.19744738594 | 8.9e-11 |
| #113 | 8652ace | 0.73 | 297.8 | 13.19744738653 | 1.3e-10 |
32×32×2 (35 k DOFs) — Δrel dominated by machine precision
At this size the first mode converges fully; Δrel stays below 1e-9
and the jumps align cleanly with ARPACK’s stochastic convergence
floor. See perf/snapshots/walk_*.json for the full table.
Ordered by magnitude of wall-time or RSS delta at 80×80×2:
- #102 (refactor(model): grid as sole source of truth) — wall 3.87 → 2.72 s (−30 %), RSS 521 → 512 MB. Eliminates parallel _nodes/_elements dicts that mirrored the pyvista grid; the dict→grid refactor removed one O(n_nodes) Python walk per assembly call.
- #103 (pardiso size_limit_storage=0) — RSS 512 → 491 MB (−4 %). Stops pypardiso’s internal full-A .copy() cache.
- #104 (malloc_trim(0) at BC boundary) — wall 2.75 → 2.61 s (−5 %). Returns glibc arena scratch between assembly and factor. Bigger wins at larger scales (>100 k DOFs) where the element-batch transient is larger.
- #109 (drop DOF layout caches) — no measurable change at 80×80×2. Scales with node count; modeled saving was ~0.6 KB per node, dominated by the Pardiso factor ceiling at this mesh.
- #110 (parallel C++ K/M assembly) — wall 2.61 → 3.29 s at this size. Caveat: scikit-build-core does not always re-compile the C extension between walker checkouts (build cached at ~0.9 s); the number above may reflect the last-built .so rather than #110’s actual parallel path. On PTR rotor-scale (mixed-topology) meshes the PR’s own numbers show a 6× speedup at 8 threads. Re-run the walker after rm -rf build/ for a clean per-commit measurement.
- #112 (pardiso 1-based ia/ja cached) — wall 3.29 → 2.26 s (−31 % vs the just-prior sample). Eliminates ~65 MB of alloc/free per solve × ~30 solves per modal_solve.
- #115 (vectorize _assemble layout lookup) — measured as slightly slower at 80×80×2 here; the win shows cleanly in the assembly-only bench (_bc_reduce_and_release() is 25–40 % faster at mid-ladder per the PR #115 measurements), but the 3-second end-to-end pipeline is dominated by Pardiso and eigsh.
- #116 (pardiso skip _check_A/_check_b) — modest; removes a per-solve O(n) indptr scan. Wins scale with solve count.
- #113 (single-pass symmetric matvec from triu CSR) — ~flat vs #116 at this size; on PTR rotor-scale meshes, where M @ x dominates eigsh iteration cost, the PR reports clearer wins.
Did accuracy get worse? No. First-frequency Δrel vs the #101 baseline stays in the
1e-15 → 1e-9 range across the whole walk — well below the
ARPACK tol=1e-12 convergence floor, which on a plate modal
problem gives roughly 1e-7 of stochastic first-mode drift
between independent solver invocations. The jumps visible at
individual commits (for example #110 at 32×32×2 moving from
8e-16 to 4.5e-11) are ARPACK’s iteration count shifting
by one or two, not a numerical regression; the Δrel
value oscillates rather than drifting monotonically toward larger
error.
If a MAPDL-comparison snapshot (see PERFORMANCE.md)
shows a larger Δrel at the top of its size ladder — for example
2e-7 at 224×224×2 — that is the ARPACK-vs-Lanczos gap
against the MAPDL reference, not a progression regression: the
absolute value sits at the ARPACK tolerance floor for every
post-#101 commit on the walker above.
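That stochastic floor is easy to demonstrate on a toy problem: two eigsh calls that differ only in their random starting vector agree to far better than 1e-7 (toy stiffness matrix; illustrative only, not the femorph-solver pipeline):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import eigsh

n = 300
K = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")

runs = []
for seed in (0, 1):
    # Same matrix, same tol, different random starting vector v0.
    v0 = np.random.default_rng(seed).standard_normal(n)
    vals, _ = eigsh(K, k=5, sigma=0, which="LM", tol=1e-12, v0=v0)
    runs.append(min(vals))

drift = abs(runs[0] - runs[1]) / runs[0]
print(drift)  # run-to-run drift, orders of magnitude below 1e-7
```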
# Walk the default commit set (above) and regenerate snapshots.
python perf/walk_history.py
# Narrower walk, larger mesh ladder.
python perf/walk_history.py \
--commits 5d0b452 ae29ac6 87e6f82 1b3e144 \
--sizes 64x64x2 96x96x2 128x128x2
Each commit writes perf/snapshots/walk_<hash>.json (skipped
if already present so reruns are incremental). The walker auto-
stashes local edits and restores the starting branch on exit, so
it’s safe to run from any branch.
The PERFORMANCE.md changelog#
This page is a curated user-facing view. For the full dated
progression of every landing — including per-element micro numbers,
older pipeline sweeps, and the exact commit SHAs — see
PERFORMANCE.md in the repository root.