Performance
===========

This page summarises where wall-time and memory go in femorph-solver's
modal + static pipelines and how each landing has pushed them down.
femorph-solver-only numbers are reproducible from the benchmarks in the
``perf/`` directory; the full changelog (including older rows and a
historical snapshot of femorph-solver vs Ansys MAPDL v22.2 captured by
the maintainer on a licensed install) lives in :file:`PERFORMANCE.md` at
the repository root. MAPDL is used there as a gold-standard reference
only — this repository does not contain code that runs MAPDL.

.. contents::
   :local:
   :depth: 2

Hardware and reproducers
------------------------

- CPU: Intel Core i9-14900KF
- OS: Linux 6.12.66 (NixOS)
- Python 3.13.9, NumPy + SciPy on OpenBLAS (upstream wheels)
- ``OMP_NUM_THREADS=4`` for end-to-end runs

For end-to-end rotor wallclock on a production-style mixed-topology deck,
see :ref:`ptr-modal-benches` below.

Reproducers:

.. code-block:: bash

   # micro-benchmarks (per-element kernels + plate assembly)
   pytest tests/benchmarks --benchmark-only -o python_files=bench_*.py

   # end-to-end modal pipeline sweep (femorph-solver only)
   PYTHONPATH=. .venv/bin/python perf/bench_pipeline.py

   # linear + eigen backend comparison
   PYTHONPATH=. .venv/bin/python perf/bench_solvers.py --size 60x60x2

   # strain post-processing + OpenMP scaling
   OMP_NUM_THREADS=N PYTHONPATH=. .venv/bin/python perf/bench_strain.py

Progression at a glance
-----------------------

Every row is a landing that moved the modal-pipeline numbers. Micro
columns are ``pytest tests/benchmarks`` medians on the 40×40×2 plate
(15 129 DOF); end-to-end columns come from ``perf/bench_pipeline.py`` on
the 100×100×2 plate (91 809 DOF).
.. list-table::
   :widths: 5 30 30 12 12 12 12 12
   :header-rows: 1

   * - #
     - Landing (date)
     - Headline change
     - ``assemble_k``
     - ``assemble_m``
     - modal (10 modes)
     - 40² peak RSS
     - 100² wall
   * - 0
     - Baseline (2026-04-20)
     - pure Python ``BᵀCB``
     - 619 ms
     - 420 ms
     - 1.89 s
     - —
     - —
   * - 1
     - HEX8 C++ (2026-04-20)
     - nanobind per-element Ke/Me
     - 94 ms
     - 85 ms
     - 1.34 s
     - —
     - —
   * - 2
     - HEX20 + 187 C++ (2026-04-21)
     - all 3D solids in C++
     - 88.7 ms
     - 84.7 ms
     - 1.29 s
     - —
     - —
   * - 3
     - Batch + vectorised (2026-04-21)
     - group-batched kernel + COO broadcast
     - 58.0 ms
     - 54.0 ms
     - 918 ms
     - 411 MB
     - 13.21 s
   * - 4
     - CSR-direct C++ (2026-04-21)
     - ``build_csr_pattern`` + ``scatter_into_csr``
     - 16.2 ms
     - 12.5 ms
     - 621 ms
     - 337 MB
     - —
   * - 5
     - Auto-CHOLMOD (2026-04-21)
     - ``linear_solver="auto"`` → supernodal Cholesky
     - ≈16 ms
     - ≈13 ms
     - 247 ms
     - 289 MB
     - 2.29 s
   * - 6
     - **Parallel assembly (2026-04-24)**
     - group-by-material-value + OMP scatter + unified C++ pattern
     - ≈18 ms
     - ≈20 ms
     - ≈240 ms
     - 289 MB
     - ≈2.2 s
   * -
     - **Δ row 5 vs row 0**
     -
     - **×39**
     - **×34**
     - **×7.7**
     -
     -

After row 5 the modal pipeline is **no longer assembly-bound or
solver-bound in Python** — ~80 % of wall lives inside CHOLMOD
factorisation + ARPACK orthogonalisation (BLAS/LAPACK).

Row 6's headline gains *do not show up on the single-material 40×40×2
plate* — that mesh is already one element group (all cells share a
single ``mat_id`` + ``real_id``), so the ``mat_id`` fragmentation fix in
row 6 is a no-op and the pattern is cached between K and M. The win
appears on **mixed-topology decks with per-cell** ``mat_id`` like the
Purdue Transonic Rotor (``ptr.cdb`` — 57 k DOFs, 5166 cells across
HEX8/186/186W/186P/187, one ``mat_id`` per cell), measured in
:ref:`assembly-thread-scaling` below.

Head-to-head vs MAPDL (2026-04-21)
----------------------------------

Full per-stage comparison on the HEX8 plate, 10-mode modal solve, swept
from 1 k to ~1.36 M DOF.
MAPDL v22.2 runs the same mesh (emitted as a CDB via
:mod:`mapdl_archive`) in Docker with ``-smp`` on 4 threads.

.. list-table::
   :widths: 15 12 14 14 16 16 15
   :header-rows: 1

   * - Mesh
     - DOF
     - femorph-solver wall
     - MAPDL wall
     - Speedup
     - femorph-solver peak
     - MAPDL peak
   * - 10×10×2
     - 1 089
     - 0.015 s
     - 3.68 s
     - ×250
     - 178 MB
     - 2 112 MB
   * - 40×40×2
     - 15 129
     - 0.249 s
     - 4.04 s
     - ×16
     - 290 MB
     - 2 112 MB
   * - 100×100×2
     - 91 809
     - 2.293 s
     - 6.45 s
     - ×2.8
     - 938 MB
     - 2 624 MB
   * - 150×150×4
     - 342 015
     - 11.59 s
     - 19.45 s
     - ×1.7
     - 3 882 MB
     - 6 011 MB
   * - 200×200×4
     - 606 015
     - 21.52 s
     - 33.87 s
     - ×1.6
     - 7 002 MB
     - 10 019 MB
   * - 300×300×4
     - 1 359 015
     - 52.64 s
     - 85.85 s
     - ×1.6
     - 16 065 MB
     - 19 851 MB

Observations
^^^^^^^^^^^^

- **Wall-time: femorph-solver finishes below MAPDL across the sweep on
  this benchmark.** The ratio settles at ~×1.6 from 200 k DOF upward —
  both codes are then bound by the BLAS cost of sparse Cholesky
  factorisation, so further narrowing stops.
- **Assembly: femorph-solver's K+M takes 3.2×–6.0× less wall-time than
  MAPDL's element-matrix CP block here.** At 100×100×2 femorph-solver is
  0.32 s vs MAPDL 1.91 s; at 300×300×4 it's 5.33 s vs 17.05 s.
- **Eigensolve: femorph-solver's shift-invert runs 1.37×–1.72× shorter,
  and the gap widens with N**. CHOLMOD's AMD ordering produces
  progressively less fill than MAPDL's LANB block-Lanczos factor on
  these regular meshes.
- **Peak memory: femorph-solver's RSS is 0.08× MAPDL's at 10×10×2,
  rising to 0.81× at 300×300×4.** MAPDL carries a ~2.1 GB floor of
  COMMON blocks + scratch; as N grows both codes converge on "factor
  fill-in".
- **First-frequency agreement drifts with mesh refinement up to
  100×100×2, then converges again at nz=4**. Δf₁ stays < 2 % for
  well-meshed cases; the elevated 6.8 % at 100×100×2 is a
  single-element-thick bending-dominated plate physics artefact
  (different ESF / shear defaults between the two HEX8 formulations),
  not a solver bug.
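For orientation, the eigensolve stage compared above is a shift-invert
generalized solve of :math:`K x = \lambda M x`. A minimal SciPy sketch on
a toy SPD pair (``modal_frequencies`` and the matrices here are
illustrative stand-ins, not femorph-solver's API):

.. code-block:: python

   import numpy as np
   from scipy.sparse import diags, identity
   from scipy.sparse.linalg import eigsh

   def modal_frequencies(K, M, n_modes=10):
       """Lowest natural frequencies (Hz) of K x = lambda M x.

       sigma=0 puts ARPACK in shift-invert mode targeting the smallest
       eigenvalues; the dominant cost is factorising K - sigma*M once
       (SuperLU by default in SciPy; a supernodal Cholesky such as
       CHOLMOD is the drop-in win discussed on this page).
       """
       w, _ = eigsh(K, k=n_modes, M=M, sigma=0, which="LM")
       return np.sqrt(np.abs(np.sort(w))) / (2.0 * np.pi)

   # toy stand-ins for assembled stiffness/mass: 1-D Laplacian vs identity
   n = 200
   K = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
   M = identity(n, format="csc")
   freqs = modal_frequencies(K, M, n_modes=4)

Each ARPACK iteration is then one solve against the cached factor, which
is why the factorisation backend dominates the modal wall-times above.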
.. _assembly-thread-scaling:

Thread scaling on a mixed-topology deck (PTR rotor)
---------------------------------------------------

The Purdue Transonic Rotor (``ptr.cdb`` — 18 954 nodes, 5 166 cells,
56 862 DOFs) exercises every assembly path the flat plate does not: four
element types in one mesh (3069 HEX20 hex, 497 HEX20 pyramid, 19 HEX20
wedge, 1581 TET10 tet), per-cell ``mat_id`` (1..5166) all pointing at
the same 304-stainless material, and a real-world factor that hits the
scipy multi-group pattern fallback.

Measurements below are cold ``stiffness_matrix`` / ``mass_matrix`` calls
on an Intel Core i9-14900KF, pattern cache invalidated before each run.
The "baseline (main)" column is ``main`` immediately before the T4-P11
landing — after the #102 / #103 / #104 perf series already in place — so
the delta reflects only the T4-P11 work, not the earlier
memory-reduction chain.

.. list-table::
   :widths: 28 12 12 12 22 14
   :header-rows: 1

   * - Stage / thread count
     - T=1
     - T=4
     - T=8
     - baseline (main, T=1/4/8)
     - Δ @ T=8
   * - ``stiffness_matrix`` (cold)
     - 322 ms
     - 182 ms
     - 127 ms
     - 648 / 440 / 364 ms
     - **×2.9**
   * - ``mass_matrix`` (cold)
     - 228 ms
     - 152 ms
     - 83 ms
     - 342 / 305 / 227 ms
     - **×2.7**

Scaling from T=1 to T=8 on the branch is **×2.5 on K** and **×2.7 on
M** — bounded by the pattern build (which is partially serial in Step A:
the row-to-element inverse index) and the memory-bandwidth ceiling on
the CSR scatter. Further wins would require either a parallel Step A,
larger meshes that amortise OMP region overhead better, or a cache-tiled
scatter.

Three independent fixes compose:

1. **Group by material value, not** ``mat_id``. Keying
   ``Model._assemble``'s per-group loop on the hashed material tuple
   collapses 5166 singleton groups (one per ``mat_id``) back down to the
   4 real element-type groups, restoring ``ke_batch`` / ``me_batch``
   batching on decks that assign a unique ``mat_id`` per cell. ~2× on
   its own, single-threaded.
2. **Unified C++ pattern builder for mixed-topology meshes.**
   ``_build_csr_pattern`` pads each group's DOF array to the max
   ``ndof_per_elem`` with ``-1`` sentinels and dispatches to the
   single-group C++ builder, which already ignores ``-1`` entries.
   Eliminates the scipy ``coo_array → tocsr`` fallback that had been
   materialising ~1.5 M COO entries and deduplicating them Python-side.

3. **OpenMP on the hot loops.** ``scatter_into_csr`` is element-parallel
   with ``#pragma omp atomic`` on shared CSR slots.
   ``build_csr_pattern`` Steps B and C run row-parallel with a
   per-thread marker buffer (classic FE row-colouring). Numerically
   bit-exact against the serial build on PTR — atomic adds preserve the
   output despite non-deterministic ordering.

Reproducer:

.. code-block:: bash

   pytest tests/benchmarks/bench_assembly.py \
       --benchmark-only -o python_files=bench_*.py \
       -k 'ptr_thread_scaling'

Strain post-processing (``Model.eel``)
--------------------------------------

The elastic-strain feature (``MAPDL S,EPEL`` equivalent) evaluates
``ε(ξ_node) = B(ξ_node) · u_e`` per element and optionally averages to
the nodes (``PLNSOL`` style). The batch kernel is OpenMP-parallel across
elements.

Raw C++ strain kernel scaling (HEX8 hex plate, median of 3 runs after
warm-up):

.. list-table::
   :widths: 20 15 12 12 12 12 12 15
   :header-rows: 1

   * - Mesh
     - n_elem
     - T=1
     - T=2
     - T=4
     - T=8
     - T=16
     - T=16 vs T=1
   * - 50×50×2
     - 5 000
     - 1.5 ms
     - 0.9 ms
     - 0.5 ms
     - 0.3 ms
     - 0.1 ms
     - ×15
   * - 100×100×2
     - 20 000
     - 5.2 ms
     - 2.7 ms
     - 2.0 ms
     - 1.0 ms
     - 0.5 ms
     - ×10
   * - 150×150×4
     - 90 000
     - 32.4 ms
     - 16.9 ms
     - 9.6 ms
     - 6.4 ms
     - 4.3 ms
     - ×7.5
   * - 200×200×4
     - 160 000
     - 58.0 ms
     - 31.0 ms
     - 18.2 ms
     - 14.1 ms
     - 7.8 ms
     - ×7.4

Full ``Model.eel`` wall (includes Python-side nodal averaging via
``np.searchsorted`` + ``np.add.at``):
.. list-table::
   :widths: 20 15 12 12 12 12 12
   :header-rows: 1

   * - Mesh
     - n_dof
     - T=1 avg
     - T=2 avg
     - T=4 avg
     - T=8 avg
     - T=16 avg
   * - 50×50×2
     - 23 409
     - 9.1 ms
     - 8.4 ms
     - 8.2 ms
     - 8.0 ms
     - 8.0 ms
   * - 100×100×2
     - 91 809
     - 36.7 ms
     - 34.9 ms
     - 33.2 ms
     - 32.4 ms
     - 33.4 ms
   * - 150×150×4
     - 342 015
     - 169.9 ms
     - 150.9 ms
     - 145.0 ms
     - 142.6 ms
     - 143.0 ms
   * - 200×200×4
     - 606 015
     - 308.9 ms
     - 287.2 ms
     - 280.5 ms
     - 285.9 ms
     - 292.5 ms

Threads help the strain kernel itself (~×7 at T=16 on 200×200×4) but
don't yet move the needle on ``Model.eel`` end-to-end — at T=1 on
200×200×4 the C++ batch is 58 ms out of 309 ms total, so the remaining
~250 ms lives in the unparallelised Python scatter/average. Vectorising
or moving the nodal-average reduction to C++ is the next opportunity.

.. _ptr-modal-benches:

End-to-end PTR modal wallclock
------------------------------

The Purdue Transonic Rotor represents the production-style modal
workload: 57 k DOFs, 5166 cells, mixed HEX8/186/186W/186P/187 topology,
per-cell ``mat_id`` (5166 distinct IDs all pointing at the same 304
stainless material). Running ``solve_modal(..., n_modes=12,
eigen_solver="arpack", linear_solver="auto")`` at ``OMP_NUM_THREADS=4``
on the hardware above:

.. list-table::
   :widths: 40 18 42
   :header-rows: 1

   * - Stack tip
     - Modal wallclock
     - Landings in play
   * - 2026-04-23 morning
     - ~14.6 s
     - Pre-Pardiso-autoresolver; SuperLU fallback.
   * - 2026-04-23 evening
     - ~4.6 s
     - Pardiso autoresolved; pre-T4-P11.
   * - 2026-04-24 (current)
     - **~3.0 s**
     - #110 (parallel assembly) + #112 (Pardiso ia/ja cache) +
       #87/#88/#95/#97/#98/#109 memory chain.

The per-stage breakdown at 2026-04-24 is ~0.5 s total on the assembly +
BC-reduce side (down from ~1.5 s pre-#110), ~1 s on the Pardiso factor,
and ~1.5 s across ARPACK's ~40 shift-invert iterations. The remaining
lever is the per-iteration cost — see the open "perf" PRs at any given
time for incremental work against it.

Reproducer:
.. code-block:: bash

   pytest tests/benchmarks/bench_solve.py::test_modal_solve_ptr_12_modes \
       --benchmark-only -o python_files=bench_*.py

PR #101 → #116 progression — hashed by commit
---------------------------------------------

*For developers auditing where each landing moved the needle.*

The table below is generated by walking every merge commit on ``main``
between PR #101 (the post-#94/#97/#98 memory-push snapshot) and PR #116
(pardiso ``_check_A`` / ``_check_b`` skip), rebuilding the native
extension at each step via ``uv pip install --no-build-isolation -e .``
and re-running the pipeline bench (3 mesh sizes, modal solve,
``OMP_NUM_THREADS=4``, P-core-pinned). The walker lives at
:file:`perf/walk_history.py`; each per-commit JSON sits at
:file:`perf/snapshots/walk_.json` so the progression can be regenerated
without re-walking history.

**What to read.** Wall-time and peak RSS are per-size. ``Δrel vs #101``
is the relative deviation of the first natural frequency against the
earliest-commit value — the column that matters for the recurring "did
accuracy get worse?" question.
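Concretely, ``Δrel`` is just a relative deviation; a tiny sketch (the
hard-coded samples are the #101 and #102 rows of the 80×80×2 table
below):

.. code-block:: python

   def rel_dev(f1, f1_ref):
       """The "Δrel vs #101" column: |f1 - f1_ref| / |f1_ref|."""
       return abs(f1 - f1_ref) / abs(f1_ref)

   # #101 baseline vs #102 at 80x80x2
   delta = rel_dev(10.14569037717, 10.14569036801)  # ≈ 9.0e-10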
**80×80×2 (215 k DOFs) — where the progression is legible**

+------+---------+------------+------------+----------------+----------------+
| PR   | commit  | wall (s)   | peak MB    | f1 (Hz)        | Δrel vs #101   |
+======+=========+============+============+================+================+
| #101 | ee9645a | 3.87       | 521.7      | 10.14569036801 | 0              |
+------+---------+------------+------------+----------------+----------------+
| #102 | df08aba | 2.72       | 512.1      | 10.14569037717 | 9.0e-10        |
+------+---------+------------+------------+----------------+----------------+
| #103 | 4c32785 | 2.75       | 491.4      | 10.14569037486 | 6.8e-10        |
+------+---------+------------+------------+----------------+----------------+
| #104 | 5d0b452 | 2.61       | 489.7      | 10.14569036680 | 1.2e-10        |
+------+---------+------------+------------+----------------+----------------+
| #109 | 1787e1e | 2.61       | 489.2      | 10.14569037048 | 2.4e-10        |
+------+---------+------------+------------+----------------+----------------+
| #110 | ae29ac6 | 3.29       | 489.9      | 10.14569038142 | 1.3e-9         |
+------+---------+------------+------------+----------------+----------------+
| #112 | 87e6f82 | 2.26       | 483.6      | 10.14569037848 | 1.0e-9         |
+------+---------+------------+------------+----------------+----------------+
| #115 | 304cb89 | 2.52       | 483.5      | 10.14569037469 | 6.6e-10        |
+------+---------+------------+------------+----------------+----------------+
| #116 | 1b3e144 | 2.48       | 483.0      | 10.14569036991 | 1.9e-10        |
+------+---------+------------+------------+----------------+----------------+
| #113 | 8652ace | 2.57       | 483.0      | 10.14569036848 | 4.6e-11        |
+------+---------+------------+------------+----------------+----------------+

**48×48×2 (78 k DOFs)**

+------+---------+------------+------------+----------------+----------------+
| PR   | commit  | wall (s)   | peak MB    | f1 (Hz)        | Δrel vs #101   |
+======+=========+============+============+================+================+
| #101 | ee9645a | 0.76       | 304.2      | 13.19744738477 | 0              |
+------+---------+------------+------------+----------------+----------------+
| #102 | df08aba | 0.92       | 300.8      | 13.19744738477 | 6.1e-15        |
+------+---------+------------+------------+----------------+----------------+
| #103 | 4c32785 | 0.88       | 300.7      | 13.19744738477 | 7.0e-15        |
+------+---------+------------+------------+----------------+----------------+
| #104 | 5d0b452 | 0.66       | 297.7      | 13.19744738477 | 6.2e-15        |
+------+---------+------------+------------+----------------+----------------+
| #109 | 1787e1e | 0.69       | 297.4      | 13.19744738477 | 4.0e-15        |
+------+---------+------------+------------+----------------+----------------+
| #110 | ae29ac6 | 0.66       | 296.9      | 13.19744737985 | 3.7e-10        |
+------+---------+------------+------------+----------------+----------------+
| #112 | 87e6f82 | 0.80       | 298.3      | 13.19744738396 | 6.2e-11        |
+------+---------+------------+------------+----------------+----------------+
| #115 | 304cb89 | 0.84       | 297.9      | 13.19744738829 | 2.7e-10        |
+------+---------+------------+------------+----------------+----------------+
| #116 | 1b3e144 | 0.81       | 297.3      | 13.19744738594 | 8.9e-11        |
+------+---------+------------+------------+----------------+----------------+
| #113 | 8652ace | 0.73       | 297.8      | 13.19744738653 | 1.3e-10        |
+------+---------+------------+------------+----------------+----------------+

**32×32×2 (35 k DOFs) — Δrel dominated by machine precision**

At this size the first mode converges fully; Δrel stays below 1e-9 and
the jumps align cleanly with ARPACK's stochastic convergence floor. See
:file:`perf/snapshots/walk_*.json` for the full table.

What moved the needle
^^^^^^^^^^^^^^^^^^^^^

Ordered by magnitude of wall-time or RSS delta at 80×80×2:

- **#102** (``refactor(model): grid as sole source of truth``) — wall
  3.87 → 2.72 s (**−30 %**), RSS 521 → 512 MB. Eliminates parallel
  ``_nodes`` / ``_elements`` dicts that mirrored the pyvista grid. The
  dict→grid refactor removed one O(n_nodes) Python walk per assembly
  call.
- **#103** (``pardiso size_limit_storage=0``) — RSS 512 → 491 MB
  (**−4 %**). Stops pypardiso's internal full-``A.copy()`` cache.
- **#104** (``malloc_trim(0) at BC boundary``) — wall 2.75 → 2.61 s
  (**−5 %**). Returns glibc arena scratch between assembly and factor.
  Bigger wins at larger scales (>100 k DOFs) where the element-batch
  transient is larger.
- **#109** (``drop DOF layout caches``) — no measurable change at
  80×80×2. Scales with node count; modelled saving was ~0.6 KB per
  node, dominated by the Pardiso factor ceiling at this mesh.
- **#110** (``parallel C++ K/M assembly``) — wall 2.61 → 3.29 s at this
  size. Caveat: ``scikit-build-core`` does not always re-compile the C
  extension between walker checkouts (build cached at ~0.9 s); the
  number above may reflect the last-built ``.so`` rather than ``#110``'s
  actual parallel path. On PTR rotor-scale meshes (mixed-topology) the
  PR's own numbers show 6× speedup at 8 threads. Re-run the walker after
  ``rm -rf build/`` for a clean per-commit measurement.
- **#112** (``pardiso 1-based ia/ja cached``) — wall 3.29 → 2.26 s
  (**−31 %** vs the just-prior sample). Eliminates ~65 MB of alloc/free
  per solve × ~30 solves per ``modal_solve``.
- **#115** (``vectorize _assemble layout lookup``) — measured as
  slightly slower at 80×80×2 here; the win shows cleanly in the
  assembly-only bench (:meth:`_bc_reduce_and_release` is 25–40 % faster
  at mid-ladder per the PR #115 measurements) but the 3-second
  end-to-end pipeline is dominated by Pardiso and eigsh.
- **#116** (``pardiso skip _check_A/_check_b``) — modest; removes
  per-solve O(n) indptr scan. Wins scale with solve count.
- **#113** (``single-pass symmetric matvec from triu CSR``) — wall at
  this size is ~flat vs #116; on PTR rotor-scale where M @ x dominates
  eigsh iteration cost the PR reports clearer wins.
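The single-pass idea behind #113 is easy to state in NumPy terms. The
sketch below is a slow Python reference of the technique (the shipped
kernel is C++; ``sym_matvec`` and the toy matrix are illustrative):
store only the upper triangle and mirror each off-diagonal entry on the
fly, so the stored entries are swept exactly once.

.. code-block:: python

   import numpy as np
   from scipy.sparse import random as sparse_random, triu

   def sym_matvec(a_triu, x):
       """y = A @ x for symmetric A stored as its upper triangle in CSR.

       Single pass over the stored entries: each strictly-upper (i, j)
       contributes to both y[i] and y[j]; diagonal entries contribute
       once.
       """
       y = np.zeros_like(x)
       indptr, indices, data = a_triu.indptr, a_triu.indices, a_triu.data
       for i in range(a_triu.shape[0]):
           for k in range(indptr[i], indptr[i + 1]):
               j = indices[k]
               y[i] += data[k] * x[j]
               if j != i:          # mirror the strictly-upper entry
                   y[j] += data[k] * x[i]
       return y

   # check against the full symmetric matrix
   a = sparse_random(50, 50, density=0.1, random_state=0)
   S = a + a.T                     # symmetric test matrix
   x = np.random.default_rng(0).standard_normal(50)
   y = sym_matvec(triu(S, format="csr"), x)

An element- or row-parallel version of this loop needs atomic (or
per-thread) accumulation on the mirrored ``y[j]`` update, the same trade
the parallel CSR scatter in #110 makes with ``#pragma omp atomic``.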
Accuracy — did it regress?
^^^^^^^^^^^^^^^^^^^^^^^^^^

**No.** First-frequency Δrel vs the #101 baseline stays in the
``1e-15`` → ``1e-9`` range across the whole walk — well below the ARPACK
``tol=1e-12`` convergence floor, which on a plate modal problem gives
roughly ``1e-7`` of stochastic first-mode drift between independent
solver invocations. The jumps visible at individual commits (for example
``#110`` at 32×32×2 moving from ``8e-16`` to ``4.5e-11``) are ARPACK's
iteration count shifting by one or two, not a numerical regression; the
:math:`\Delta_{rel}` value oscillates rather than drifting monotonically
toward larger error.

If an MAPDL-comparison snapshot (see :file:`PERFORMANCE.md`) shows a
larger Δrel at the top of its size ladder — for example ``2e-7`` at
224×224×2 — that is the **ARPACK-vs-Lanczos** gap against the MAPDL
reference, not a progression regression: the absolute value sits at the
ARPACK tolerance floor for every post-#101 commit on the walker above.

Reproducing
^^^^^^^^^^^

.. code-block:: bash

   # Walk the default commit set (above) and regenerate snapshots.
   python perf/walk_history.py

   # Narrower walk, larger mesh ladder.
   python perf/walk_history.py \
       --commits 5d0b452 ae29ac6 87e6f82 1b3e144 \
       --sizes 64x64x2 96x96x2 128x128x2

Each commit writes :file:`perf/snapshots/walk_.json` (skipped if already
present so reruns are incremental). The walker auto-stashes local edits
and restores the starting branch on exit, so it's safe to run from any
branch.

The :file:`PERFORMANCE.md` changelog
------------------------------------

This page is a curated user-facing view. For the full dated progression
of every landing — including per-element micro numbers, older pipeline
sweeps, and the exact commit SHAs — see :file:`PERFORMANCE.md` in the
repository root.