Performance
===========

This page summarises where wall-time and memory go in femorph-solver's
modal + static pipelines and how each landing has pushed them down.
femorph-solver-only numbers are reproducible from the benchmarks in the
``perf/`` directory; the full changelog (including older rows and a
historical snapshot of femorph-solver vs Ansys MAPDL v22.2 captured by
the maintainer on a licensed install) lives in :file:`PERFORMANCE.md` at
the repository root. MAPDL is used there as a gold-standard reference
only — this repository does not contain code that runs MAPDL.

.. contents::
   :local:
   :depth: 2

Hardware and reproducers
------------------------

- CPU: Intel Core i9-14900KF
- OS: Linux 6.12.66 (NixOS)
- Python 3.13.9, NumPy + SciPy on OpenBLAS (upstream wheels)
- ``OMP_NUM_THREADS=4`` for end-to-end runs

For end-to-end rotor wallclock on a production-style mixed-topology deck,
see :ref:`ptr-modal-benches` below.

Reproducers:

.. code-block:: bash

   # micro-benchmarks (per-element kernels + plate assembly)
   pytest tests/benchmarks --benchmark-only -o python_files=bench_*.py

   # end-to-end modal pipeline sweep (femorph-solver only)
   PYTHONPATH=. .venv/bin/python perf/bench_pipeline.py

   # linear + eigen backend comparison
   PYTHONPATH=. .venv/bin/python perf/bench_solvers.py --size 60x60x2

   # strain post-processing + OpenMP scaling
   OMP_NUM_THREADS=N PYTHONPATH=. .venv/bin/python perf/bench_strain.py

Progression at a glance
-----------------------

Every row is a landing that moved the modal-pipeline numbers. Micro
columns are ``pytest tests/benchmarks`` medians on the 40×40×2 plate
(15 129 DOF); end-to-end columns come from ``perf/bench_pipeline.py`` on
the 100×100×2 plate (91 809 DOF).
.. list-table::
   :widths: 5 30 30 12 12 12 12 12
   :header-rows: 1

   * - #
     - Landing (date)
     - Headline change
     - ``assemble_k``
     - ``assemble_m``
     - modal (10 modes)
     - 40² peak RSS
     - 100² wall
   * - 0
     - Baseline (2026-04-20)
     - pure Python ``BᵀCB``
     - 619 ms
     - 420 ms
     - 1.89 s
     - —
     - —
   * - 1
     - HEX8 C++ (2026-04-20)
     - nanobind per-element Ke/Me
     - 94 ms
     - 85 ms
     - 1.34 s
     - —
     - —
   * - 2
     - HEX20 + 187 C++ (2026-04-21)
     - all 3D solids in C++
     - 88.7 ms
     - 84.7 ms
     - 1.29 s
     - —
     - —
   * - 3
     - Batch + vectorised (2026-04-21)
     - group-batched kernel + COO broadcast
     - 58.0 ms
     - 54.0 ms
     - 918 ms
     - 411 MB
     - 13.21 s
   * - 4
     - CSR-direct C++ (2026-04-21)
     - ``build_csr_pattern`` + ``scatter_into_csr``
     - 16.2 ms
     - 12.5 ms
     - 621 ms
     - 337 MB
     - —
   * - 5
     - Auto-CHOLMOD (2026-04-21)
     - ``linear_solver="auto"`` → supernodal Cholesky
     - ≈16 ms
     - ≈13 ms
     - 247 ms
     - 289 MB
     - 2.29 s
   * - 6
     - **Parallel assembly (2026-04-24)**
     - group-by-material-value + OMP scatter + unified C++ pattern
     - ≈18 ms
     - ≈20 ms
     - ≈240 ms
     - 289 MB
     - ≈2.2 s
   * -
     - **Δ row 5 vs row 0**
     -
     - **×39**
     - **×34**
     - **×7.7**
     -
     -

After row 5 the modal pipeline is **no longer assembly-bound or
solver-bound in Python** — ~80 % of wall lives inside CHOLMOD
factorisation + ARPACK orthogonalisation (BLAS/LAPACK).

Row 6's headline gains *do not show up on the single-material 40×40×2
plate* — that mesh is already one element group (all cells share a
single ``mat_id`` + ``real_id``), so the ``mat_id`` fragmentation fix in
row 6 is a no-op and the pattern is cached between K and M. The win
appears on **mixed-topology decks with per-cell** ``mat_id`` like the
Purdue Transonic Rotor (``ptr.cdb`` — 57 k DOFs, 5166 cells across
HEX8/186/186W/186P/187, one ``mat_id`` per cell), measured in
:ref:`assembly-thread-scaling` below.

Head-to-head vs MAPDL (2026-04-21)
----------------------------------

Full per-stage comparison on the HEX8 plate, 10-mode modal solve, swept
from 1 k to ~1.36 M DOF.
MAPDL v22.2 runs the same mesh (emitted as a CDB via
:mod:`mapdl_archive`) in Docker with ``-smp`` on 4 threads.

.. list-table::
   :widths: 15 12 14 14 16 16 15
   :header-rows: 1

   * - Mesh
     - DOF
     - femorph-solver wall
     - MAPDL wall
     - Speedup
     - femorph-solver peak
     - MAPDL peak
   * - 10×10×2
     - 1 089
     - 0.015 s
     - 3.68 s
     - ×250
     - 178 MB
     - 2 112 MB
   * - 40×40×2
     - 15 129
     - 0.249 s
     - 4.04 s
     - ×16
     - 290 MB
     - 2 112 MB
   * - 100×100×2
     - 91 809
     - 2.293 s
     - 6.45 s
     - ×2.8
     - 938 MB
     - 2 624 MB
   * - 150×150×4
     - 342 015
     - 11.59 s
     - 19.45 s
     - ×1.7
     - 3 882 MB
     - 6 011 MB
   * - 200×200×4
     - 606 015
     - 21.52 s
     - 33.87 s
     - ×1.6
     - 7 002 MB
     - 10 019 MB
   * - 300×300×4
     - 1 359 015
     - 52.64 s
     - 85.85 s
     - ×1.6
     - 16 065 MB
     - 19 851 MB

Observations
^^^^^^^^^^^^

- **Wall-time: femorph-solver finishes below MAPDL across the sweep on
  this benchmark.** The ratio settles at ~×1.6 from 200 k DOF upward —
  both codes are then bound by the BLAS cost of sparse Cholesky
  factorisation, so further narrowing stops.
- **Assembly: femorph-solver's K+M takes 3.2×–6.0× less wall-time than
  MAPDL's element-matrix CP block here.** At 100×100×2 femorph-solver is
  0.32 s vs MAPDL 1.91 s; at 300×300×4 it's 5.33 s vs 17.05 s.
- **Eigensolve: femorph-solver's shift-invert runs 1.37×–1.72× shorter,
  and the gap widens with N**. CHOLMOD's AMD ordering produces
  progressively less fill than MAPDL's LANB block-Lanczos factor on
  these regular meshes.
- **Peak memory: femorph-solver's RSS is 0.08× MAPDL's at 10×10×2,
  rising to 0.81× at 300×300×4.** MAPDL carries a ~2.1 GB floor of
  COMMON blocks + scratch; as N grows both codes converge on "factor
  fill-in".
- **First-frequency agreement drifts with mesh refinement up to
  100×100×2, then converges again at nz=4**. Δf₁ stays < 2 % for
  well-meshed cases; the elevated 6.8 % at 100×100×2 is a
  single-element-thick bending-dominated plate physics artefact
  (different ESF / shear defaults between the two HEX8 formulations),
  not a solver bug.
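For orientation, the eigensolve stage compared above is a shift-invert
generalized solve of :math:`K x = \lambda M x`. A minimal SciPy sketch on
a toy SPD pair (``modal_frequencies`` and the matrices here are
illustrative stand-ins, not femorph-solver's API):

.. code-block:: python

   import numpy as np
   from scipy.sparse import diags, identity
   from scipy.sparse.linalg import eigsh

   def modal_frequencies(K, M, n_modes=10):
       """Lowest natural frequencies (Hz) of K x = lambda M x.

       sigma=0 puts ARPACK in shift-invert mode targeting the smallest
       eigenvalues; the dominant cost is factorising K - sigma*M once
       (SuperLU by default in SciPy; a supernodal Cholesky such as
       CHOLMOD is the drop-in win discussed on this page).
       """
       w, _ = eigsh(K, k=n_modes, M=M, sigma=0, which="LM")
       return np.sqrt(np.abs(np.sort(w))) / (2.0 * np.pi)

   # toy stand-ins for assembled stiffness/mass: 1-D Laplacian vs identity
   n = 200
   K = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
   M = identity(n, format="csc")
   freqs = modal_frequencies(K, M, n_modes=4)

Each ARPACK iteration is then one solve against the cached factor, which
is why the factorisation backend dominates the modal wall-times above.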
.. _assembly-thread-scaling:

Thread scaling on a mixed-topology deck (PTR rotor)
---------------------------------------------------

The Purdue Transonic Rotor (``ptr.cdb`` — 18 954 nodes, 5 166 cells,
56 862 DOFs) exercises every assembly path the flat plate does not: four
element types in one mesh (3069 HEX20 hex, 497 HEX20 pyramid, 19 HEX20
wedge, 1581 TET10 tet), per-cell ``mat_id`` (1..5166) all pointing at
the same 304-stainless material, and a real-world factor that hits the
scipy multi-group pattern fallback.

Measurements below are cold ``stiffness_matrix`` / ``mass_matrix`` calls
on an Intel Core i9-14900KF, pattern cache invalidated before each run.
The "baseline (main)" column is ``main`` immediately before the T4-P11
landing — after the #102 / #103 / #104 perf series already in place — so
the delta reflects only the T4-P11 work, not the earlier
memory-reduction chain.

.. list-table::
   :widths: 28 12 12 12 22 14
   :header-rows: 1

   * - Stage / thread count
     - T=1
     - T=4
     - T=8
     - baseline (main, T=1/4/8)
     - Δ @ T=8
   * - ``stiffness_matrix`` (cold)
     - 322 ms
     - 182 ms
     - 127 ms
     - 648 / 440 / 364 ms
     - **×2.9**
   * - ``mass_matrix`` (cold)
     - 228 ms
     - 152 ms
     - 83 ms
     - 342 / 305 / 227 ms
     - **×2.7**

Scaling from T=1 to T=8 on the branch is **×2.5 on K** and **×2.7 on
M** — bounded by the pattern build (which is partially serial in Step A:
the row-to-element inverse index) and the memory-bandwidth ceiling on
the CSR scatter. Further wins would require either a parallel Step A,
larger meshes that amortise OMP region overhead better, or a cache-tiled
scatter.

Three independent fixes compose:

1. **Group by material value, not** ``mat_id``. Keying
   ``Model._assemble``'s per-group loop on the hashed material tuple
   collapses 5166 singleton groups (one per ``mat_id``) back down to the
   4 real element-type groups, restoring ``ke_batch`` / ``me_batch``
   batching on decks that assign a unique ``mat_id`` per cell. ~2× on
   its own, single-threaded.
2. **Unified C++ pattern builder for mixed-topology meshes.**
   ``_build_csr_pattern`` pads each group's DOF array to the max
   ``ndof_per_elem`` with ``-1`` sentinels and dispatches to the
   single-group C++ builder, which already ignores ``-1`` entries.
   Eliminates the scipy ``coo_array → tocsr`` fallback that had been
   materialising ~1.5 M COO entries and deduplicating them Python-side.

3. **OpenMP on the hot loops.** ``scatter_into_csr`` is element-parallel
   with ``#pragma omp atomic`` on shared CSR slots.
   ``build_csr_pattern`` Steps B and C run row-parallel with a
   per-thread marker buffer (classic FE row-colouring). Numerically
   bit-exact against the serial build on PTR — atomic adds preserve the
   output despite non-deterministic ordering.

Reproducer:

.. code-block:: bash

   pytest tests/benchmarks/bench_assembly.py \
       --benchmark-only -o python_files=bench_*.py \
       -k 'ptr_thread_scaling'

Strain post-processing (``Model.eel``)
--------------------------------------

The elastic-strain feature (``MAPDL S,EPEL`` equivalent) evaluates
``ε(ξ_node) = B(ξ_node) · u_e`` per element and optionally averages to
the nodes (``PLNSOL`` style). The batch kernel is OpenMP-parallel across
elements.

Raw C++ strain kernel scaling (HEX8 hex plate, median of 3 runs after
warm-up):

.. list-table::
   :widths: 20 15 12 12 12 12 12 15
   :header-rows: 1

   * - Mesh
     - n_elem
     - T=1
     - T=2
     - T=4
     - T=8
     - T=16
     - T=16 vs T=1
   * - 50×50×2
     - 5 000
     - 1.5 ms
     - 0.9 ms
     - 0.5 ms
     - 0.3 ms
     - 0.1 ms
     - ×15
   * - 100×100×2
     - 20 000
     - 5.2 ms
     - 2.7 ms
     - 2.0 ms
     - 1.0 ms
     - 0.5 ms
     - ×10
   * - 150×150×4
     - 90 000
     - 32.4 ms
     - 16.9 ms
     - 9.6 ms
     - 6.4 ms
     - 4.3 ms
     - ×7.5
   * - 200×200×4
     - 160 000
     - 58.0 ms
     - 31.0 ms
     - 18.2 ms
     - 14.1 ms
     - 7.8 ms
     - ×7.4

Full ``Model.eel`` wall (includes Python-side nodal averaging via
``np.searchsorted`` + ``np.add.at``):
.. list-table::
   :widths: 20 15 12 12 12 12 12
   :header-rows: 1

   * - Mesh
     - n_dof
     - T=1 avg
     - T=2 avg
     - T=4 avg
     - T=8 avg
     - T=16 avg
   * - 50×50×2
     - 23 409
     - 9.1 ms
     - 8.4 ms
     - 8.2 ms
     - 8.0 ms
     - 8.0 ms
   * - 100×100×2
     - 91 809
     - 36.7 ms
     - 34.9 ms
     - 33.2 ms
     - 32.4 ms
     - 33.4 ms
   * - 150×150×4
     - 342 015
     - 169.9 ms
     - 150.9 ms
     - 145.0 ms
     - 142.6 ms
     - 143.0 ms
   * - 200×200×4
     - 606 015
     - 308.9 ms
     - 287.2 ms
     - 280.5 ms
     - 285.9 ms
     - 292.5 ms

Threads help the strain kernel itself (~×7 at T=16 on 200×200×4) but
don't yet move the needle on ``Model.eel`` end-to-end — at T=1 on
200×200×4 the C++ batch is 58 ms out of 309 ms total, so the remaining
~250 ms lives in the unparallelised Python scatter/average. Vectorising
or moving the nodal-average reduction to C++ is the next opportunity.

.. _ptr-modal-benches:

End-to-end PTR modal wallclock
------------------------------

The Purdue Transonic Rotor represents the production-style modal
workload: 57 k DOFs, 5166 cells, mixed HEX8/186/186W/186P/187 topology,
per-cell ``mat_id`` (5166 distinct IDs all pointing at the same 304
stainless material). Running ``solve_modal(..., n_modes=12,
eigen_solver="arpack", linear_solver="auto")`` at ``OMP_NUM_THREADS=4``
on the hardware above:

.. list-table::
   :widths: 40 18 42
   :header-rows: 1

   * - Stack tip
     - Modal wallclock
     - Landings in play
   * - 2026-04-23 morning
     - ~14.6 s
     - Pre-Pardiso-autoresolver; SuperLU fallback.
   * - 2026-04-23 evening
     - ~4.6 s
     - Pardiso autoresolved; pre-T4-P11.
   * - 2026-04-24 (current)
     - **~3.0 s**
     - #110 (parallel assembly) + #112 (Pardiso ia/ja cache) +
       #87/#88/#95/#97/#98/#109 memory chain.

The per-stage breakdown at 2026-04-24 is ~0.5 s total on the assembly +
BC-reduce side (down from ~1.5 s pre-#110), ~1 s on the Pardiso factor,
and ~1.5 s across ARPACK's ~40 shift-invert iterations. The remaining
lever is the per-iteration cost — see the open "perf" PRs at any given
time for incremental work against it.

Reproducer:
.. code-block:: bash

   pytest tests/benchmarks/bench_solve.py::test_modal_solve_ptr_12_modes \
       --benchmark-only -o python_files=bench_*.py

PR #101 → #116 progression — hashed by commit
---------------------------------------------

*For developers auditing where each landing moved the needle.*

The table below is generated by walking every merge commit on ``main``
between PR #101 (the post-#94/#97/#98 memory-push snapshot) and PR #116
(pardiso ``_check_A`` / ``_check_b`` skip), rebuilding the native
extension at each step via ``uv pip install --no-build-isolation -e .``
and re-running the pipeline bench (3 mesh sizes, modal solve,
``OMP_NUM_THREADS=4``, P-core-pinned). The walker lives at
:file:`perf/walk_history.py`; each per-commit JSON sits at
:file:`perf/snapshots/walk_.json` so the progression can be regenerated
without re-walking history.

**What to read.** Wall-time and peak RSS are per-size. ``Δrel vs #101``
is the relative deviation of the first natural frequency against the
earliest-commit value — the column that matters for the recurring "did
accuracy get worse?" question.
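Concretely, ``Δrel`` is just a relative deviation; a tiny sketch (the
hard-coded samples are the #101 and #102 rows of the 80×80×2 table
below):

.. code-block:: python

   def rel_dev(f1, f1_ref):
       """The "Δrel vs #101" column: |f1 - f1_ref| / |f1_ref|."""
       return abs(f1 - f1_ref) / abs(f1_ref)

   # #101 baseline vs #102 at 80x80x2
   delta = rel_dev(10.14569037717, 10.14569036801)  # ≈ 9.0e-10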
**80×80×2 (215 k DOFs) — where the progression is legible**

+------+---------+------------+------------+----------------+----------------+
| PR   | commit  | wall (s)   | peak MB    | f1 (Hz)        | Δrel vs #101   |
+======+=========+============+============+================+================+
| #101 | ee9645a | 3.87       | 521.7      | 10.14569036801 | 0              |
+------+---------+------------+------------+----------------+----------------+
| #102 | df08aba | 2.72       | 512.1      | 10.14569037717 | 9.0e-10        |
+------+---------+------------+------------+----------------+----------------+
| #103 | 4c32785 | 2.75       | 491.4      | 10.14569037486 | 6.8e-10        |
+------+---------+------------+------------+----------------+----------------+
| #104 | 5d0b452 | 2.61       | 489.7      | 10.14569036680 | 1.2e-10        |
+------+---------+------------+------------+----------------+----------------+
| #109 | 1787e1e | 2.61       | 489.2      | 10.14569037048 | 2.4e-10        |
+------+---------+------------+------------+----------------+----------------+
| #110 | ae29ac6 | 3.29       | 489.9      | 10.14569038142 | 1.3e-9         |
+------+---------+------------+------------+----------------+----------------+
| #112 | 87e6f82 | 2.26       | 483.6      | 10.14569037848 | 1.0e-9         |
+------+---------+------------+------------+----------------+----------------+
| #115 | 304cb89 | 2.52       | 483.5      | 10.14569037469 | 6.6e-10        |
+------+---------+------------+------------+----------------+----------------+
| #116 | 1b3e144 | 2.48       | 483.0      | 10.14569036991 | 1.9e-10        |
+------+---------+------------+------------+----------------+----------------+
| #113 | 8652ace | 2.57       | 483.0      | 10.14569036848 | 4.6e-11        |
+------+---------+------------+------------+----------------+----------------+

**48×48×2 (78 k DOFs)**

+------+---------+------------+------------+----------------+----------------+
| PR   | commit  | wall (s)   | peak MB    | f1 (Hz)        | Δrel vs #101   |
+======+=========+============+============+================+================+
| #101 | ee9645a | 0.76       | 304.2      | 13.19744738477 | 0              |
+------+---------+------------+------------+----------------+----------------+
| #102 | df08aba | 0.92       | 300.8      | 13.19744738477 | 6.1e-15        |
+------+---------+------------+------------+----------------+----------------+
| #103 | 4c32785 | 0.88       | 300.7      | 13.19744738477 | 7.0e-15        |
+------+---------+------------+------------+----------------+----------------+
| #104 | 5d0b452 | 0.66       | 297.7      | 13.19744738477 | 6.2e-15        |
+------+---------+------------+------------+----------------+----------------+
| #109 | 1787e1e | 0.69       | 297.4      | 13.19744738477 | 4.0e-15        |
+------+---------+------------+------------+----------------+----------------+
| #110 | ae29ac6 | 0.66       | 296.9      | 13.19744737985 | 3.7e-10        |
+------+---------+------------+------------+----------------+----------------+
| #112 | 87e6f82 | 0.80       | 298.3      | 13.19744738396 | 6.2e-11        |
+------+---------+------------+------------+----------------+----------------+
| #115 | 304cb89 | 0.84       | 297.9      | 13.19744738829 | 2.7e-10        |
+------+---------+------------+------------+----------------+----------------+
| #116 | 1b3e144 | 0.81       | 297.3      | 13.19744738594 | 8.9e-11        |
+------+---------+------------+------------+----------------+----------------+
| #113 | 8652ace | 0.73       | 297.8      | 13.19744738653 | 1.3e-10        |
+------+---------+------------+------------+----------------+----------------+

**32×32×2 (35 k DOFs) — Δrel dominated by machine precision**

At this size the first mode converges fully; Δrel stays below 1e-9 and
the jumps align cleanly with ARPACK's stochastic convergence floor. See
:file:`perf/snapshots/walk_*.json` for the full table.

What moved the needle
^^^^^^^^^^^^^^^^^^^^^

Ordered by magnitude of wall-time or RSS delta at 80×80×2:

- **#102** (``refactor(model): grid as sole source of truth``) — wall
  3.87 → 2.72 s (**−30 %**), RSS 521 → 512 MB. Eliminates parallel
  ``_nodes`` / ``_elements`` dicts that mirrored the pyvista grid. The
  dict→grid refactor removed one O(n_nodes) Python walk per assembly
  call.
- **#103** (``pardiso size_limit_storage=0``) — RSS 512 → 491 MB
  (**−4 %**). Stops pypardiso's internal full-``A.copy()`` cache.
- **#104** (``malloc_trim(0) at BC boundary``) — wall 2.75 → 2.61 s
  (**−5 %**). Returns glibc arena scratch between assembly and factor.
  Bigger wins at larger scales (>100 k DOFs) where the element-batch
  transient is larger.
- **#109** (``drop DOF layout caches``) — no measurable change at
  80×80×2. Scales with node count; modelled saving was ~0.6 KB per
  node, dominated by the Pardiso factor ceiling at this mesh.
- **#110** (``parallel C++ K/M assembly``) — wall 2.61 → 3.29 s at this
  size. Caveat: ``scikit-build-core`` does not always re-compile the C
  extension between walker checkouts (build cached at ~0.9 s); the
  number above may reflect the last-built ``.so`` rather than ``#110``'s
  actual parallel path. On PTR rotor-scale meshes (mixed-topology) the
  PR's own numbers show 6× speedup at 8 threads. Re-run the walker after
  ``rm -rf build/`` for a clean per-commit measurement.
- **#112** (``pardiso 1-based ia/ja cached``) — wall 3.29 → 2.26 s
  (**−31 %** vs the just-prior sample). Eliminates ~65 MB of alloc/free
  per solve × ~30 solves per ``modal_solve``.
- **#115** (``vectorize _assemble layout lookup``) — measured as
  slightly slower at 80×80×2 here; the win shows cleanly in the
  assembly-only bench (:meth:`_bc_reduce_and_release` is 25–40 % faster
  at mid-ladder per the PR #115 measurements) but the 3-second
  end-to-end pipeline is dominated by Pardiso and eigsh.
- **#116** (``pardiso skip _check_A/_check_b``) — modest; removes
  per-solve O(n) indptr scan. Wins scale with solve count.
- **#113** (``single-pass symmetric matvec from triu CSR``) — wall at
  this size is ~flat vs #116; on PTR rotor-scale where M @ x dominates
  eigsh iteration cost the PR reports clearer wins.
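The single-pass idea behind #113 is easy to state in NumPy terms. The
sketch below is a slow Python reference of the technique (the shipped
kernel is C++; ``sym_matvec`` and the toy matrix are illustrative):
store only the upper triangle and mirror each off-diagonal entry on the
fly, so the stored entries are swept exactly once.

.. code-block:: python

   import numpy as np
   from scipy.sparse import random as sparse_random, triu

   def sym_matvec(a_triu, x):
       """y = A @ x for symmetric A stored as its upper triangle in CSR.

       Single pass over the stored entries: each strictly-upper (i, j)
       contributes to both y[i] and y[j]; diagonal entries contribute
       once.
       """
       y = np.zeros_like(x)
       indptr, indices, data = a_triu.indptr, a_triu.indices, a_triu.data
       for i in range(a_triu.shape[0]):
           for k in range(indptr[i], indptr[i + 1]):
               j = indices[k]
               y[i] += data[k] * x[j]
               if j != i:          # mirror the strictly-upper entry
                   y[j] += data[k] * x[i]
       return y

   # check against the full symmetric matrix
   a = sparse_random(50, 50, density=0.1, random_state=0)
   S = a + a.T                     # symmetric test matrix
   x = np.random.default_rng(0).standard_normal(50)
   y = sym_matvec(triu(S, format="csr"), x)

An element- or row-parallel version of this loop needs atomic (or
per-thread) accumulation on the mirrored ``y[j]`` update, the same trade
the parallel CSR scatter in #110 makes with ``#pragma omp atomic``.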
Accuracy — did it regress?
^^^^^^^^^^^^^^^^^^^^^^^^^^

**No.** First-frequency Δrel vs the #101 baseline stays in the
``1e-15`` → ``1e-9`` range across the whole walk — well below the ARPACK
``tol=1e-12`` convergence floor, which on a plate modal problem gives
roughly ``1e-7`` of stochastic first-mode drift between independent
solver invocations. The jumps visible at individual commits (for example
``#110`` at 32×32×2 moving from ``8e-16`` to ``4.5e-11``) are ARPACK's
iteration count shifting by one or two, not a numerical regression; the
:math:`\Delta_{rel}` value oscillates rather than drifting monotonically
toward larger error.

If an MAPDL-comparison snapshot (see :file:`PERFORMANCE.md`) shows a
larger Δrel at the top of its size ladder — for example ``2e-7`` at
224×224×2 — that is the **ARPACK-vs-Lanczos** gap against the MAPDL
reference, not a progression regression: the absolute value sits at the
ARPACK tolerance floor for every post-#101 commit on the walker above.

Reproducing
^^^^^^^^^^^

.. code-block:: bash

   # Walk the default commit set (above) and regenerate snapshots.
   python perf/walk_history.py

   # Narrower walk, larger mesh ladder.
   python perf/walk_history.py \
       --commits 5d0b452 ae29ac6 87e6f82 1b3e144 \
       --sizes 64x64x2 96x96x2 128x128x2

Each commit writes :file:`perf/snapshots/walk_.json` (skipped if already
present so reruns are incremental). The walker auto-stashes local edits
and restores the starting branch on exit, so it's safe to run from any
branch.

The :file:`PERFORMANCE.md` changelog
------------------------------------

This page is a curated user-facing view. For the full dated progression
of every landing — including per-element micro numbers, older pipeline
sweeps, and the exact commit SHAs — see :file:`PERFORMANCE.md` in the
repository root.