Performance — profiling, snapshots, the perf tracker
=======================================================

How performance work flows through the repo: where to profile, how to capture
a snapshot, when to commit one to the perf tracker, and how to read the
running history.

The repo treats performance as a **first-class artefact** — every optimisation
that lands gets a dated row in ``PERFORMANCE.md`` so the history stays
visible. Old rows are not collapsed; the point is to see numbers move over
time.

.. contents:: Page contents
   :local:
   :depth: 2

Two harnesses
-------------

Performance numbers come from one of two harnesses, both checked into the
repo:

.. list-table::
   :header-rows: 1
   :widths: 22 30 48

   * - Harness
     - Path
     - Drives
   * - **Micro / assembly**
     - ``tests/benchmarks/``
     - ``pytest --benchmark-only`` over the standard 2 × 40 × 40 SOLID185
       flat plate (5 043 nodes, 3 200 hex cells, 15 129 DOF — see
       ``tests/integration/_flat_plate.py``). Per-element kernels run on a
       single unit cube / unit tet.
   * - **End-to-end pipeline**
     - ``perf/bench_pipeline.py``
     - A parameterised SOLID185 flat plate (clamped at ``X = 0``, 10 modes)
       driven through the full ``Model.solve_modal`` pipeline with per-stage
       timing and peak-RSS capture.

The micro harness reports median wall time per round (≥ 5 rounds via
pytest-benchmark) so the noise floor is bounded. The pipeline harness reports
a per-stage breakdown so a slowdown landing in one stage is visible without
re-running the whole suite.

The perf tracker — ``PERFORMANCE.md``
----------------------------------------

A markdown changelog at the repo root. Every meaningful perf change gets a
new dated section.

What goes in:

* **Each landed optimisation** — date, PR ref, the metric it moved (assembly
  time, peak RSS, modal solve, …), before / after numbers.
* **Cross-platform / cross-backend baselines** — when a new solver backend
  lands, capture its numbers against the default so future comparisons start
  from a known floor.
* **Regression diagnoses** — when a regression surfaces, the diagnosis lands
  in ``PERFORMANCE.md`` alongside the fix so the trail is auditable.

What does **not** go in:

* Speculative numbers from a half-implemented PR. Wait until it merges,
  capture the final number, then write the row.
* Numbers from a non-standard machine without an explicit call-out. The
  flat-plate baseline assumes the maintainer's reference machine; community
  contributions noting different hardware go in a separate "external timings"
  subsection.

How to capture a snapshot
-------------------------

Snapshots are dated markdown files under ``perf/snapshots/`` that pin a
moment-in-time set of numbers. They're cited from ``PERFORMANCE.md`` rows
that need a richer breakdown than a one-line row can carry.

Naming: ``perf/snapshots/<short-name>_<date>.md`` (the short name describes
the change — ``mem_after_triu_k``, ``solvers_baseline``, etc.).

When to take one:

* The change touches the **assembly inner loop** or any path that runs once
  per element. Even a 1 % shift compounds across millions of elements.
* The change moves **peak RSS** by more than 5 %.
* A **new solver backend** lands.

To take one:

.. code-block:: bash

   # Micro / per-element kernel times
   pytest tests/benchmarks --benchmark-only \
       --benchmark-save=<name> \
       -o python_files="bench_*.py"

   # End-to-end pipeline (timing + peak RSS)
   python perf/bench_pipeline.py --output perf/snapshots/<short-name>_$(date +%F).md

The ``bench_pipeline`` output is a markdown table; drop it into a file under
``perf/snapshots/``, add a one-paragraph summary at the top citing the change
that moved the numbers, then link from a row in ``PERFORMANCE.md``.

The trend tracker
-----------------

``perf/trend/`` holds longer-running series — one file per metric, appended
to over many releases.

Use when:

* The metric you're tracking has more than two data points and belongs on a
  chart.
* The series spans multiple PRs (e.g. assembly time over a refactor that
  lands in three steps).

The format is one row per measurement with date, commit, value, and any
notes. ``perf/trend/README.md`` codifies the conventions.

Profiling
---------

Three tools cover what the team profiles:

* ``cProfile`` / ``pyinstrument`` — Python-side bottlenecks, per-function
  timing. Wrap the smallest reproducer that shows the slowdown; don't profile
  the whole test suite.
* ``py-spy`` — sampling profiler for processes you don't want to restart.
  Useful for diagnosing a stuck ``solve_modal`` that's been running for
  hours.
* ``memray`` — peak-RSS and allocation tracking. The pipeline harness already
  wraps this when invoked with ``--memory``.

Don't commit profile traces. Save them under ``/tmp/`` or your scratch dir;
if the diagnosis is interesting, distil it to a written summary in the PR
description and a ``PERFORMANCE.md`` row.
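For orientation, typical invocations look like the following sketch. None of
it is repo tooling: ``<pid>`` stands for the stuck process and
``repro_assembly.py`` for your smallest reproducer.

.. code-block:: bash

   # Per-function timing on the smallest reproducer (pyinstrument)
   pyinstrument -r html -o /tmp/assembly_profile.html repro_assembly.py

   # Sample a running process without restarting it (py-spy)
   py-spy dump --pid <pid>                            # one-shot stack of every thread
   py-spy record -o /tmp/solve_modal.svg --pid <pid>  # sample into a flamegraph until Ctrl-C

   # Allocation tracking outside the harness's --memory mode (memray)
   memray run -o /tmp/repro.bin repro_assembly.py
   memray flamegraph /tmp/repro.bin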
The release-readiness regression suite
---------------------------------------

The ``release-readiness`` GitHub Action (see
``.github/workflows/release-readiness.yml``) runs on every push to ``main``
and emits a diff against the previous tagged release's pipeline-bench
numbers. Regressions ≥ 10 % block the docs deploy until investigated.

When a regression alert fires:

1. Pull the most recent snapshot under ``perf/snapshots/`` for the regressed
   metric.
2. Bisect with ``git bisect``, using ``perf/bench_pipeline.py`` as the test
   target (see the sketch at the end of this page).
3. Land the diagnosis in ``PERFORMANCE.md`` alongside the fix.

Common pitfalls
---------------

* **Testing on a debug build.** Always use the optimised install
  (``uv pip install -e .``). A debug build's numbers bear no relation to
  production.
* **Letting the scheduler interfere.** Before benchmarking, pin the process:
  ``taskset -c 0 ...`` on Linux. Background noise easily moves a 1-second
  benchmark by 10 %.
* **Using a stale virtualenv after dependency changes.** Reinstall
  (``uv pip install -e .``) after touching ``pyproject.toml`` so the
  benchmark sees the same wheel as CI does.
* **Capturing a single run.** The micro harness runs ≥ 5 rounds because
  single-run noise is real. Don't paste a one-shot number into
  ``PERFORMANCE.md``.
* **Not citing the change that moved the numbers.** Every perf row references
  the PR / commit that caused the move. A row without provenance is folklore.

Where things live
-----------------

.. list-table::
   :header-rows: 1
   :widths: 32 68

   * - Concern
     - Path
   * - Running tracker
     - ``PERFORMANCE.md`` (repo root)
   * - Micro / assembly benchmarks
     - ``tests/benchmarks/`` (pytest-benchmark targets)
   * - End-to-end pipeline harness
     - ``perf/bench_pipeline.py``
   * - Snapshots
     - ``perf/snapshots/<short-name>_<date>.md``
   * - Trend series
     - ``perf/trend/<metric>.md``
   * - Latest run-of-record
     - ``perf/latest_<name>.md``
   * - Release-readiness CI
     - ``.github/workflows/release-readiness.yml``
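To make step 2 of the regression workflow concrete, here is a minimal bisect
sketch. Nothing in it exists in the repo: the commit bounds are placeholders,
and ``check_threshold.py`` is a hypothetical helper that exits non-zero when
the regressed metric in the bench output crosses your threshold.

.. code-block:: bash

   # Walk the suspect range; `git bisect run` marks a commit bad whenever the
   # command exits non-zero. Reinstall per commit so each build sees its own
   # wheel (see "Common pitfalls" above).
   git bisect start <bad-commit> <good-commit>
   git bisect run sh -c '
       uv pip install -e . &&
       python perf/bench_pipeline.py --output /tmp/bisect.md &&
       python check_threshold.py /tmp/bisect.md
   '
   git bisect reset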