Conditioning and performance
============================

Two related questions every user asks at some point: "why is this matrix so
ill-conditioned?" and "why is the solve so slow?" The answers usually overlap.

Matrix conditioning
-------------------

The condition number :math:`\kappa(\mathbf{K}) = \lambda_{\max} / \lambda_{\min}`
of the symmetric-positive-definite stiffness is the fundamental measure of the
solve's numerical sensitivity. Above roughly :math:`10^{10}` most direct
backends produce noticeable rounding error; above :math:`10^{14}` they fail
outright. Sources of poor conditioning, in descending order of frequency:

* **Geometry / material disparity.** A model that mixes a steel beam with a
  soft-rubber pad inside the same solve routinely hits :math:`\kappa \sim 10^9`:
  the stiffness ratio goes straight into the eigenvalue spread. Solve the two
  parts as separate problems where possible.
* **Near-zero-Jacobian elements.** The local stiffness blows up as
  :math:`\det(\mathbf{J}) \to 0`; the global condition number follows.
  See :doc:`mesh-quality`.
* **Insufficient BCs.** An under-constrained model has a near-zero eigenvalue
  (the would-be rigid-body mode), so the condition number is effectively
  infinite. The :doc:`bc-pitfalls` rigid-body-mode check catches this before
  the solve.
* **Sliver elements** (very thin cells next to thick neighbours) cause a
  local-to-global stiffness mismatch. Refine or split them.

Reading the diagnostic
~~~~~~~~~~~~~~~~~~~~~~

When a direct solve fails or returns garbage, the linear backend reports a
condition-number estimate (often as ``rcond``, the reciprocal). Pardiso,
CHOLMOD, and MUMPS all expose this through the ``Result.diagnostics`` dict.
An ``rcond`` below ``1e-10`` is the same signal as :math:`\kappa > 10^{10}`.
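A minimal sketch of checking that estimate after a solve. It assumes
``Model.solve_static`` returns a ``Result`` and that the estimate sits under an
``"rcond"`` key in ``Result.diagnostics``; the exact key name may differ per
backend, and ``model`` stands in for an already-assembled model.

.. code-block:: python

   import warnings

   # ``model`` is an already-assembled femorph-solver Model (setup elided).
   result = model.solve_static(linear_solver="pardiso")

   # The "rcond" key is an assumption; check your backend's diagnostics keys.
   rcond = result.diagnostics.get("rcond")
   if rcond is not None and rcond < 1e-10:
       # Same signal as kappa > 1e10: expect noticeable rounding error.
       warnings.warn(f"Ill-conditioned stiffness: rcond = {rcond:.2e}")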
Performance tuning
------------------

For the linear-elastic structural slice femorph-solver ships today, the
wall-time bottleneck is almost always the linear factorisation. The factor
cost depends on three things:

1. **Backend choice** (Pardiso / CHOLMOD / MUMPS / SuperLU).
2. **Thread count** (``OMP_NUM_THREADS`` and friends).
3. **Matrix structure** (DOF count and connectivity / fill-in).

When to switch backends
~~~~~~~~~~~~~~~~~~~~~~~

* **Default** — femorph-solver's auto-chain picks the fastest installed
  backend (Pardiso > CHOLMOD > MUMPS > SuperLU). Most workflows don't need to
  override it.
* **Pardiso** — fastest on Intel-class CPUs (the BLAS-pinned configuration
  that ARC pods and most laptops use). Its two-pass factorisation outperforms
  CHOLMOD on densely connected meshes (3-D solid models).
* **CHOLMOD** — competitive with Pardiso on sparse-graph models (beam-shell
  meshes), but pays a constant fill-in overhead that Pardiso doesn't. Pick
  CHOLMOD when AMD CPUs are in play and Pardiso's MKL dependency is an issue.
* **MUMPS** — preferred for out-of-core solves (models whose factor doesn't
  fit in RAM); supports disk spillover. Slower than Pardiso / CHOLMOD on
  in-RAM problems.
* **SuperLU** — fallback only. Pure SciPy, no external dependency, roughly
  2-3× slower than the optimised backends.

Switch via ``Model.solve_static(linear_solver="cholmod")`` etc. See
:doc:`/reference/solvers/linear_backends` for the full backend-selection API.

Thread pinning
~~~~~~~~~~~~~~

The library reads ``OMP_NUM_THREADS``, ``MKL_NUM_THREADS``, and
``OPENBLAS_NUM_THREADS`` for BLAS-pool sizing. CI pods pin all three to ``4``
because the cgroup quota gives ~4 vCPUs; larger pools just thrash from
scheduler contention. Local guidance:

* **Laptop / 8-core workstation** — set ``OMP_NUM_THREADS=4`` for most
  workloads. Leave 4 cores for the OS / your IDE.
* **Beefy workstation / cluster node (32+ cores)** — set ``OMP_NUM_THREADS=8``
  or ``16``; diminishing returns above 16 for typical structural-mechanics
  K matrices.

Modal solves
~~~~~~~~~~~~

The eigensolver dominates wall-time on modal analyses.
``scipy.sparse.linalg.eigsh`` calls the linear backend once per Lanczos
iteration, so eigensolver speed is largely set by the linear-backend choice,
and all of the guidance above applies to modal solves in the same way. For
very large models (> 1 M DOF) consider
``Model.solve_modal(eigen_solver="lobpcg")``, which avoids the explicit
factorisation but converges more slowly on clustered low frequencies.

Memory
------

Direct factor memory scales as :math:`O(N^{1.5})` for 3-D solid meshes
(Saad 2003, §3.6). For 100 k-cell HEX8 meshes that is typically 8-15 GB,
which fits in workstation RAM. Beyond 500 k cells the :math:`N^{1.5}` scaling
puts the factor at roughly ten times that, well past 32 GB, and you'll need
to either move to MUMPS out-of-core or drop to an iterative backend (CG with
an AMG preconditioner, currently shipped via ``linear_solver="cg+pyamg"``).
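A minimal sketch of that fallback, assuming ``model`` is an already-assembled
``Model``. The ``linear_solver`` values are the ones named above, but whether
MUMPS spills its factor to disk automatically or needs extra configuration is
backend-specific; treat this as illustrative rather than a recipe.

.. code-block:: python

   # ``model`` is an already-assembled femorph-solver Model (setup elided).

   # Option 1: stay with a direct solver and let MUMPS spill the factor to disk.
   result = model.solve_static(linear_solver="mumps")

   # Option 2: avoid the large factor entirely with CG + an AMG preconditioner.
   result = model.solve_static(linear_solver="cg+pyamg")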