Validation data store#

Note

Internal scope. This page documents how femorph-solver handles vendor-solver output binaries (the .rst / .full / .emat files a licensed Ansys / NASTRAN / Abaqus run produces against our own re-authored input decks). These binaries are tools for cross-checking correctness; they are not user-visible.

The complete operational rules — file-size discipline, versioning, upload workflow — live in the validation-data Claude skill at .claude/skills/validation-data.md. This page is architecture and rationale.

Why this exists#

The repo must contain zero vendor-authored content (per Provenance inventory, Fixtures and decks, and SOURCES.md). But a substantial fraction of the verification suite asserts on vendor-solver output — modal frequencies, displacement fields, stress recoveries — to confirm that femorph-solver agrees with the established state of the art when both run the same physics.

Those assertions need access to the vendor-output binaries on every CI run and on every developer machine. The options were:

  • Commit the binaries to the main repo. Rejected: an external IP audit grep’ing for vendor strings finds them in the committed RST headers; the boundary “no vendor content in main” stops being trivially provable.

  • Run a licensed Ansys in CI to regenerate the outputs on every build. Rejected: every CI build would need an Ansys license, a Docker image, and ~10 minutes of solver wall-time per case. Cost is prohibitive; CI latency would jump from minutes to hours.

  • Skip every vendor-residual test in CI; only run them on demand. Rejected: regressions in the residual comparisons go undetected.

  • Host the binaries in a private artifact store and fetch on demand. Selected.

The boundary#

| Artifact | Lives where | In main repo? |
| --- | --- | --- |
| Vendor verbatim input deck (Ansys VM, MSC VG, Abaqus example) | Local dev machine only (e.g. ~/projects/mapdl-re/vm2025r1_mapdl/) | Never |
| Hand-re-authored input deck (the vm5.dat pattern) | tests/interop/<vendor>/fixtures/ | Yes |
| Vendor-solver binary output (.rst / .full / .emat / .mode / .rdsp / .rths / .rfrq / .esav) | Cloudflare R2 bucket femorph-solver-validation-data | Never |
| Test code that asserts via the loader | tests/.../test_*.py | Yes |
| Per-case manifest (<solver>/<case>/manifest.yaml) | R2, at the per-case folder root | Never in main |

The boundary is sharp: an external auditor running find tests -name '*.rst*' -o -name '*.full*' -o -name '*.emat*' on the main repo returns nothing, ever. Vendor outputs exist; they live one git-clone-boundary removed from the solver code.

Why Cloudflare R2#

Three properties matter and R2 has all of them:

  1. Zero egress fees. CI runs pull blobs on every build, so egress-billed stores compound cost with build frequency. R2 charges $0/GB for egress at every usage tier.

  2. S3-compatible API. Standard tooling (boto3, s3fs, aws s3 cp, rclone) works unchanged. The loader uses boto3 against R2’s S3 endpoint; if we ever migrate to AWS S3 / Backblaze B2 / MinIO the change is one constant.

  3. Simple, scoped credentials. R2 API tokens can be bucket-restricted, time-limited, and read-only or read-write. Per-environment scoping (CI vs. dev) is a 30-second console action.

We considered:

  • GitHub Releases on a private companion repo: free, same auth model as the code, and the 2 GB per-asset cap never binds at our sizes. Strong runner-up. R2 won on ergonomics: GitHub’s release-asset UX is built for human-curated releases of compiled artifacts, not for hundreds of programmatic uploads.

  • Railway volume + caddy file-server — workable but adds always-on compute cost (~$5–10/mo) for a fundamentally object-store workload.
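Property 2 makes the storage client a single constructor. A minimal sketch, assuming the env-var names from the Authentication section below and Cloudflare's standard R2 S3 endpoint pattern:

```python
# Sketch of the loader's client construction; env-var names match the
# Authentication section, everything else is standard boto3.
import os

import boto3

R2_BUCKET = "femorph-solver-validation-data"


def r2_client():
    """S3-compatible client; read-only vs. read-write depends on the key pair."""
    account_id = os.environ["R2_ACCOUNT_ID"]
    return boto3.client(
        "s3",
        endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
        aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
    )
```

The endpoint URL is the only R2-specific line; pointing the same client at AWS S3 or MinIO means changing that one constant.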

Versioning#

The bucket is keyed by <solver>/<case-id>/... so two vendors’ outputs for the same case ID never collide (an Ansys MAPDL vm5 and a NASTRAN vm5 each get their own folder). Every upload writes to a new path under that prefix:

```
<solver>/<case-id>/manifest.yaml
<solver>/<case-id>/<YYYY-MM-DD>-<short-sha>/<blob>
```

The case manifest pins one path as the active_version; test code fetches by (solver, case_id) and reads the active pin from the manifest. Pinning lives in the manifest, not in the test, so a regenerated dataset’s pin update lands as a single PR touching the manifest — reviewable, atomic, and rollback-friendly.

The manifest also lists every input file that fed the run (the deck plus any .inc headers it /INPUTs) under an inputs: key, with each file’s path and SHA-256. At fetch time the loader does not re-verify those SHAs (the deck is in the repo and the auditor checks it; computing SHAs at test-collect time is wasted I/O), but a regeneration workflow that updates a deck must update its SHA in the manifest, so drift between deck and dataset is reviewable in the PR diff.
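A hedged sketch of these manifest fields and the blob key they resolve to; only active_version and inputs are documented above, and the example values are illustrative:

```python
# Sketch: resolving the pinned version from a per-case manifest.
#
# Example manifest.yaml (values illustrative):
#   active_version: 2025-06-01-a1b2c3d
#   inputs:
#     - path: tests/interop/ansys-mapdl/fixtures/vm5.dat
#       sha256: 9f86d081884c7d65...
import yaml


def active_blob_key(manifest_text: str, solver: str, case_id: str, blob: str) -> str:
    """Build <solver>/<case-id>/<active_version>/<blob> from the pinned version."""
    manifest = yaml.safe_load(manifest_text)
    return f"{solver}/{case_id}/{manifest['active_version']}/{blob}"
```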

The flow when a test’s expected baseline must change:

  1. Re-run the licensed Ansys against the verbatim vendor input deck (e.g. /ansys_inc/v150/ansys/data/verif/vmN.dat).

  2. python -m femorph_solver._validation_data_upload <case_id> --solver ansys-mapdl --input <repo-relative-deck> --blob <verbatim-vendor-deck> --blob <vendor-output> --generator-vendor ansys-mapdl --generator-version <vN> — computes SHA-256 of every blob, derives the immutable YYYY-MM-DD-<short-sha> version label, refuses to overwrite an existing version, writes blobs + manifest in one logical transaction, and prints the new pin.

  3. Test PR pins the test to the new active_version, updates the assertion’s expected band if the physics changed, lands.

The uploader enforces the per-blob size discipline at upload time (silent ≤ 10 MB; warns 10–50 MB; refuses 50–500 MB without --size-exception <reason>; hard cap at 500 MB). Tests don’t re-check sizes on fetch — the discipline lives at the upload boundary so CI cycles aren’t spent on it.
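A sketch of those tiers as the uploader might enforce them; the thresholds come from the skill doc, while the function name and messages are illustrative:

```python
# Sketch of the upload-time size tiers. Thresholds are the documented
# discipline; names and wording are illustrative.
import warnings

MB = 1024 * 1024


def check_blob_size(n_bytes: int, size_exception: str | None = None) -> None:
    if n_bytes <= 10 * MB:
        return  # silent
    if n_bytes <= 50 * MB:
        warnings.warn(f"blob is {n_bytes / MB:.0f} MB; consider shrinking it")
        return  # warned
    if n_bytes <= 500 * MB:
        if not size_exception:
            raise SystemExit("blob >50 MB requires --size-exception <reason>")
        return  # allowed with a recorded exception
    raise SystemExit("blob >500 MB: hard cap, reduce the asserted domain")
```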

The active_version is never overwritten: the previous pin remains fetchable forever, so a bisect or a regression investigation can fetch any historical baseline.

Fetch lifecycle in CI#

```
[pytest collect] → first call to fetch_or_skip("vm5", solver="ansys-mapdl", suffix=".rst")
                → check ~/.cache/femorph-solver/validation-data/ansys-mapdl/vm5/<active_version>/vm5.rst
                → cache miss
                → check R2 credentials in env
                → boto3 GET to R2 https://<account>.r2.cloudflarestorage.com/ansys-mapdl/vm5/<active_version>/vm5.rst
                → write to cache, then SHA-256 verify against the manifest entry
                → mismatch removes the cache entry and raises (no poisoned cache)
                → return Path
[pytest collect] → next call (vm5, .full)
                → cache hit (SHA matches the manifest, no R2 round-trip) → return Path
[next CI run]   → cache miss (new ephemeral runner) → re-fetch (free egress)
```

Local development uses the same code path; the cache persists across pytest runs so only the first invocation hits R2. rm -rf ~/.cache/femorph-solver/ is the “refresh everything” knob if a contributor suspects a corrupt cache.
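Condensed into code, the lifecycle might look like the sketch below; it reuses r2_client and R2_BUCKET from the earlier R2 sketch, and _manifest_for / _expected_sha are hypothetical helpers (the real module layout may differ):

```python
# Condensed sketch of the fetch lifecycle diagrammed above.
import hashlib
import os
from pathlib import Path

import pytest

CACHE = Path.home() / ".cache" / "femorph-solver" / "validation-data"


def fetch_or_skip(case_id: str, *, solver: str, suffix: str) -> Path:
    manifest = _manifest_for(solver, case_id)  # hypothetical helper, cached per case
    version = manifest["active_version"]
    blob = f"{case_id}{suffix}"
    local = CACHE / solver / case_id / version / blob
    if local.exists():
        return local  # cache hit: no R2 round-trip
    if "R2_ACCESS_KEY_ID" not in os.environ:
        pytest.skip("R2 credentials missing; skipping vendor-residual test")
    local.parent.mkdir(parents=True, exist_ok=True)
    key = f"{solver}/{case_id}/{version}/{blob}"
    r2_client().download_file(R2_BUCKET, key, str(local))
    if hashlib.sha256(local.read_bytes()).hexdigest() != _expected_sha(manifest, blob):
        local.unlink()  # a bad blob never stays in the cache
        raise RuntimeError(f"SHA-256 mismatch for {key}")
    return local
```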

Skip-if-absent#

A contributor without R2 credentials still runs the unit suite cleanly. fetch_or_skip(...) calls pytest.skip(...) with a clear message if R2 is unreachable or credentials are missing: the test is marked SKIPPED, not failed, and the rest of the suite proceeds.

This is intentional: the loader exists because vendor residuals are an additional pillar on top of the first-principles regression tests in tests/analytical/ and tests/validation/. Those pillars stand alone — they catch most regressions — and the vendor-residual layer is the third independent check.
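For the caller's side, a hypothetical vendor-residual test; fetch_or_skip is the documented entry point, while load_modal_frequencies, solve_modal, and the tolerance band are invented for illustration:

```python
# Hypothetical vendor-residual test shape; helper names are invented.
import pytest


def test_vm5_first_ten_modes_match_mapdl():
    rst = fetch_or_skip("vm5", solver="ansys-mapdl", suffix=".rst")
    vendor_hz = load_modal_frequencies(rst)[:10]  # hypothetical .rst reader
    ours_hz = solve_modal("tests/interop/ansys-mapdl/fixtures/vm5.dat")[:10]
    assert ours_hz == pytest.approx(vendor_hz, rel=1e-3)  # illustrative band
```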

Size discipline#

The size rules in the validation-data skill (≤ 10 MB silent / 10–50 MB warned / 50–500 MB exception-only / > 500 MB rejected) are not arbitrary. At our forecast corpus growth (roughly one new case per kernel feature landing, with ~3 binaries averaging 2 MB per case), an indefinitely growing corpus stays well inside the R2 free tier for several years. The same corpus with no size discipline could be 100 GB in months.

The right answer is almost always to reduce the asserted domain:

  • A 200×200 plate exercises the same kernel as a 20×20 plate; pick the smaller for the assertion’s purpose.

  • Stress recovery on the singular-point cluster of nodes is enough; the full nodal stress array is not.

  • Ten low-frequency modes are enough; the full 100-mode spectrum is excess.

The CLI’s --field-restrict-to-set / --mode-cap flags exist to make the small-version path ergonomic.

Authentication#

R2 uses S3-compatible access keys. They live in:

  • GitHub Actions: repo secrets R2_ACCOUNT_ID, R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY. The test workflow exports them to the test job.

  • Local dev: developer’s own ~/.envrc.femorph (or equivalent), gitignored. Same env-var names.

  • Loader: reads from env vars only. Bucket name is hardcoded.

Credentials are never in source, never in test fixtures, never in commit history. The upload CLI requires the write-scoped key pair; the fetch path uses the read-only pair when both are configured.

Audit posture#

Agent 4’s quarterly four-axis sourcing audit (workspace-level SOURCING_AUDIT.md) explicitly verifies:

  • find tests -name '*.rst*' -o -name '*.full*' -o -name '*.emat*' -o -name '*.mode*' returns empty in main repo.

  • The femorph-solver-validation-data bucket contains only blobs whose case manifest cites a re-authored input deck under tests/interop/<vendor>/fixtures/ — never a vendor verbatim input.

  • The pre-commit hook rejecting > 1 MB binaries committed to tests/ is wired and passing.

The audit trail of the audit trail.