This document maps the principal numerical and empirical claims in the README to their evidence artifacts, reproduction commands, expected tolerances, and limitations. Per-arm statistics in the Empirical Evidence tables are sourced directly from the JSON artifacts listed below; only the summary-level claims are mapped here, not every individual table cell.
Methodology
Claims are extracted from README.md as of the current commit.
Each claim must have a JSON artifact in notebooks/results/ or a CI command that produces it.
Tolerances account for hardware-specific floating-point variation (see REPRODUCING.md §Expected Numerical Variation).
“Survives global BH” indicates whether the claim’s p-value survives investigation-wide Benjamini-Hochberg correction across all 76 comparisons (see STATISTICAL_SUMMARY.md).
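As a reference for what "survives global BH" means operationally, the following is a minimal sketch of the standard Benjamini-Hochberg step-up procedure. It is not the project's analysis script: the function name is hypothetical, and the level alpha = 0.05 is an assumption (it is consistent with the thresholds quoted for individual claims in this document).

```python
from typing import List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether it survives BH at level alpha.

    alpha = 0.05 is an assumed default, not a value stated in this document.
    """
    m = len(p_values)
    # Indices sorted by ascending p-value; ranks are 1-based.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank

    # Every comparison ranked at or below largest_k survives.
    survives = [False] * m
    for rank, idx in enumerate(order, start=1):
        survives[idx] = rank <= largest_k
    return survives
```

Because the rule is step-up, a p-value can survive even when it exceeds its own rank threshold, provided some larger-ranked p-value meets its threshold.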
Claim 1: “497 tests, 100% line coverage when run with full dependencies”
CI reports coverage on every push; 100% requires full dependencies including torch
Tolerance
Exact: 497 test functions, counted as the sum of the per-file counts printed by grep -c "def test_" tests/*.py (equivalently, cat tests/*.py | grep -c "def test_").
Limitation
CI installs .[dev] (no torch), so topogeoml/nn/ code paths are not exercised in CI and coverage is below 100% in that environment. 100% coverage is achieved when torch is installed (pip install -e ".[all]"). __init__.py files are omitted per pyproject.toml [tool.coverage.run]. Coverage is reported in CI but not gated because the torch-less CI environment cannot achieve 100%.
median_diff: 0.086 +/- 0.005; p_BH: 4.83e-3 (within a factor of 2)
Survives global BH
Yes (rank 28/76, threshold 1.84e-2)
Survives Bonferroni
No (threshold 6.58e-4)
Limitation
One dataset (NCI1), one configuration (1-layer, hidden_dim=32, 10 epochs). Does not replicate on MUTAG or PROTEINS at this configuration. Subsequent ablation (H008-c) showed the operative factor is the external residual, not the Hodge Laplacian.
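The figures for this claim can be cross-checked with a few lines of arithmetic. The sketch below assumes a significance level of alpha = 0.05, which is not stated here but reproduces both quoted thresholds, and it reads "factor of 2" as the interval from half to double the expected value. All function names are hypothetical.

```python
def within_abs(value: float, expected: float, tol: float) -> bool:
    """True if value lies within expected +/- tol."""
    return abs(value - expected) <= tol


def within_factor(value: float, expected: float, factor: float) -> bool:
    """True if value lies between expected / factor and expected * factor."""
    return expected / factor <= value <= expected * factor


def bh_threshold(rank: int, m: int, alpha: float = 0.05) -> float:
    """Benjamini-Hochberg threshold for the p-value at a given 1-based rank."""
    return rank / m * alpha


def bonferroni_threshold(m: int, alpha: float = 0.05) -> float:
    """Bonferroni threshold across m comparisons."""
    return alpha / m


# Thresholds quoted above: rank 28 of 76 comparisons (alpha = 0.05 assumed).
print(f"{bh_threshold(28, 76):.2e}")      # 1.84e-02
print(f"{bonferroni_threshold(76):.2e}")  # 6.58e-04
```

With these thresholds, a p-value of 4.83e-3 falls below the BH threshold but above the Bonferroni threshold, which matches the Yes/No entries above.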
Claim 3: “topology-aware message passing with external residual outperforms MLP by 8-10 pp”
MUTAG: gap +0.098, p = 4.53e-6; PROTEINS: gap +0.088, p = 1.41e-4; NCI1: gap +0.071, p = 1.93e-5
Survives global BH
Yes (all three)
Limitation
These p-values are from the Hodge-vs-class-prior comparison within the H006 resolver, not the Hodge-vs-MLP comparison in the raw JSON. The class prior is the theoretical baseline (majority-class accuracy), not the MLP’s constant-feature accuracy.
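To make the baseline concrete, here is a minimal sketch of how a class-prior (majority-class) accuracy and the gap over it could be computed. The function names are hypothetical and any labels passed in are the caller's; this is not the H006 resolver's code.

```python
from collections import Counter
from typing import Sequence


def class_prior_accuracy(labels: Sequence[int]) -> float:
    """Accuracy of always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def gap_over_prior(model_accuracy: float, labels: Sequence[int]) -> float:
    """Difference between a model's accuracy and the class-prior baseline."""
    return model_accuracy - class_prior_accuracy(labels)
```

For example, with three labels of one class and two of the other, the class prior is 0.6, so a model at 0.7 accuracy has a gap of 0.1. A gap measured against an MLP's accuracy would generally differ, which is the distinction this limitation draws.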
Claim 6: “100% coverage on the library and benchmark framework”
Locally with full dependencies: pip install -e ".[all]" && pytest --cov=topogeoml --cov=benchmarks --cov-fail-under=100
Limitation
This gate is enforceable only with full dependencies (including torch). CI installs .[dev] (no torch) and reports coverage without gating it. The 100% claim applies to the full-dependency environment only.
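One way to apply the gate only where it can pass is to detect torch at run time and add the threshold flag conditionally. This is a hedged sketch, not the project's CI configuration. The helper names are hypothetical, and the pytest flags are the ones from the command above.

```python
import importlib.util
from typing import List


def torch_available() -> bool:
    """True if the torch package can be imported in this environment."""
    return importlib.util.find_spec("torch") is not None


def build_pytest_command(full_deps: bool) -> List[str]:
    """Build the coverage command, gating at 100% only with full dependencies."""
    cmd = ["pytest", "--cov=topogeoml", "--cov=benchmarks"]
    if full_deps:
        # The gate is only achievable when the torch code paths are exercised.
        cmd.append("--cov-fail-under=100")
    return cmd
```

The resulting list can be passed to subprocess.run(build_pytest_command(torch_available()), check=True), so that a torch-less environment still reports coverage without failing on the threshold.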
Claim 7: “preregistered hypothesis series (H001-H011, 50+ falsifiable sub-predictions)”
git log --format="%H %ai" -- docs/hypotheses/HYPOTHESIS-008-gin-gat-comparison.md | tail -1 prints the earliest commit that touched the hypothesis document; its timestamp should precede the timestamp of the corresponding experiment result. Replace the filename with any hypothesis document to verify.
Limitation
Hypothesis selection was sequential (each hypothesis was informed by the results of the previous one). STATISTICAL_SUMMARY.md §4 acknowledges this and characterizes it as legitimate sequential testing rather than p-hacking.
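The timestamp comparison for this claim can be automated. The sketch below parses a line in the "%H %ai" format used by the git log command above and compares it with a result timestamp supplied by the caller. Where that timestamp comes from is an assumption left open here, and all function names are hypothetical.

```python
import subprocess
from datetime import datetime, timezone
from typing import Tuple

# Layout of git's %ai author date, e.g. "2024-01-02 03:04:05 +0000".
GIT_AI_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def parse_log_line(line: str) -> Tuple[str, datetime]:
    """Split a '%H %ai' line into (commit hash, timezone-aware datetime)."""
    commit_hash, timestamp = line.strip().split(" ", 1)
    return commit_hash, datetime.strptime(timestamp, GIT_AI_FORMAT)


def first_commit_line(path: str) -> str:
    """Return the '%H %ai' line of the earliest commit touching path."""
    out = subprocess.run(
        ["git", "log", "--format=%H %ai", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if not out:
        raise ValueError(f"no commits found for {path}")
    # git log lists newest first, so the last line is the earliest commit.
    return out.splitlines()[-1]


def preregistered(log_line: str, result_time: datetime) -> bool:
    """True if the hypothesis commit precedes the (timezone-aware) result time."""
    _, committed = parse_log_line(log_line)
    return committed < result_time
```

A call such as preregistered(first_commit_line(path), result_time) then returns True when the document was committed before the result, where result_time must be a timezone-aware datetime, for example one built with tzinfo=timezone.utc.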
Claims not yet independently validated
The following claims have not been reproduced outside the original compute environment:
All per-seed accuracies (hardware-dependent floating-point variation expected)
The investigation-wide BH analysis (computed from the archived JSON artifacts; a third party should re-run the analysis script to verify)