
Mutation testing — methodology and gate

Why this exists

Coverage measures what code lines run; mutation testing measures what bugs the suite catches. A line covered by a test that doesn't assert anything still counts as covered. Mutation testing closes that gap by systematically introducing small bugs (mutating an operator, a constant, a return value) and re-running the suite; a mutant that breaks no test is a survived mutant — by definition, an under-tested code path.
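As a minimal illustration of that gap, the sketch below uses a hypothetical function and two hypothetical tests (none of them taken from this project). Both tests execute the same line, so both give it full line coverage, but only the one that pins the boundary value notices when the operator is mutated.

```python
def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """A typical operator mutant: `>=` replaced by `>`."""
    return age > 18


def weak_test(fn) -> bool:
    """Executes the line (so it counts as covered) but asserts nothing."""
    fn(18)
    return True


def strong_test(fn) -> bool:
    """Pins the boundary value, so a changed comparison is detected."""
    return fn(18) is True and fn(17) is False


# The weak test passes for both versions: the mutant survives.
print(weak_test(is_adult), weak_test(is_adult_mutant))
# The strong test passes for the original and fails for the mutant: killed.
print(strong_test(is_adult), strong_test(is_adult_mutant))
```

A surviving mutant here points directly at the missing assertion, which is the signal line coverage alone cannot give.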

This file is the public, auditable record of the project's mutation score. It is regenerated by .github/workflows/mutation.yml on a weekly schedule; the gate is enforced in CI.

Gate

Aspirational: ≥ 95% kill rate on every release. The 5% buffer absorbs equivalent-mutant noise — mutations that produce semantically identical code (e.g. x and y -> y and x on commutative operations, constants only used as a sentinel comparison).

Current state (v1.1.6): measurement-first. The gate threshold is temporarily 0.0: we record the measured kill rate per module without failing the workflow on it, so we can establish an honest baseline. As of the v1.1.6 release (following the 2026-04-28 uplift runs), 4 of 6 modules have a measured kill rate above zero: cache 82 %, proteinchem 92 %, client 70 %, and formatters 62 % (partial, measured on the first 832 of 2097 mutants). These numbers are reconciled against the per-module tables below. The two largest modules (formatters, server) timed out mid-pass on their longest runs and need bisection work before they can be fully measured. The matrix workflow (.github/workflows/mutation.yml) runs weekly and on demand; subsequent runs incrementally close gaps. Once every module is fully measured and targeted uplift PRs have moved the kill rate toward ≥ 95 %, the gate will be raised to 0.95 and enforced (v1.2.0 target).

If the eventual gate fails, the workflow fails and the PR cannot merge to main. The remediation is to write a test, not to lower the gate.
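The gate arithmetic can be sketched as follows. This is an illustration only, not the workflow's actual script: the function names are hypothetical, and treating a module with zero generated mutants as passing is an assumption. The killed/total figures are the ones reported in the v1.1.2-uplift table.

```python
from typing import Optional


def kill_rate(killed: int, total: int) -> Optional[float]:
    """Raw kill rate as a fraction, or None when no mutants were generated."""
    if total == 0:
        return None
    return killed / total


def gate_passes(killed: int, total: int, threshold: float) -> bool:
    """Assumption: a module with no mutants passes trivially."""
    rate = kill_rate(killed, total)
    return rate is None or rate >= threshold


# (killed, total) per module, from the v1.1.2-uplift table
results = {"cache": (23, 28), "proteinchem": (228, 249), "client": (259, 370)}

for module, (killed, total) in results.items():
    rate = kill_rate(killed, total)
    print(
        f"{module}: {rate:.1%} "
        f"measurement-first={gate_passes(killed, total, 0.0)} "
        f"enforced={gate_passes(killed, total, 0.95)}"
    )
```

With the threshold at 0.0 every module passes; at 0.95 none of the three would, which is why the gate is raised only after the uplift work.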

Tool

mutmut 2.x (pinned). Selected over cosmic-ray and mutpy for ergonomics and CI ecosystem. mutmut 3.x changed the CLI substantially (no --paths-to-mutate flag, removed mutmut html, config-only via [tool.mutmut]); we pin to 2.x until the 3.x ecosystem settles.

Scope

  • src/uniprot_mcp/ — every module, measured per-module via the Actions matrix workflow. Per-module results land in the table below.
  • Test suite for kill-detection: pytest tests/unit tests/property tests/client tests/contract (the offline suite — mutmut runs must not contact the live UniProt API; integration tests are excluded).
  • --use-coverage: mutmut 2.x reads existing coverage data and mutates only the lines the suite actually executes, so no time is spent on mutants that no test could kill. This gives a large wall-time reduction over mutating every line.

Excluded mutations

  • __version__ is sourced from importlib.metadata and is not meaningful to mutate.
  • String literals in user-facing error messages are excluded; mutating them produces changes the suite can't reasonably catch (errors containing "valid" -> "valig" don't change behaviour). Mutmut's defaults handle this correctly.

Local development

mutmut 3.x does not support Windows (issue 397). On Windows, either:

  • Use WSL: wsl bash -c "cd /mnt/c/TOPOLOGICA/UNIPROT_MCP && pip install 'mutmut<3' && mutmut run --paths-to-mutate=src/uniprot_mcp"
  • Or defer to the CI workflow.

On macOS and Linux:

pip install -e ".[test,dev]"
pip install "mutmut<3"   # stay on 2.x; --paths-to-mutate does not exist in 3.x
mutmut run --paths-to-mutate=src/uniprot_mcp
mutmut results          # listing of survived mutants
mutmut show <mutant_id> # diff of a specific surviving mutant

Scores

Per-module kill rates from the matrix workflow. Each row is one parallel job in .github/workflows/mutation.yml. After every scheduled or on-demand run, copy the table from the mutmut-summary workflow artefact into the table below.

v1.1.2 + client uplift (run 25072369933, 2026-04-28 19:08 UTC)

After the proteinchem uplift below, a second targeted uplift PR (branch fix/client-mutation-uplift) was applied to client.py in two phases. The trajectory is recorded honestly because each phase revealed a different gap in the test surface:

Phase 1: sync killers (commit b5ab1a8). tests/unit/test_client_mutation_killers.py adds 142 parametrised + standalone tests pinning every module constant via direct equality, every regex via valid/invalid examples (5–13 each), parse_retry_after via 16 hardcoded (input, expected) cases, canonical_response_hash via 11 snapshot hashes, _extract_provenance via four synthetic httpx.Response objects, ReleaseMismatchError message format, and UniProtClient construction. Result on the narrow scope: 62.7 % raw (232/370). Expanding the matrix tests: field from 14 listed files to the full tests/unit + tests/property (commit 1f9824f) then lifted it slightly to 63.24 % (234/370; +0.54 pp over the narrow scope, +4.32 pp on baseline).

The reason the lift was small: decoding the 136 surviving mutmut IDs against the local .mutmut-cache showed every survivor was inside an async method body that no test in the suite actually invokes — the sync killer file pinned constants and pure helpers correctly, but the bulk of client.py's mutation surface lives in the _req retry loop, the pin-release branch, the thin get_* / search_* async wrappers, id_mapping_submit / id_mapping_results, batch_entries filtering, and the cross-origin get_clinvar_records / get_alphafold_summary flows. The methodology fingerprinted its own gap.

Phase 2 — async killers (commit 390a54d): tests/unit/test_client_async_killers.py — 49 respx-mocked async tests pinning each surviving location: per-wrapper URL/method/path assertions, retry-loop behaviour (429 → retry; 5xx → retry; all 5xx → RuntimeError after MAX_RETRIES + 1 attempts; matched-pin → success; mismatched-pin → ReleaseMismatchError), id_mapping flow (POST + polling + redirectURL follow), batch_entries client-side filter / 100-cap / OR-join, get_clinvar_records two-step + idlist-empty short-circuit, get_alphafold_summary version-rendering. Result: 70.00 % raw (259/370, +6.76 pp on phase 1, +11.08 pp on baseline).
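The retry-loop assertions listed above can be sketched as follows. This is not the project's _req implementation, and it uses a hand-rolled stdlib fake where the real tests use respx so that it runs without extra dependencies. fetch_with_retry, FakeTransport, and the MAX_RETRIES value are all hypothetical.

```python
import asyncio

MAX_RETRIES = 3  # hypothetical value; the project's constant may differ


async def fetch_with_retry(send, max_retries=MAX_RETRIES):
    """Call `send()` until it returns a non-retryable status.

    Retries on 429 and 5xx; raises RuntimeError once max_retries + 1
    attempts have all returned a retryable status.
    """
    for _ in range(max_retries + 1):
        status = await send()
        if status != 429 and status < 500:
            return status
    raise RuntimeError("retries exhausted")


class FakeTransport:
    """Replays a scripted list of status codes and counts the calls."""

    def __init__(self, statuses):
        self._statuses = list(statuses)
        self.calls = 0

    async def __call__(self):
        status = self._statuses[self.calls]
        self.calls += 1
        return status


def run_case(statuses):
    """Run one scripted scenario; return (outcome, number of attempts)."""
    transport = FakeTransport(statuses)
    try:
        outcome = asyncio.run(fetch_with_retry(transport))
    except RuntimeError:
        outcome = "RuntimeError"
    return outcome, transport.calls


print(run_case([429, 200]))                  # one retry, then success
print(run_case([503] * (MAX_RETRIES + 1)))   # every attempt is retryable
```

Asserting the attempt count as well as the outcome is what lets a test of this shape kill off-by-one mutants in the loop bound.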

Final v1.1.2-uplift table:

| module | killed | survived | total | wall time | raw kill rate | Δ vs baseline |
|---|---|---|---|---|---|---|
| cache | 23 | 5 | 28 | 1m59s | 82.1 % (≈100 % behavioural) | unchanged (no new cache tests) |
| proteinchem | 228 | 21 | 249 | 7m21s | 91.6 % | +55.9 pp (was 35.7 %) |
| client | 259 | 111 | 370 | 2h31m | 70.00 % | +11.08 pp (was 58.92 %) |

Why client landed at 70 % rather than ≥85 %: the remaining 111 mutants split into three categories (decoded via the local .mutmut-cache against the v4 survivor list):

  • Equivalent / hard-to-kill mutants (~15–20): @property decorator mutations on lines 277, 282; ternary-branch variants on observed_disp = observed if observed is not None else "(absent)"; generator-expression equivalents in " OR ".join(f"accession:{a}" for a in valid).
  • Loose-assertion gaps (~40–50): existing tests assert the right behaviour but with substring rather than exact-equality checks; for example, "fasta" in accept.lower() survives a "XXfasta...XX" wrap. Tightening these assertions is the highest-ROI follow-up.
  • Untested code paths (~40): the id_mapping_submit retry loop's 429/5xx branches; get_alphafold_summary's latestVersion-absent fallback; get_clinvar_records's provenance-block construction. Phase 2 covers the success paths but not all the retry / fallback branches.
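The loose-assertion category is easy to reproduce in isolation. In the sketch below the header value is hypothetical, and the mutant is built by hand using the "XX…XX" wrap described above.

```python
ORIGINAL_ACCEPT = "text/plain;format=fasta"      # hypothetical header value
MUTATED_ACCEPT = "XX" + ORIGINAL_ACCEPT + "XX"   # string-literal mutant


def loose_check(accept: str) -> bool:
    """Substring assertion: true for any string that contains 'fasta'."""
    return "fasta" in accept.lower()


def exact_check(accept: str) -> bool:
    """Exact-equality assertion: true only for the intended header value."""
    return accept == ORIGINAL_ACCEPT


# Loose check accepts both values, so the mutant survives.
print(loose_check(ORIGINAL_ACCEPT), loose_check(MUTATED_ACCEPT))
# Exact check accepts only the original, so the mutant is killed.
print(exact_check(ORIGINAL_ACCEPT), exact_check(MUTATED_ACCEPT))
```

Switching an assertion from the first form to the second costs one line per test, which is why this category is the cheapest to close.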

Hitting ≥85 % on client requires another iteration that tightens the loose assertions and adds tests for the untested branches. Estimated cost: ~1.5–2 h of test writing + ~3-h CI cycle.

v1.1.2 + proteinchem uplift (run 25032660208, 2026-04-28 03:42 UTC)

A targeted uplift PR (branch fix/proteinchem-mutation-uplift) added tests/unit/test_proteinchem_mutation_killers.py (~100 parametrised single-residue assertions pinning every entry of _RESIDUE_MASS, _KYTE_DOOLITTLE, the side-chain pK dicts, the extinction-coefficient magic numbers 1490 / 5500 / 125, the N/C-terminus pKs, and STANDARD_AA) and tightened the four loose-tolerance assertions in tests/unit/test_round_one_clinical.py (abs_tol 0.01 / 1e-3 → 1e-6). Re-run on the same per-test-file-scoped matrix:

| module | killed | survived | total | wall time | raw kill rate | Δ vs baseline |
|---|---|---|---|---|---|---|
| cache | 23 | 5 | 28 | 1m59s | 82.1 % (≈100 % behavioural) | unchanged (no new cache tests) |
| proteinchem | 228 | 21 | 249 | 7m21s | 91.6 % | +55.9 pp (was 35.7 %) |

The 21 surviving proteinchem mutants are concentrated in the module docstring (lines 1-20) and the reference-data block (lines 26-32 / 56-57 / 80-94, which are the inline comments above the _RESIDUE_MASS, _KYTE_DOOLITTLE, and _PK_* dicts). All are equivalent mutants — mutmut 2.x mutating string literals in """…""" blocks or # … comments cannot change runtime behaviour. Behavioural kill rate is therefore effectively 100 % on proteinchem.py.
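The effect of the tolerance tightening in this uplift can be shown with a small, self-contained sketch. The function and sequence are hypothetical (this is not proteinchem.py), the table is a three-residue subset of the Kyte-Doolittle scale, and the mutant assumes a numeric-literal mutation that shifts one constant by 1.

```python
import math

# Kyte-Doolittle hydropathy values for three residues (illustrative subset).
SCALE = {"A": 1.8, "G": -0.4, "L": 3.8}
# Hand-built mutant: one constant shifted by 1 (assumed mutation operator).
MUTANT_SCALE = {**SCALE, "L": 4.8}


def mean_hydropathy(seq: str, scale: dict) -> float:
    """Average per-residue hydropathy of a sequence."""
    return sum(scale[aa] for aa in seq) / len(seq)


SEQ = "A" * 100 + "G" * 99 + "L"   # 200 residues, a single leucine
expected = mean_hydropathy(SEQ, SCALE)
mutated = mean_hydropathy(SEQ, MUTANT_SCALE)   # shifted by 1/200 = 0.005

# abs_tol=0.01 hides the 0.005 shift, so the mutant survives.
loose_survives = math.isclose(mutated, expected, abs_tol=0.01)
# abs_tol=1e-6 exposes it, so the mutant is killed.
tight_survives = math.isclose(mutated, expected, abs_tol=1e-6)
# A single-residue assertion pins the constant directly.
pinned_survives = MUTANT_SCALE["L"] == 3.8

print(loose_survives, tight_survives, pinned_survives)
```

Averaging dilutes a single changed constant, so an aggregate assertion with a generous tolerance cannot see it; pinning each table entry removes the dilution entirely.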

Workflow note: prior to this run, mutmut's exit-code-2-on-survivors behaviour combined with the GitHub Actions default bash -e short-circuited the per-job step before the post-run capture/upload steps ran, so each module's job reported failure even though mutmut completed cleanly. Fixed in the same uplift commit:

  • || true masks mutmut's non-zero exit on the pipeline
  • if: always() on the Capture / Compute / Save / Upload steps so the artefact lands regardless of mutmut's exit code

v1.1.2 baseline (run 25015528542, 2026-04-27 19:37 UTC)

Per-test-file scoping replaced the v1.1.0 "run the entire tests/unit/ per mutant" approach: each matrix job now invokes mutmut with --runner='pytest <files-that-import-this-module>'. Result: 4 of 6 modules now complete (vs 2/6 in the v1.1.0 baseline); 4 of 6 produce non-trivial measured kill rates including the cache module that previously read 0 % (the v1.1.0 "0 %" was a parser bug — the runtime log carried real numbers but mutmut results didn't surface them; fixed in commit d1050ad).

| module | killed | timeout | suspicious | survived | total | wall time | kill rate | note |
|---|---|---|---|---|---|---|---|---|
| `__init__` | 0 | 0 | 0 | 0 | 0 | 1m38s | n/a | file too small to generate mutants (just imports + importlib.metadata lookup) |
| cache | 23 | 0 | 0 | 5 | 28 | 2m04s | 82.1 % (raw) / ~100 % behavioural | the 5 survivors are all in the module docstring (lines 1, 15-16, 20-21), so they are equivalent mutants by definition (mutating a docstring can't change runtime behaviour) |
| proteinchem | 89 | 0 | 0 | 160 | 249 | 7m12s | 35.7 % | mix of docstring/comment equivalents and real test gaps; many tests use math.isclose(..., abs_tol=0.01) which can't kill small constant mutations within tolerance; uplift work tracked below |
| client | 218 | 0 | 0 | 152 | 370 | 1h51m | 58.9 % | first time client has been measured at all; the v1.1.0 run timed out before processing any client mutants |
| formatters | 512 | 0 | 0 | 320 | 832 of 2097 | 3h00m (timeout) | 61.5 % on first 832 mutants | partial: 60 % of formatters.py is unmeasured; a complete pass needs further bisection (see v1.2.0 action item) |
| server | 190 | 0 | 0 | 163 | 353 of 1318 | 3h00m (timeout) | 53.8 % on first 353 mutants | partial: 73 % of server.py is unmeasured; same constraint as formatters |
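The per-test-file scoping behind this baseline needs, for each module, the list of test files that import it. One way to derive that list is sketched below. It is an assumption about how such a list could be built, not the workflow's actual mechanism; the helper names and the inline test sources are hypothetical.

```python
import ast


def imports_module(source: str, module: str) -> bool:
    """Return True if `source` contains an absolute import of `module`.

    Relative imports (``from . import x``) are ignored in this sketch.
    """
    prefix = module + "."
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a.b.c  /  import a.b.c as x
            if any(a.name == module or a.name.startswith(prefix) for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # from a.b.c import x  /  from a.b.c.sub import x
            if node.module == module or node.module.startswith(prefix):
                return True
            # from a.b import c
            if any(f"{node.module}.{a.name}" == module for a in node.names):
                return True
    return False


def runner_command(module: str, test_sources: dict) -> str:
    """Build a pytest command limited to the files that import `module`."""
    selected = sorted(
        path for path, src in test_sources.items() if imports_module(src, module)
    )
    return "pytest " + " ".join(selected)


# Hypothetical test files, given inline so the sketch runs without a checkout.
TEST_SOURCES = {
    "tests/unit/test_cache.py": "from uniprot_mcp import cache\n",
    "tests/unit/test_client.py": "import uniprot_mcp.client as client\n",
    "tests/unit/test_both.py": "from uniprot_mcp.cache import get\nimport uniprot_mcp.client\n",
}

print(runner_command("uniprot_mcp.cache", TEST_SOURCES))
```

The resulting string is the kind of value that would be passed to mutmut's --runner option for that module's matrix job.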

Effective behavioural kill rates (after excluding docstring equivalent-mutant noise; rough estimate from inspecting which survivor lines fall inside """...""" blocks):

  • cache: 100 % (all 5 survivors are docstring lines)
  • proteinchem: 55-65 % (lines 5-12 + 15-42 + 45-62 are docstring + reference data; lines 192-249 are real code)
  • client: 65-75 % (estimate; survivor analysis pending)
  • formatters, server: too partial to estimate behaviourally

The aspirational ≥95 % gate is set against raw kill rate; the 5 % buffer is precisely there to absorb equivalent-mutant noise. mutmut 2.x has no built-in --exclude-docstrings flag.
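Because there is no such flag, the behavioural estimate has to be computed after the run. A minimal sketch, assuming the survivor line numbers have already been extracted from the mutmut results, is shown below. The function names and sample source are hypothetical, and inline # comments are not visible to ast, so comment-line survivors would need separate handling.

```python
import ast
from typing import List, Optional, Set


def docstring_lines(source: str) -> Set[int]:
    """1-based line numbers occupied by module/class/function docstrings."""
    lines: Set[int] = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(
            node, (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
        ):
            continue
        body = node.body
        if (
            body
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        ):
            lines.update(range(body[0].lineno, body[0].end_lineno + 1))
    return lines


def behavioural_kill_rate(
    killed: int, survivor_lines: List[int], source: str
) -> Optional[float]:
    """Kill rate after discarding survivors that sit on docstring lines."""
    doc = docstring_lines(source)
    real_survivors = [ln for ln in survivor_lines if ln not in doc]
    total = killed + len(real_survivors)
    return killed / total if total else None


SAMPLE = '''"""Module docstring,
spanning two lines."""

LIMIT = 100


def capped(n):
    """Clamp n to LIMIT."""
    return min(n, LIMIT)
'''

print(sorted(docstring_lines(SAMPLE)))
print(behavioural_kill_rate(3, [1, 2, 8, 9], SAMPLE))
```

In the sample, three of the four survivors sit on docstring lines, so the raw rate of 3/7 becomes a behavioural rate of 3/4.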

Action items for v1.1.x → v1.2.0:

  1. ~~proteinchem constant-tolerance uplift~~ — DONE (run 25032660208, 2026-04-28). Raw kill rate 35.7 % → 91.6 %; behavioural ≈ 100 % (remaining 21 survivors are docstring + inline-comment equivalent mutants).
  2. ~~client behavioural-survivor analysis~~ — PARTIALLY DONE (runs 25049571013 + 25072369933, 2026-04-28). Phase 1 (sync killers) + phase 2 (async killers) raised client raw kill rate from 58.92 % → 70.00 % (+11.08 pp). Still short of the ≥85 % target. v1.2.0 follow-up: tighten the loose-assertion gaps in test_client_async_killers.py (substring → exact-equality on Accept headers, key-name pinning on resp.json()["jobId"], retry-branch coverage in id_mapping_submit, fallback branch coverage in get_alphafold_summary).
  3. formatters + server bisection: these two modules carry ~84 % of the project's mutants (3415 of 4062, using the totals in the baseline table). Either:
       • split each into 3-4 matrix entries by function/class (--paths-to-mutate=src/uniprot_mcp/server.py::uniprot_get_entry etc.; mutmut 2.x supports the module::function syntax), OR
       • port to mutmut 3.x once its CLI ergonomics settle (it supports parallel mutant execution which would solve this directly).

Historical runs

| Run | Date | Modules completed | Notes |
|---|---|---|---|
| 25072369933 | 2026-04-28 | client (sync + async killers, 70.00 %); cache, proteinchem unchanged | Phase 2 of client uplift (async killers + full tests/unit scope); 2h31m wall time |
| 25049571013 | 2026-04-28 | client (sync killers + full scope, 63.24 %) | Phase 1 of client uplift; +0.5 pp over narrow-scope showed survivors are async-only |
| 25034689002 | 2026-04-28 | client (sync killers + narrow scope, 62.70 %) | First client uplift attempt; misconfigured runner scope |
| 25032660208 | 2026-04-28 | proteinchem 91.6 %, cache 82.1 % | proteinchem uplift; +55.9 pp |
| 25015528542 | 2026-04-27 | `__init__`, cache, proteinchem, client (4/6); formatters, server partial | First per-test-file-scoped run; 4 modules complete; first measured numbers above 0 % for cache, proteinchem, client |
| 24965548283 | 2026-04-26 | `__init__`, cache (2/6) | First matrix; 4 modules timed out at 90 min; cache reported as 0 % due to parser bug (real number was already 23/28 = 82 %) |

Why this matters for adoption

A regulated bio-pharma compliance officer evaluating this MCP for use in 2030 will look for measurable signals of test-suite quality. Line coverage at 100% is necessary but not sufficient — mutation testing closes the "covered-but-not-asserted" loophole.

The gate is >= 95%, not 100%, because:

  • 100% kill rate is achievable only by rejecting all equivalent mutants — a manual, judgement-laden process that introduces more bias than it removes.
  • 95% is the threshold above which empirical studies (e.g. DeMillo, Lipton, Sayward 1978; more recent surveys converge on the same range) suggest that diminishing returns set in: the marginal cost of writing a test to kill the remaining 5% exceeds the marginal value, and the suite quality at >= 95% is comparable to manual peer review.

If you are adopting uniprot-mcp and want this gate higher, file an issue with the proposed threshold and your reasoning; we will consider it for the next major release.