Practical Strategies for Testing Input Normalizers

Normalization transforms input into a canonical form; testing its idempotence—normalize(normalize(s)) == normalize(s)—helps catch inconsistent or stateful implementations. Below are pragmatic, implementable strategies and examples you can add to unit and property-based test suites.

1. Define the normalizer contract

State the expected behavior clearly for your domain: which normalization form (NFC/NFD/NFKC/NFKD), whether to case-fold, remove diacritics, collapse whitespace, or apply URL/hostname canonicalization. Distinguish lossless normalizers (should be strictly idempotent) from intentionally lossy ones (where repeated application must be stable but can drop information).

2. Idempotence tests (simple, deterministic)

– Create a small deterministic suite that asserts normalize(s) == normalize(normalize(s)) for representative inputs: ASCII, accented characters, ligatures, combined marks, zero-width characters, mixed-case, and known problematic characters (e.g., Kelvin sign U+212A).

– Include domain-specific examples: email local-parts, filenames, query strings, and percent-encoded URLs. For URL canonicalizers, also check that the canonical form survives a parse/serialize round-trip: stringify(parse(normalize(u))) == normalize(u).
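The deterministic suite above can be sketched as follows. This is a minimal example, assuming the normalizer under test is NFC normalization plus whitespace collapsing; the `normalize` function is a stand-in for your own implementation.

```python
import unicodedata

def normalize(s: str) -> str:
    # Assumed contract for this sketch: NFC + collapse runs of whitespace.
    s = unicodedata.normalize("NFC", s)
    return " ".join(s.split())

CASES = [
    "plain ascii",
    "caf\u00e9",               # precomposed e-acute
    "cafe\u0301",              # decomposed e + combining acute
    "\u212aelvin",             # Kelvin sign U+212A
    "a\u200bb",                # zero-width space
    "  leading   and   inner  spaces ",
]

def test_idempotence():
    for s in CASES:
        once, twice = normalize(s), normalize(normalize(s))
        assert once == twice, f"not idempotent for {s!r}: {once!r} != {twice!r}"
```

Keep the curated list small and commented; each entry should document why it is there.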

3. Property-based testing (broad coverage)

– Generator design: produce valid Unicode strings plus controlled injections of edge classes (combining marks, non-spacing marks, compatibility characters, supplementary plane codepoints, zero-width joiners/non-joiners, mixed scripts). For URLs, generate host, path, and query fragments including percent-encodings and Punycode variants.

– Properties to assert:

– Idempotence: normalize(normalize(s)) == normalize(s).

– Stability under no-op: if s is already in canonical form (use precomputed canonical examples), normalize(s) == s.

– Round-trip where applicable: for lossless transforms, parse(normalize(s)) should equal parse(s) (or preserve semantic identity).
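A stdlib-only sketch of such a property run is below; in practice you would use a property-based testing library such as Hypothesis, which adds shrinking for free. The generator, edge-class list, and case count are illustrative assumptions.

```python
import random
import unicodedata

def normalize(s: str) -> str:
    # Stand-in for the normalizer under test (assumed: NFC).
    return unicodedata.normalize("NFC", s)

EDGE_CODEPOINTS = [
    0x0301,   # combining acute accent
    0x200C,   # zero-width non-joiner
    0x200D,   # zero-width joiner
    0xFB01,   # latin small ligature fi (compatibility character)
    0x1F600,  # supplementary-plane codepoint
]

def gen_string(rng: random.Random, max_len: int = 12) -> str:
    # Mix ordinary codepoints with controlled injections of edge classes.
    chars = []
    for _ in range(rng.randint(0, max_len)):
        if rng.random() < 0.4:
            chars.append(chr(rng.choice(EDGE_CODEPOINTS)))
        else:
            chars.append(chr(rng.randint(0x20, 0x2FF)))
    return "".join(chars)

def check_properties(n_cases: int = 500, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(n_cases):
        s = gen_string(rng)
        once = normalize(s)
        # Idempotence
        assert normalize(once) == once, f"idempotence failed for {s!r}"
        # Stability under no-op: an already-canonical input is unchanged
        if unicodedata.is_normalized("NFC", s):
            assert normalize(s) == s
```

Seeding the generator keeps failures reproducible across CI runs.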

4. Shrinking and failing-case triage

– When a PBT run finds a counterexample, ensure your test harness minimizes the failing input while preserving the invariant violation (use built-in shrinking or custom shrinkers that avoid removing combining marks needed to reproduce the bug).

– Log both original and normalized forms and the codepoint sequences (e.g., escape as U+XXXX) so you can tell whether the failure arises from ordering, encoding, or platform behavior.
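A small helper for that triage logging might look like this (the function names are illustrative):

```python
def codepoints(s: str) -> str:
    """Render a string as a space-separated U+XXXX sequence for triage logs."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

def report_failure(original: str, normalized: str) -> str:
    # Show both forms with their codepoint sequences so ordering,
    # encoding, and platform differences are distinguishable at a glance.
    return (f"original:   {original!r} [{codepoints(original)}]\n"
            f"normalized: {normalized!r} [{codepoints(normalized)}]")
```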

5. Handling lossy normalizers

– If normalization intentionally loses information (e.g., removing diacritics, mapping full-width to ASCII), assert stability rather than equality: normalize(normalize(s)) == normalize(s) must still hold, but normalize(s) may not equal s.

– Add tests that verify only allowed information is lost (for example, diacritics removed but base letters preserved). Use golden examples to document accepted lossiness.
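A hedged sketch of one such lossy normalizer (decompose, drop combining marks, recompose) and its stability test:

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    # Intentionally lossy: accents are dropped, base letters kept.
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)

def test_lossy_stability():
    for s in ["caf\u00e9", "na\u00efve", "\u00c5ngstr\u00f6m"]:
        once = strip_diacritics(s)
        assert strip_diacritics(once) == once   # stable under repetition
        assert once != s                        # lossy: accents removed
        assert len(once) == len(s)              # base letters preserved
```

The length check is one concrete way to verify only the allowed information (combining marks) was lost.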

6. Combined pipelines and ordering checks

– Test common pipelines: normalize → casefold → trim → collapse-whitespace. Verify that changing the order (e.g., casefold before normalization) does not silently break invariants—encode these as tests where applicable.

– For systems that both canonicalize and validate, assert normalization happens before validation: validate(s) may differ from validate(normalize(s)); prefer tests that ensure security checks operate on normalized input.
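The pipeline-level invariant can be encoded directly: the composed pipeline should itself be idempotent. This sketch assumes the order normalize → casefold → trim → collapse-whitespace.

```python
import unicodedata

def pipeline(s: str) -> str:
    s = unicodedata.normalize("NFC", s)
    s = s.casefold()
    s = s.strip()
    return " ".join(s.split())

def test_pipeline_idempotent():
    cases = ["  Mixed CASE  text ", "Stra\u00dfe", "Cafe\u0301  "]
    for s in cases:
        once = pipeline(s)
        assert pipeline(once) == once, f"pipeline not idempotent for {s!r}"
```

Testing the composition, not just each stage, is what catches ordering bugs: each stage may be idempotent in isolation while the pipeline is not.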

7. Unicode-specific checks and examples

– Include tests for: precomposed vs decomposed forms (e.g., U+00E9 vs U+0065 U+0301), compatibility characters and ligatures (ﬁ U+FB01 → fi), zero-width characters (ZWJ/ZWNJ), and homoglyph cases (Cyrillic/Latin lookalikes). Use the Unicode Normalization Forms (NFC/NFD/NFKC/NFKD) from the Unicode standard as references.
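These curated pairs are easy to encode as golden cases; each pair should compare equal after the named normalization form:

```python
import unicodedata

NFC_EQUAL = [
    ("\u00e9", "e\u0301"),        # é precomposed vs decomposed
    ("A\u030a", "\u00c5"),        # A + combining ring vs Å
]
NFKC_EQUAL = [
    ("\ufb01", "fi"),             # ligature ﬁ vs ASCII "fi"
    ("\u212a", "K"),              # Kelvin sign vs Latin capital K
]

def test_unicode_forms():
    for a, b in NFC_EQUAL:
        assert unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
    for a, b in NFKC_EQUAL:
        assert unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
```

Note the Kelvin sign only collapses under the compatibility forms (NFKC/NFKD), which makes it a good probe for which form an implementation actually applies.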

8. URL and host canonicalization

– Test percent-encoding normalization, uppercasing of percent-encoded hex digits, Punycode ↔ Unicode hostnames, and removal/normalization of dot-segments. Include examples where compatibility normalization changes presentation (full-width slashes, IDEOGRAPHIC FULL STOP) and assert the final serialized URL is stable under repeated normalization.
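A deliberately simplified canonicalizer illustrates the stability assertion. This sketch only lowercases scheme/host and re-percent-encodes the path; real canonicalizers also need dot-segment removal and IDNA handling, and must decide how to treat reserved characters like %2F.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    # Illustrative only: lowercase scheme and host, re-encode the path
    # with uppercase hex. Not safe for paths containing encoded "/".
    parts = urlsplit(url)
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, parts.fragment))

def test_url_stable():
    for url in ["HTTP://Example.COM/%7euser", "http://example.com/a%20b?q=1"]:
        once = canonicalize(url)
        assert canonicalize(once) == once
```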

9. Fuzz + property hybrid

– Pair generators with mutational fuzzers that insert byte sequences, invalid UTF-8, or random percent-encodings to find crashes or inconsistent behavior. Treat crashes as test failures; treat invalid inputs according to your contract (reject vs sanitize).
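A minimal mutation loop along these lines, assuming a contract that sanitizes (rather than rejects) invalid UTF-8:

```python
import random
import unicodedata

def normalize(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def fuzz_round(seed: bytes, rng: random.Random) -> None:
    # Flip a few bytes in a valid UTF-8 seed, decode leniently, and check
    # the normalizer neither raises nor violates idempotence.
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    s = data.decode("utf-8", errors="replace")  # sanitize per contract
    once = normalize(s)
    assert normalize(once) == once

def run_fuzz(iterations: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    corpus = "caf\u00e9 %C3%A9".encode("utf-8")
    for _ in range(iterations):
        fuzz_round(corpus, rng)
```

If your contract is reject-on-invalid instead, replace the lenient decode with a strict one and assert the expected exception type.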

10. Cross-platform and library-aware assertions

– Different runtimes may implement Unicode functions slightly differently; pin expected normalization behavior in tests to the chosen library (e.g., ICU, Python unicodedata, Java Normalizer) and add a compatibility matrix if you support multiple runtimes.
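One cheap way to do this in Python is to pin the Unicode data version your golden cases were validated against, so a silent runtime upgrade fails loudly. The floor below is an assumption; record the version your CI actually uses.

```python
import unicodedata

def test_unicode_version_pinned():
    # unidata_version looks like "15.1.0"; pin at least the major version
    # your normalization golden cases were validated against.
    major = int(unicodedata.unidata_version.split(".")[0])
    assert major >= 9, (
        f"unexpected Unicode data version {unicodedata.unidata_version}; "
        "re-validate golden normalization cases before bumping this pin"
    )
```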

11. Example quick checklist you can add to CI

– Deterministic idempotence suite (20–50 curated cases).

– Property-based idempotence run (N cases per PR).

– Fuzzing job that mutates serialized inputs for 24–72 hours if security-sensitive.

– Cross-language spot checks when library or platform changes are introduced.

References

Use the Unicode Normalization Forms (Unicode Standard) as the canonical reference when choosing forms and interpreting results.
