Normalization transforms input into a canonical form; testing its idempotence—normalize(normalize(s)) == normalize(s)—helps catch inconsistent or stateful implementations. Below are pragmatic, implementable strategies and examples you can add to unit and property-based test suites.
1. Define the normalizer contract
State the expected behavior clearly for your domain: which normalization form (NFC/NFD/NFKC/NFKD), whether to case-fold, remove diacritics, collapse whitespace, or apply URL/hostname canonicalization. Distinguish lossless normalizers (should be strictly idempotent) from intentionally lossy ones (where repeated application must be stable but can drop information).
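One way to make such a contract concrete is to pin it down in code. The sketch below is a hypothetical contract (the choice of NFC, casefolding, and whitespace collapsing is an illustrative assumption, not a recommendation):

```python
import re
import unicodedata

def normalize(s: str) -> str:
    """Hypothetical contract: NFC, casefold, trim, collapse whitespace runs.

    Intentionally lossy (case and whitespace information is dropped), so the
    test obligation is stability under repetition, not normalize(s) == s.
    """
    s = unicodedata.normalize("NFC", s)
    s = s.casefold()
    return re.sub(r"\s+", " ", s).strip()
```

Writing the contract as an executable function makes every later test in this document directly expressible as an assertion against it.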
2. Idempotence tests (simple, deterministic)
– Create a small deterministic suite that asserts normalize(s) == normalize(normalize(s)) for representative inputs: ASCII, accented characters, ligatures, combining marks, zero-width characters, mixed-case strings, and known problematic characters (e.g., the Kelvin sign U+212A).
– Include domain-specific examples: email local-parts, filenames, query strings, and percent‑encoded URLs. For URL canonicalizers, also check that the canonical form survives a parse/serialize round trip: normalize(url) == stringify(parse(normalize(url))).
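A minimal deterministic suite along these lines, using a stand-in NFC normalizer from Python's `unicodedata` (swap in your real normalizer):

```python
import unicodedata

def normalize(s: str) -> str:
    # Stand-in normalizer for illustration: plain NFC.
    return unicodedata.normalize("NFC", s)

CASES = [
    "ascii only",
    "caf\u00e9",       # precomposed e-acute
    "cafe\u0301",      # decomposed e + COMBINING ACUTE ACCENT
    "\ufb01le",        # fi ligature (preserved by NFC, folded by NFKC)
    "zero\u200bwidth", # zero-width space
    "\u212a",          # KELVIN SIGN (canonically decomposes to K)
]

for case in CASES:
    assert normalize(normalize(case)) == normalize(case), f"not idempotent: {case!r}"
```

Curated cases like these double as documentation of the characters your normalizer is known to handle.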
3. Property-based testing (broad coverage)
– Generator design: produce valid Unicode strings plus controlled injections of edge classes (combining marks, non-spacing marks, compatibility characters, supplementary plane codepoints, zero-width joiners/non-joiners, mixed scripts). For URLs, generate host, path, and query fragments including percent-encodings and Punycode variants.
– Properties to assert:
– Idempotence: normalize(normalize(s)) == normalize(s).
– Stability under no-op: if s already satisfies your stored canonical constraints (precomputed examples), normalize(s) == s.
– Round-trip where applicable: for lossless transforms, parse(normalize(s)) should equal parse(s) (or preserve semantic identity).
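In practice you would use a property-based library such as Hypothesis; the stdlib-only sketch below shows the generator-plus-property shape, with an edge-class injection list and an NFC stand-in normalizer that are both illustrative assumptions:

```python
import random
import unicodedata

def normalize(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# Edge classes to inject: combining mark, ZWJ, ZWNJ, ligature,
# supplementary-plane codepoint.
EDGES = ["\u0301", "\u200d", "\u200c", "\ufb01", "\U0001F600"]

def gen_string(rng: random.Random) -> str:
    chars = []
    for _ in range(rng.randrange(0, 20)):
        if rng.random() < 0.3:
            chars.append(rng.choice(EDGES))       # targeted edge injection
        else:
            chars.append(chr(rng.randrange(0x20, 0x2FF)))
    return "".join(chars)

def run_property(n: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)
    for _ in range(n):
        s = gen_string(rng)
        # Property: idempotence.
        assert normalize(normalize(s)) == normalize(s), repr(s)

run_property()
```

A real library adds shrinking and reporting for free; the hand-rolled version above mainly documents the generator design.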
4. Shrinking and failing-case triage
– When a PBT run finds a counterexample, ensure your test harness minimizes the failing input while preserving the invariant violation (use built-in shrinking or custom shrinkers that avoid removing combining marks needed to reproduce the bug).
– Log both original and normalized forms and the codepoint sequences (e.g., escape as U+XXXX) so you can tell whether the failure arises from ordering, encoding, or platform behavior.
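A small helper for producing such codepoint-level logs (sketch):

```python
def codepoints(s: str) -> str:
    """Render a string as space-separated U+XXXX escapes for failure logs."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# codepoints("e\u0301") renders the decomposed pair as "U+0065 U+0301",
# making ordering and encoding differences visible at a glance.
```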
5. Handling lossy normalizers
– If normalization intentionally loses information (e.g., removing diacritics, mapping full-width forms to ASCII), assert stability rather than identity: normalize(normalize(s)) == normalize(s) must still hold, but normalize(s) may not equal s.
– Add tests that verify only allowed information is lost (for example, diacritics removed but base letters preserved). Use golden examples to document accepted lossiness.
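A sketch of a deliberately lossy diacritic stripper with both a stability check and golden examples documenting the accepted loss (the implementation here is an assumption for illustration, not a reference algorithm):

```python
import unicodedata

def strip_diacritics(s: str) -> str:
    # Decompose to NFD, drop combining marks, recompose: lossy but stable.
    decomposed = unicodedata.normalize("NFD", s)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)

# Golden examples: only diacritics are lost, base letters survive.
GOLDEN = {"caf\u00e9": "cafe", "na\u00efve": "naive"}
for src, want in GOLDEN.items():
    assert strip_diacritics(src) == want
    # Stability: repeated application is a no-op.
    assert strip_diacritics(strip_diacritics(src)) == strip_diacritics(src)
```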
6. Combined pipelines and ordering checks
– Test common pipelines: normalize → casefold → trim → collapse-whitespace. Verify that changing the order (e.g., casefold before normalization) does not silently break invariants—encode these as tests where applicable.
– For systems that both canonicalize and validate, assert normalization happens before validation: validate(s) may differ from validate(normalize(s)); prefer tests that ensure security checks operate on normalized input.
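An ordering check can be encoded as two pipeline variants run over curated inputs, asserting each variant is idempotent and flagging order-sensitive inputs; the pipeline stages here are illustrative assumptions:

```python
import re
import unicodedata

def pipeline_a(s: str) -> str:
    # normalize -> casefold -> trim/collapse whitespace
    s = unicodedata.normalize("NFC", s)
    s = s.casefold()
    return re.sub(r"\s+", " ", s).strip()

def pipeline_b(s: str) -> str:
    # casefold first, then normalize
    s = s.casefold()
    s = unicodedata.normalize("NFC", s)
    return re.sub(r"\s+", " ", s).strip()

SAMPLES = ["Stra\u00dfe", "\u212a elvin", "cafe\u0301  MIX"]
for s in SAMPLES:
    for pipe in (pipeline_a, pipeline_b):
        assert pipe(pipe(s)) == pipe(s)       # each order must be stable
    if pipeline_a(s) != pipeline_b(s):
        print("order-sensitive input:", s)    # surface, then decide the contract
```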
7. Unicode-specific checks and examples
– Include tests for: precomposed vs decomposed forms (e.g., U+00E9 vs U+0065 U+0301), compatibility characters and ligatures (ﬁ U+FB01 → fi), zero-width characters (ZWJ/ZWNJ), and homoglyph cases (Cyrillic/Latin lookalikes). Use the Unicode Normalization Forms (NFC/NFD/NFKC/NFKD) from the Unicode standard as references.
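These cases translate directly into assertions with Python's `unicodedata`, whose behavior follows UAX #15:

```python
import unicodedata

precomposed = "\u00e9"   # e-acute as a single codepoint
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

assert precomposed != decomposed                           # distinct codepoints
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"     # ligature folded
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"  # NFC preserves it
```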
8. URL and host canonicalization
– Test percent-encoding normalization, uppercasing of percent-encoded hex digits (the RFC 3986 canonical form), Punycode ↔ Unicode hostnames, and removal/normalization of dot-segments. Include examples where compatibility normalization changes presentation (full-width slashes, IDEOGRAPHIC FULL STOP) and assert the final serialized URL is stable under repeated normalization.
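A stability test for a hypothetical URL canonicalizer built on `urllib.parse` (this sketch ignores userinfo and ports for brevity, and uses the stdlib `idna` codec for Punycode):

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Hypothetical canonicalizer: lowercase scheme, Punycode host,
    re-encoded path with uppercase percent-escapes."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    if host:
        # Unicode hostname -> ASCII (Punycode); ASCII labels pass through.
        host = host.encode("idna").decode("ascii")
    # Decode then re-encode the path so escapes get one canonical casing.
    path = quote(unquote(parts.path), safe="/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))

url = "HTTP://B\u00fccher.Example/caf%c3%a9%20menu"
# Idempotence: a second pass must change nothing.
assert canonicalize(canonicalize(url)) == canonicalize(url)
```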
9. Fuzz + property hybrid
– Pair generators with mutational fuzzers that insert byte sequences, invalid UTF-8, or random percent-encodings to find crashes or inconsistent behavior. Treat crashes as test failures; treat invalid inputs according to your contract (reject vs sanitize).
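A minimal mutational harness in this style, again with an NFC stand-in normalizer and an illustrative mutation list (a real fuzzer would use coverage feedback and byte-level mutation):

```python
import random
import unicodedata

def normalize(s: str) -> str:
    return unicodedata.normalize("NFC", s)

# Mutations: percent-escapes (valid and malformed), combining mark, ZWJ,
# replacement character.
MUTATIONS = ["%41", "%zz", "\u0301", "\u200d", "\ufffd"]

def mutate(s: str, rng: random.Random) -> str:
    i = rng.randrange(len(s) + 1)
    return s[:i] + rng.choice(MUTATIONS) + s[i:]

def fuzz(seed_inputs, rounds=500, seed=0):
    rng = random.Random(seed)
    corpus = list(seed_inputs)
    for _ in range(rounds):
        s = mutate(rng.choice(corpus), rng)
        try:
            out = normalize(s)
        except Exception as exc:          # crashes are test failures
            raise AssertionError(f"crash on {s!r}") from exc
        assert normalize(out) == out, f"unstable on {s!r}"
        corpus.append(s)                  # mutated inputs feed back in

fuzz(["seed", "caf\u00e9", "http://example.com/%41"])
```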
10. Cross-platform and library-aware assertions
– Different runtimes may implement Unicode functions slightly differently; pin expected normalization behavior in tests to the chosen library (e.g., ICU, Python unicodedata, Java Normalizer) and add a compatibility matrix if you support multiple runtimes.
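One lightweight way to pin behavior is to record the runtime's Unicode data version alongside a few golden assertions; this sketch targets CPython's `unicodedata`:

```python
import unicodedata

# Record the Unicode version the goldens were derived against; a mismatch
# on a new runtime is a prompt to re-review the goldens, not necessarily a bug.
print("Unicode data version:", unicodedata.unidata_version)

GOLDENS = {
    ("NFC", "cafe\u0301"): "caf\u00e9",
    ("NFKC", "\ufb01le"): "file",
}
for (form, src), want in GOLDENS.items():
    assert unicodedata.normalize(form, src) == want
```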
11. Example quick checklist you can add to CI
– Deterministic idempotence suite (20–50 curated cases).
– Property-based idempotence run (N cases per PR).
– Fuzzing job that mutates serialized inputs for 24–72 hours if security-sensitive.
– Cross-language spot checks when library or platform changes are introduced.
References
Use the Unicode Normalization Forms (Unicode Standard) as the canonical reference when choosing forms and interpreting results.
Sources
- Unicode Standard Annex #15: Unicode Normalization Forms (Unicode Consortium; 2006-05-01; Official source)
- Encoding and Transformations — PayloadsAllTheThings (Unicode Normalization examples) (PayloadsAllTheThings; 2026-01-01)