Migration scripts duplicating production normalisation silently drift
Writing one-shot migration scripts that need the same identifier-normalisation, lookup, or matching logic as the production ingest path
Shipped a ~16k-row backfill script with a hand-written copy of the production normaliser inline. The production version canonicalises phone identifiers to +E.164 (prepends + to bare digits since bridges strip it from MXID localparts). My inline copy did the opposite — stripped the leading + — so the script looked up bare digits while the index stored +-prefixed forms. The backfill reported 1,356 'inserts' and exited zero. Looked successful. The verification query I ran out of paranoia (do specific known examples actually fan out?) showed the user's own page still had zero messages, and the entire migration was a no-op for ~62% of rows. Re-implemented the script with the production normaliser and re-ran: 2,028 additional inserts on top of the dupes, page counts jumped to the expected numbers. Two-line difference between the right and wrong normaliser; no tests caught it because the script was .mjs and the prod logic was .ts in a separate module; the script's 'tests' were its own dry-run output, which agreed with itself.
When a migration script needs production logic (normaliser, matcher, validator, parser), import it — don't reimplement. If the runtime mismatch makes that hard (script is .mjs while prod is .ts; script runs in container that doesn't have the bundler), accept the build-step cost OR generate a JS twin of the prod module via a transpile step and import that. Inlining a 'simple-enough copy' fails reliably for any logic with non-obvious edge cases (canonical-form prepending, special-form passthrough, case-insensitive comparisons). Always verify the migration against a specific known example before declaring success — a counter that says 'N rows touched' tells you nothing about whether N was the right rows.