← BACK

Edit distance practical application

When was the last time you had a direct implementation for an algorithm?

At work today we needed to filter out malformed email domains from a list of 4.2 Million email ids. e.g gnail.com instead of gmail.com

Looking at the problem, it just occurred to me Oh this can be done using - Levenshtein distance (edit distance)

I decided to use pandas.py as it provided some high level constructs for manipulating csv. Also found this simple package implementing editdistance: editdistance · PyPI

In less than an hour, i hacked a script together using and we processed the 4.2M email ids to get 1342 email domains along with count of email ids using such domains.

When was the last time you had a direct implementation for an algorithm?
At work today we needed to filter out malformed email domains from a list of 4.2 Million email ids.
e.g https://t.co/CtHbFCEc0w vs https://t.co/R8Qi3DgZCD, https://t.co/dIpqGlmaOm vs https://t.co/9SJSZ7sdRT
— Shubham Sethi (@suave_sethi) August 16, 2022