DeDuplicate: Removing Duplicate Records

EZ24x7 de-dup Functional Specification

EZ24x7 de-dup reads the serviced list file, "a_list.csv", finds duplicate address records, and writes the best unique address records to "a_ddup.csv" and the duplicates to "x_ddup.csv". For a functional description of how addresses and names are evaluated, see the three sections titled “Judging…” that follow. “User Interface” concludes the article.
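
As a rough illustration of the file flow described above, the following Python sketch partitions a list into kept records and duplicates and writes each set to its own file. It is not the EZ24x7 implementation. The function names, the assumption that "a_list.csv" has a header row, and the two callback hooks (is_duplicate and is_better) are hypothetical stand-ins for the judging rules described in the sections that follow.

```python
import csv


def split_duplicates(rows, is_duplicate, is_better):
    """Partition rows into (kept, duplicates) using pairwise comparison.

    is_duplicate(a, b) and is_better(a, b) are hypothetical hooks standing in
    for the address, name, and best-address rules of the specification.
    """
    kept, duplicates = [], []
    for row in rows:
        for i, existing in enumerate(kept):
            if is_duplicate(row, existing):
                if is_better(row, existing):
                    duplicates.append(existing)  # demote the weaker record
                    kept[i] = row
                else:
                    duplicates.append(row)
                break
        else:
            kept.append(row)  # no earlier record matched, so it is unique so far
    return kept, duplicates


def dedupe_file(src, kept_path, dup_path, is_duplicate, is_better):
    """Read src (assumed to have a header row) and write the two output files."""
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    kept, duplicates = split_duplicates(rows, is_duplicate, is_better)
    for path, out_rows in ((kept_path, kept), (dup_path, duplicates)):
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(out_rows)
    return len(kept), len(duplicates)
```

A call such as dedupe_file("a_list.csv", "a_ddup.csv", "x_ddup.csv", is_duplicate, is_better) would mirror the file names used in this specification. The pairwise loop is quadratic in the number of records and is meant only to show the data flow.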

Judging Duplicate – Addresses

Within buildings, two records are considered an address match for EZ24x7 de-duplication if their ZIP Codes and delivery lines match up to the secondary sequence.

All other addresses must match on ZIP+4 and delivery point. This is necessary because Puerto Rico street records often need an urbanization name, whereas building addresses in Puerto Rico do not use one.
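
The two address rules above can be sketched in Python as follows. This is a minimal sketch, not the product's code. The AddressRecord class and all of its field names are hypothetical, and it reads "up to the secondary sequence" as comparing the part of the delivery line that precedes the secondary (unit) information, which is an interpretation rather than something the specification states outright.

```python
from dataclasses import dataclass


@dataclass
class AddressRecord:
    # All field names are hypothetical; the real record layout is not specified.
    zip5: str
    plus4: str
    delivery_point: str
    primary_line: str   # delivery line up to, but not including, the secondary
    is_building: bool


def addresses_match(a: AddressRecord, b: AddressRecord) -> bool:
    """Apply the building rule when both records are buildings, else the strict rule."""
    if a.is_building and b.is_building:
        # Building rule: ZIP and delivery line up to the secondary must agree.
        return a.zip5 == b.zip5 and a.primary_line == b.primary_line
    # All other addresses: ZIP+4 and delivery point must agree.
    return (a.zip5, a.plus4, a.delivery_point) == (b.zip5, b.plus4, b.delivery_point)
```

Under this reading, a building default record and a record for a specific unit in the same building count as an address match, and the choice between them is left to the best-address rules.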

Judging Duplicate – Names

The fuzzy matching techniques below are identified by letter. The matching logic that follows cites these letters to indicate which techniques are employed at each step of identifying duplicate names.

A – Abbreviations are used when matching first and middle names.

B – Business word alternates, based on the CASS USPS list, are considered when comparing business names.

N – Nicknames – A USPS published list of nicknames is used to expand name possibilities. If the test name is in the control name's list of nicknames, it is a match.

L – Levenshtein distance measures the difference between two sequences as the minimum number of single-character insertions, deletions and substitutions needed to change one into the other.

P – Philips' Double Metaphone encoding is used to compare phonetic variants.
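
To make technique L concrete, here is a short Python sketch of Levenshtein distance using the standard two-row dynamic programming method. The similarity_percent helper is purely an assumption added for illustration: the specification does not say how AES CASS logic turns techniques A, B, N, L and P into a confidence percentage, and this helper should not be read as that formula.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical scaling of edit distance to 0..100; not the AES CASS formula."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

For example, levenshtein("SMITH", "SMYTH") is 1, because a single substitution turns one spelling into the other.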

If either name contains a business word, AES CASS logic using (A, B, L and P) measures the confidence of the business name match. If confidence is greater than or equal to 75%…

Return Duplicate

If either name is missing a first or last name and both entire names begin with the same letter, use the same AES CASS logic described above. If confidence is greater than or equal to 85%…

Return Duplicate

If the two names do not both contain at least a first and a last name…

Return NOT Duplicate

If both names have a suffix in the group {"JR", "SR", "II", "III", "IV"} and the suffixes do not match…

Return NOT Duplicate

If last names do not begin with the same letter…

Return NOT Duplicate

If last names do not reach a confidence of 90% using (A, L and P)…

Return NOT Duplicate

If first names do not reach a confidence of 90% using (A, N, L and P)…

Return NOT Duplicate

If both names have middle names and the middle names do not reach a confidence of 90% using (A, N, L and P)…

Return NOT Duplicate

Absent the six fatal flaws listed above while matching last, first and middle names…

Return Duplicate
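
The six fatal-flaw checks for personal names can be sketched as a chain of early returns. This is an illustrative sketch only. The PersonName class, the function name, and the two scoring callbacks are hypothetical. The callbacks stand in for the (A, L and P) and (A, N, L and P) confidence measures, which are not defined here, and the sketch assumes names are already parsed and upper-cased. The business-name and partial-name branches described earlier are omitted.

```python
from dataclasses import dataclass
from typing import Callable

SUFFIXES = {"JR", "SR", "II", "III", "IV"}
Scorer = Callable[[str, str], float]  # returns a confidence from 0 to 100


@dataclass
class PersonName:
    # Hypothetical parsed-name layout; empty string means the part is absent.
    first: str = ""
    middle: str = ""
    last: str = ""
    suffix: str = ""


def is_duplicate_person(a: PersonName, b: PersonName,
                        score_last: Scorer, score_given: Scorer,
                        threshold: float = 90.0) -> bool:
    # 1. Both names need at least a first and a last name.
    if not (a.first and a.last and b.first and b.last):
        return False
    # 2. Two recognized suffixes that disagree.
    if a.suffix in SUFFIXES and b.suffix in SUFFIXES and a.suffix != b.suffix:
        return False
    # 3. Last names must begin with the same letter.
    if a.last[0] != b.last[0]:
        return False
    # 4. Last-name confidence, standing in for (A, L and P).
    if score_last(a.last, b.last) < threshold:
        return False
    # 5. First-name confidence, standing in for (A, N, L and P).
    if score_given(a.first, b.first) < threshold:
        return False
    # 6. Middle names are compared only when both are present.
    if a.middle and b.middle and score_given(a.middle, b.middle) < threshold:
        return False
    return True  # none of the six fatal flaws applied
```

Ordering the cheap checks (presence, suffix, first letter) ahead of the scored comparisons follows the order in which the specification lists them and avoids computing confidences for pairs that are rejected anyway.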

Judging Best Address

A building match with a valid secondary is chosen over a default building match. CMRA (Commercial Mail Receiving Agency) records complete with a PMB (private mailbox) number are preferred over CMRA addresses without a PMB.
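
One way to picture the two preferences above is as a small ranking function applied to a group of duplicate records. This is a sketch under stated assumptions: the Candidate class, its boolean fields, and the idea of adding the two preferences into one score are all hypothetical, since the specification does not say how the preferences are combined or how ties are broken.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    # Hypothetical flags; the real record layout is not specified.
    label: str
    is_building: bool = False
    has_valid_secondary: bool = False
    is_cmra: bool = False
    has_pmb: bool = False


def address_rank(c: Candidate) -> int:
    """Higher is better; each stated preference contributes one point."""
    rank = 0
    if c.is_building and c.has_valid_secondary:
        rank += 1  # valid secondary beats a default building match
    if c.is_cmra and c.has_pmb:
        rank += 1  # CMRA with PMB beats CMRA without
    return rank


def best_address(candidates: list[Candidate]) -> Candidate:
    # max() keeps the first candidate on ties, so list order breaks ties here.
    return max(candidates, key=address_rank)
```

In this sketch the winning record would go to "a_ddup.csv" and the remaining records in the group to "x_ddup.csv".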

User Interface

The de-duplication process is started from the “de-dup” item on the File menu. The item is enabled only if a List has been selected. If the presort filter detects that “a_ddup.csv” exists, a checkbox becomes visible that lets the user choose which source list to use for mail preparation, “a_list.csv” or “a_ddup.csv”.
