Ancestry.com Publishes Technical Report on data-driven technique for finding alternative name spellings.
In this article we discuss the problem of finding alternative name spelling, an important component of name matching (part of the record linkage field). We started this project primarily because of real issues that we encountered while working on name matching for various Ancestry.com projects: tree node de-duplication (people in family trees) and search query reformulation.
A person’s name, especially the family name or surname, is the key field used in identifying person’s records in databases. Software tools and applications designed for entity resolution of a person’s records usually rely on the person’s name as a primary identification field.
Databases of records including birth, marriage and death records; military service, immigration and even incarceration records often contain alternative spellings referring to the same person due to historically different ways of spelling, OCR errors, misspellings in records or transliteration. For instance a common last name “Shepard” has also often been spelled as “Shepherd”, “Sheppard”, “Shephard”, “Shepperd”, “Sheperd”. Also OCR errors typically manifest themselves in substituting ‘s’ for ‘a’ Braunhart – Brsunhart. Clearly, having methods that would provide users and the search engines with a list of highly credible alternative spellings of the query name would significantly improve the search results.
Knowing how to spell names to find records of interest has always been a part of a professional genealogist’s domain expertise. We strive to automate this expertise for anyone with an interest in family history.
Our primary goal is to produce a ranked quality list of alternative name spellings automatically that can be used in genealogy search query reformulation/expansion and assisting with error-correction in OCR process. Our secondary goals include comparison of existing name matching methods using unified framework.
A number of algorithms exist that could be used for addressing this problem. Which ones are better? The literature search has not resulted in a single/acceptable answer. Having access to Ancestry.com search log and tree node data offers us an opportunity to try data-driven methods which theoretically should be superior to non-data-driven methods.
Creating a Training Set for Alternative Name Spellings model
If we are to train a model with the goal of predicting if name 1 and name 2 are alternative spellings of each other, how can we construct the training set? The question is how to separate name pairs into positive/negative sets?
The trick is to use extracted from Ancestry data with machine translation (MT) methods instead of trying to separate name pairs into negative/positive classes.
Ancestry data can be mined to derive evaluation corpora.
One process involves using surname rewrites from search query reformulations and another process involves surname alternatives from record attachments.
We compare our trained model with existing methods commonly used in Record Linkage field.
Using the data produced by our mining process we trained machine translation method using Moses MT library. Then, we set up standard IR evaluation framework for comparing results generated by our data-driven method with two classes of methods for finding alternative name spelling: Phonetic algorithms and String similarity measures.
MT methods rely on the central MT equation. In our re-formulation this equation states that a list of K highest ranked alternative spellings of a source name are given by K maximum values of the product of a “name model” and an “alignment model”.
The most likely translation target maximizes the product of two terms, (1) the chance that someone would have a target name in the first place, and (2) if they do, the chance that someone else would spell it as source name.
The noisy channel works like this. We imagine that someone has a target name, but by the time it gets on to the printed page (or through OCR process) it is corrupted by the “noise” and transforms into source name. To recover the most likely target name, we reason about (1) what kinds of name are there, and (2) how target names get spelled as source names.
To compute a “translation model” we need to have an estimate of a “name model” which we can learn from the training data (corpora datasets that we collect from Ancestry.com data).
Usually, this training data is hard to find. Because of the difficulty associated with obtaining the experimental data many researchers build their own synthetic datasets by utilizing specialized tools for mining the web and extracting words that appear to be last names. The resulting names are used in forming artificial pairs using one or more similarity measures (typically based on edit distance). Another popular alternative is to hire human annotators who create last name pairs based on their knowledge of name alternative spelling. Both of these methods may introduce bias.
Given the way Ancestry.com users interact with the genealogy service, we isolated two separate ways of collecting parallel corpus data that would later be used for training translation models and for testing. We felt that having two completely different underlying processes for generating our datasets would strengthen our case if we arrived at similar conclusions.
The resulted training datasets had to undergo extensive filtering process due to noise associated with user-generated data. We had to devise a special filtering pipeline for treating the data and generating a high confidence training data set.
Phonetic methods provide facility for (phonetical) name encoding before they are used as blocking or sorting key values. These methods are commonly used in the indexing step of data matching and de-duplication to bring similar sounding string values, that are often assumed to refer to names, into same blocks. The records that contain these similar sounding names will then be compared in detail in the comparison step.
Phonetic methods can also be used as a similarity measure when computing alternative name spellings.
If phonetic code generated for name 1 equals to phonetic code generated by name 2 than two names can potentially be treated as alternative name spellings of each other.
String similarity methods, another important class of methods for finding alternative spellings, are sequential methods that base similarity score on some metric that measures transformation of one strings into another in terms of number of simple operations (deletions, insertions and substitutions). Levenshtein or edit distance is a popular example of this type of methods.
In the end, our machine translation method for finding ranked list of alternative last name spellings far-outperformed all other methods we tried!
NYSIIS phonetic method significantly outperformed other phonetic algorithms and the Phonex phonetic method did not perform as well on our data.
We believe that our findings can play an important role in boosting precision in Ancestry.com search and in building automated OCR verification tool.
For technical details please see our technical report.
About Jeffrey Sukharev
Jeffrey Sukharev, is a Data Scientist at Ancestry.com, working on Search Platform and other data-driven projects since July 2013. Prior to Ancestry.com and graduate school, he was a software engineer for more than a decade working on software development tools at Rational Software, IBM and on Content Management Server at Interwoven Inc. Sukharev is currently a PhD Candidate in Computer Science from the University of California, Davis. He holds a M.S. and B.S. in Computer Science from the University of California, Santa Cruz. Jeffrey’s interests include data visualization and machine learning.