Online Program

Thursday, May 30
Data Science Technologies
Practice and Applications
Data Science Applications E-Posters, I
Thu, May 30, 3:00 PM - 4:00 PM
Grand Ballroom Foyer

Comparing various string similarity algorithms in the task of name-matching (306269)


*Aleksandra Zaba, University of Utah 

Keywords: data science; string similarity algorithms; R; f-measure

Researchers commonly evaluate the performance of computer systems and software using measures such as recall, precision, and the resulting f-measure. This pilot study reports these measures for three groups of string similarity algorithms in the ‘stringdist’ package for R: the edit-based group (Levenshtein, Damerau-Levenshtein, Hamming, and longest common substring), the q-gram-based group (q-gram and cosine measures), and the heuristic group (Jaro and Jaro-Winkler). Recall and precision depend not only on the system but also on the task, because the counts of true positives, true negatives, false positives, and false negatives vary with how the task is designed, that is, with what is labeled ‘same’ vs. ‘different’. We compare the algorithms on the following name-matching task. Each algorithm assigns a similarity value to the pairing of a base word, a female first name (so far, n=100), with each of three variants: the same name, plus two of the following: its foreign version (categorized by us as ‘same’), its male version (‘different’), and a different, also female, version of the base name in American English (‘different’). How crucial it is to avoid false negatives and false positives varies by field and purpose. For the current project, we seek the highest f-measure for each algorithm. We report all f-measures and interpret them in the context of the given algorithm. For our data so far, a relatively low threshold separating ‘same’ from ‘different’ (applied to an algorithm’s value for a given similarity), namely 1 in the edit-based and q-gram groups and .05 and .10 in the heuristic group, yields the highest f-measure, a weighted harmonic mean of recall and precision. Our procedures and results promise to be of interest to various fields, including statistics (probabilistic record linkage), AI, information retrieval, DNA matching, and language acquisition.
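To make the evaluation concrete, the following is a minimal Python sketch of how a threshold on an edit distance could be scored with precision, recall, and the balanced f-measure. It is an illustration only, not the study's procedure: the study used the ‘stringdist’ package in R, whereas this sketch uses a hand-written Levenshtein distance. The name pairs, their labels, the function names, and the rule that a distance at or below the threshold counts as ‘same’ are all assumptions made for the example, not data or definitions from the study.

```python
from typing import List, Tuple

# Hypothetical labeled pairs (base name, variant, labeled 'same'?).
# Invented for illustration; these are NOT the study's data.
PAIRS: List[Tuple[str, str, bool]] = [
    ("maria", "maria", True),   # the same name
    ("maria", "marie", True),   # foreign version, labeled 'same'
    ("maria", "mario", False),  # male version, labeled 'different'
    ("maria", "mary", False),   # other female version, labeled 'different'
]


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def score(pairs: List[Tuple[str, str, bool]],
          threshold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f-measure) for one threshold.

    Assumption: a pair is predicted 'same' when its distance is at or
    below the threshold.
    """
    tp = fp = fn = 0
    for base, variant, is_same in pairs:
        predicted_same = levenshtein(base, variant) <= threshold
        if predicted_same and is_same:
            tp += 1
        elif predicted_same and not is_same:
            fp += 1
        elif not predicted_same and is_same:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    for t in (0, 1, 2):
        p, r, f = score(PAIRS, t)
        print(f"threshold={t}: precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

On this toy data the f-measure happens to peak at a threshold of 1, but that depends entirely on the invented pairs and says nothing about the study's results. Swapping in another distance (for example a q-gram or Jaro-based one) would only require replacing the `levenshtein` call and choosing thresholds on that measure's scale.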