Fuzzy matching entities in a custom entity dictionary

July 22, 2019

Named Entity Recognition in NLP

If you have any experience in natural language processing, you have most likely heard of Named Entity Recognition (NER). In short, it’s a range of statistical, rule and dictionary-based techniques used to search through unstructured text and classify words into pre-defined categories.

Example taken from the Spacy library’s official documentation

Python libraries such as NLTK and Spacy contain their own preset dictionaries that enable you to classify locations, brands, persons, numerical features, temporal features (dates, time, duration) and more. Yet, these libraries fall short for more specific tasks such as classifying a company’s products or service-related vocabulary. For such tasks, you will likely end up defining your own dictionary and either appending it to NLTK/Spacy’s NER-tagger or running it separately as a dictionary-based NER.

Fuzzy whut?!

Fuzzy string matching — also referred to as approximate string matching — is a group of techniques used to match strings or words approximately, rather than exactly. Normally, if we were to match words or sentences in a text to a dictionary, we would use a typical ‘in’ check to find exact matches.

In most practical applications, however, matching words only to their exact dictionary counterpart is not sufficient. In reality, people make typos and spelling mistakes. Hence, we’ll need a way to determine whether words or sentences are strongly similar, yet not identical. We’ll also need to know how different they are if we wish to express a certain level of confidence in our match. With this information we can also play around with confidence levels versus string/word length to optimize our match rates.

In this blog, we’ll be using the concept of edit distance to deal with this challenge.

From Oslo to Snow with Levenshtein

Although there are different edit distance algorithms, we’ll limit ourselves to probably the most popular one: Levenshtein Distance. Other edit distance methods include Damerau-Levenshtein distance, Longest Common Subsequence (LCS) and the Hamming distance. These methods differ from Levenshtein mostly in the edit operations that are allowed and the costs associated with them. Edit operations include: insertion, deletion, transposition and substitution. If you are integrating fuzzy matching in your application, be sure to check which algorithm works best for you.

Edit operations determine the steps required to turn word A into word B. The more steps required, the greater the distance and the stronger the dissimilarity between words A and B. Let’s look at the Levenshtein distance between the words ‘oslo’ and ‘snow’.

The max Levenshtein Distance is 4: max(len(‘oslo’), len(‘snow’)). The total cost of edit operations for this transition is 3, with each edit operation except copy costing 1.
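The calculation above can be sketched with the classic dynamic-programming implementation (a plain illustration, not the post’s original code):

```python
def levenshtein(a, b):
    """Levenshtein distance via dynamic programming.

    dp[i][j] holds the distance between the first i characters of `a`
    and the first j characters of `b`.
    """
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # copy is free, substitution costs 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # copy / substitution
    return dp[-1][-1]

print(levenshtein('oslo', 'snow'))  # 3
```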

Fuzzy matching dictionary words with various similarity levels

Now, let’s imagine we have defined our own dictionary and we want to extend the NLTK NER-tagger with our own custom dictionary. We have several options, e.g. training the tagger on our custom dictionary, creating a rule-based system or creating a match-based system.

For now, we will go for the match-based system to give us matches against a dictionary. We’ll use Seatgeek’s Fuzzywuzzy implementation to handle typos. First, we create a simple pandas DataFrame consisting of word lengths and the minimum similarity ratios (MSR) we would like to achieve.

As words get larger, the risk of typos increases and the chance that a different word was intended decreases. Thus, we loosen the criteria as word length increases. Note that these are arbitrary values.
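Such a table could look as follows — the exact values from the post’s original table aren’t shown, so the ones below are illustrative (only the length-5 ratio of 80 is implied by the results later on):

```python
import pandas as pd

# Minimum similarity ratio (MSR) per word length; arbitrary values
# that loosen as word length increases.
ratioData = pd.DataFrame({
    'wordLength': range(1, 11),
    'minRatio': [100, 100, 90, 85, 80, 80, 75, 75, 70, 70],
})
print(ratioData)
```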

getFuzzyRatio() is used to return a minimum similarity ratio. Based on the string length, it looks up the corresponding row in the DataFrame and returns that row’s minimum similarity ratio.

getFuzzySimilarity() uses Fuzzywuzzy to find fuzzy matches in a list of values. It then returns the match if its score reaches the minimum similarity ratio.
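The two helpers could be sketched as below. Fuzzywuzzy’s fuzz.ratio is SequenceMatcher-based, so for a self-contained, dependency-free sketch we compute the same 0–100 ratio with the standard library’s difflib; the function names match the post, but the bodies are a reconstruction:

```python
from difflib import SequenceMatcher

import pandas as pd

# MSR lookup table (illustrative values, loosening with word length).
ratioData = pd.DataFrame({
    'wordLength': range(1, 11),
    'minRatio': [100, 100, 90, 85, 80, 80, 75, 75, 70, 70],
})

def getFuzzyRatio(word, df=ratioData):
    """Look up the minimum similarity ratio for a word, based on its length."""
    length = min(len(word), int(df['wordLength'].max()))  # cap long words at the last row
    return int(df.loc[df['wordLength'] == length, 'minRatio'].iloc[0])

def getFuzzySimilarity(word, dictionary):
    """Return (match, score) if the best fuzzy match clears the MSR, else None."""
    best = max(dictionary, key=lambda entry: SequenceMatcher(None, word, entry).ratio())
    score = int(round(SequenceMatcher(None, word, best).ratio() * 100))
    if score >= getFuzzyRatio(word):
        return best, score
    return None

dictionary = ['insurance', 'house', 'liberty']
for word in ['insurence', 'houze', 'libertey']:
    result = getFuzzySimilarity(word, dictionary)
    if result:
        print("'{}' matches with {} with a score of: {}".format(word, *result))
```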

Let’s test our fuzzy NER-tagger with a couple of entities that contain a typo to see if it works. We run the script above and get:

‘insurence’ matches with insurance with a score of: 89
‘houze’ matches with house with a score of: 80
‘libertey’ matches with liberty with a score of: 93

Notice that ‘houze’ is also returned, as the MSR is set to 80 and the actual ratio is also 80. Let’s play around with different MSRs to raise the bar. Let’s throw in ‘minRatio’ : [100,100,100,100,100,85,85,85,85,80]. This time, only ‘insurence’ and ‘libertey’ are returned by the tagger, as 5-letter words now need to match completely to be recognized.

‘insurence’ matches with insurance with a score of: 89
‘libertey’ matches with liberty with a score of: 93

That’s it, folks. Start building your own fuzzy NER-tagger and play around with the minimum similarity ratios for different word lengths to find your optimal design.
