how to

Text Clustering

I got into a situation when I was left with a puzzler from which everyone has put their hands off. I confess that I did not have much experience with it and went for it to learn something new. And I enjoyed it in the end! Clustering text strings!

Jiri Vicherek

Written by
Jiri Vicherek

March 09, 2019

What was it all about?

One of our potential clients gets information from various places about how their products are being sold. He would like to merge these data together and add his own sales data to it. But it’s not that easy - every resource calls one identical product differently, more or less similarly.

Let’s have the “LEGO Friends 3061: City Park Cafe” product. This could look like this in the data:

LEGO Friends - formatting, packaging, distribution, region, promotions …

All of the labels, accessories, distribution channels, etc. can be misspelt, a single supplier may combine data in any way, + you have hundreds of LEGO products and Chinese imitations.

What was my goal?

“Somehow” simplify the data and pair them to each other to create a “pairing list” that serves as a connecting element of the information from different vendors. Simply said, to be able to identify sales reported as “Lego Friends 3061: City Park Cafe” and “lego friends / 3061 / City-Park Cafe” as one and the same product. There are hundreds of LEGO products in clients’ data, so this is an example of how a linking element for LEGO Friends: City Park Cafe can look like - the input data are merged into blue columns and the aggregation of sales numbers is done according to the green columns:

How did I want to do it?

I thought about seven possibilities.

  1. go through it manually and make a list as a trained monkey would do
  2. try to let train the entity detection in the text and then get the green lines from them
  3. Try it somehow with
  4. apply text token (n-gram)
  5. apply phonetic algorithms
  6. use the Levenshtein distance
  7. and finally implement the principle of DNA sequencing - Kolmogorov complexity

Point One (1) looked the best, but it would take a long time, my error rate would change (and thus the quality of the list at the time of production), it would not be possible to correct it from the beginning if I discovered in the middle of my attempt that some products were connected wrongly. And finally: this would be difficult to extend by other data sources.

I did not want to pull in boys from (2) because I did not have the cash to pay them.

Trifacta (3) seems to me more like a (slow) clicking generator of very erroneous rules.

The remaining points (4, 5, 6 and 7) I decided to try at OpenRefine, the successor to GoogleRefine. For production deployment, you can choose to manage the instance from, about which I wrote here.

Text Clustering

If I get a fingerprint from each product name that will be unique to the product, I have the chance that I will always have different texts of the same meaning for a single fingerprint.


The n-gram algorithm is a so-called tokenization algorithm that first simplifies the text, turns everything into lowercase, removes punctuation, spaces and characters such as dashes, commas, semicolons, and so on.

Breaks the text to “n-grams” - if I do n-gram = 2, the text will be broken down by two characters: Prague = pr, ra, ah, ha

These n-character combinations are sorted and duplicates are thrown out it connects together; from Praha comes out “ahhaprra”, Kokotín will come out “inkookotti” (if I do 1-gram, Praha will be “ahpr” and Kokotín “iknot”)

This is great for real-life typing. Not so much in nonsense, abbreviations, etc., because “abc” and “bca” have the same 2-gram (ab, bc) and it’s different. But Wikipedia says it has been empirically found that if two real texts have the same “vector” (n-gram field: {“ab”, “bc”}), then it is very likely the same text. I definitely wanted to try this!

Phonetic algorithm

OpenRefine has an implemented Metaphone3 algorithm - it makes a fingerprint based on how it sounds in English. Has a ton of rules:

and for my use-case the results were pretty accurate!

Levenshtein’s distance

Sometimes it is called “edit distance” - it’s super easy: for every two strings of text it assigns a number that is equal to the number of change operations you have to make to get the second from the first text.


Coffee Beer = 4Coarse Course = 1Lego Friends 3061: City Park Cafe Lego-Friends 3061: City Park Cafe = 1

The more distance you tolerate, the greater the chance of replacing two different texts. Small distance “guarantees” the correction of typing, misspelling, etc.

Kolmogorov complexity

It is implemented as PPM in OpenRefine. When I found out, that it was used to search for the similarities in the DNA-derived strings I could not go to sleep until I tried it!

The idea of ​​this algorithm is based on idea thought that text compression “somehow” (mathematically) works with the information value of compressed text. If I join A + B one after another and compress it, the resulting size (A + B) will de facto be the same as the compression itself (A).


I took the previous paragraph as document A and document B, saved it once in document “docA.txt” and then I saved it twice (A + B) into document “docAB.txt”. I have compressed doc with (239b) and docAB (249) with a gzip - the difference is 10 characters, while the input (A + B) document is 271 characters longer. The difference in compression of two identical documents compared to compression of one (identical) document thus makes 3.6%. If A was the first paragraph of the description of this algorithm and the second paragraph was B, the difference in compression A and compression A + B would be 53.1%.

Therefore, the PPM algorithm in OpenRefine takes two different texts and applies compression in different combinations, from which it always gets the size. It deducts the coefficient of difference from it according to this pattern:

In my example of identical A and B the pattern would look like this:

Finally, if A is the first paragraph of this description and B is the second paragraph, the pattern will be as follows:

It is good to note that this approach has better results in detecting similarities if the text is longer. You can take notes from the Scout Chronicle and sort them by close similarity to find that they have similar stories, although this is of course always a completely different text!

Lets jump into practice!

I will not try to describe installation and work with OpenRefine - I’ll leave it for the official screencasts.

As soon as my data has been uploaded, I started by duplicating the product column. The goal is to change a copy of the values so that at the end I will get a pair “original text” <> “text”:

Then I chose “Face Faceting” on the new column and pressed the “Cluster” button in the left box:

A dialog box appears, where you can select the method (tokenization / similarity) and the function (n-gram / metaphone / …) and it is possible to change the parameters. OpenRefine applies the calculation for you and displays suggestions of which rows from our column belong together. It is good to understand algorithms and play with them. You do not have to be a mathematician who will implement it in C ++, but it is good to know that for Levenshtein’s distance you should not give Radius=20 because it will replace “Pork Feast” for “Pheasant Race”.

If the estimated Cluster is OK to you, check “Merge?” and name the values for that cluster. Whenever you go to “Merge Selected & Re-Cluster”, what you set up is applied and everything is calculated again. I first made clusters using sensitive tokenization algorithms (n-gram = 1) and then went to PPM / Metaphone / Levenshtein (radius>1) … When there is nothing left to cluster (or it just does not progress further), close the window and the result, in my simple LEGO example, is just like a fairy tale:

Imagine that there are thousands of rows in the left column of dozens of products - then in the right column we have the target of having dozens of clusters. The use is obvious: the sales data from many sources will be joined here (the key is the source column + the original product column), and the values from the OpenRefine column link everything that belongs to one product, regardless of how messy it is defined in the data source.

Findings of my struggle

The quality of my data is really bad. It is good to try and clean them a little bit for better clustering. I used UPPER () in SQL to convert all the characters into uppercase letters, remove the explicit nonsense (region information, type of packaging, etc.) and deduplicated the rows - from tens of thousands of items I had about 1,500 at one time. In my situation, I occasionally needed to delete the text as “ABC”, but only at the end of the line (i.e. leave “ABC Special”, but “ABC Special ABC” at the end to throw away). This is a great fit for regular expressions. Do not be scared by it! In the DB there is a REGEXP_REPLACE that would remove “DHL” and “PPL” from the end of the line as follows:

The `$` character says it matches only if it’s at the end of the line (so I usually have the still TRIMOUT the column so I do not have a gap at the end and do not have the regexp pattern ‘DHL $ | PPL $’)

and the pipe (“|”) separates the field of values I am looking for (so I do not have to call REPLACE as many times as I want to change.) If it matches, it will replace it with nothing (‘’). I have quite a number of such rules, it simplifies my work in OpenRefine - less data, more accuracy. If you put “^” in front of a string instead of ‘$’, it means that it only searches at the beginning of the line. The most of the rules that you may need are easy to find on Google. Pay attention to regular expressions, it’s a very strong thing!

Well, the best comes at the end

In OpenRefine, each action generates a rule. You can view these by switching at the upper left of the logo under the “Undo / Redo” menu:

The whole definition of Clustering is interactive - you do not want the machine to decide for you in this case - but once you define it, you can easily repeat it. To do this, just export commands / rules (definitions in json):

If you copy it (my cfg from this blog post is here), you can take the original data as it looks when it’s loaded into OpenRefine and apply these rules directly to it - it’s not necessary to click it all again. This leads to full automation (the final question of the FAQ), which is a bit of a hack. Does anyone have experience with it?

I hope that it will kick you a little bit into experimenting with OpenRefine, which is an extra powerful tool and it pays to learn to use it - at least the basics. You can find videos such as calling an external API to enrich your cell values etc., on the Internet,

Josef Šlerka, before he came to the R Shiny apps, I thought loved it - there would surely be a super source of inspiration from him! At least the link to a CheatSheet.

Update 1

Josef added a comment to my FB post with a link to his presentation, where this topic goes deeper into detail #super!

Update 2

I did quite a lot of careful work with my preparation before joining OpenRefine. It did not stand out, but the cleanup is crucial. Here’s an example of how I tear the text according to a rule that resolves logic in the text.

I have these ten sample lines:

And I’d like to take everything to the left by product code. But it is not given where exactly the product code is. In my case I have defined it as a number containing more than 3 digits.

Using this:

line 3-6 is a function which has two parameters: column and regexp patern – it returns matched pattern’s position

I’ll get:

If I want to find a date, which is what Ondřej Tučný asks here, I will only make another definition of a regular expression (source for reading).

fce in lines 7-13 takes the column (row 8) as one parameter and the second parameter position (l.9-12). Since the position starts at the point where the matched pattern is, I deduct one from the position (I.12) so that I do not have the first character of my number in the LEFT () product

If I take everything to the left from this position, using LEFT (), I get this:

On the first line one can calculate that “(“ is at position 14, but REGEXP_INSTR returns 15 - therefore the “-1” on line 12 in SQL. but of course I’m going to write a Regex matching number and decimals rule. And I have placed inside of the function “replace” which deletes my dot:

And finally, I dropped all the characters I did not want have, did it with the UPPER case letters and removed any leading / trailing spaces using TRIM ():

line 11-20 is what’s visible so far (because I’m not doing it anymore, so I dropped a list of matched positions); on line 4-8 I will replace all characters that are not from the alphabet or spaces ( or the word “edition” with a space - I will delete the mess; on row 3, I will make the result with upper case letters and cut the spaces between them with TRIM

This kind of code will return exactly what I want:

You can see that I can play with it, iteratively tune the regexp patterns to follow the logic of the text and only when I have nice product names in my own column, I will send it to OpenRefine.

If there was not a “log” of sales, but perhaps a review from Amazon where people would write about how much not fun it is to build a LEGO 3061, I would go the path of my own entity Detection Model - which I’m going to do for another client - so I’ll see how it turns out :)

More how to Articles