Removing ‘duplicate’ results

May 14, 2019

Let’s say you are doing entity resolution, and as part of that process you’re running a string-similarity measure (e.g. Levenshtein, Sørensen-Dice) to work out which nodes refer to the same thing. Our data set could be the following:

Tomatoes <id:1>
Tomato <id:2>
Cheddar <id:3>
Cheddar <id:4>

where <id:x> is the node’s internal ID. Let’s say we wanted to run the following query, which will compare the two names for similarity:

// Compare every pair of distinct Ingredient nodes by name
MATCH (i1:Ingredient), (i2:Ingredient)
WHERE
     i1 <> i2
     AND apoc.text.sorensenDiceSimilarity(i1.name, i2.name) > 0.8
RETURN i1.name, ID(i1), i2.name, ID(i2)

We’d get an output looking something like this:

Tomato, 2, Tomatoes, 1
Tomatoes, 1, Tomato, 2
Cheddar, 3, Cheddar, 4
Cheddar, 4, Cheddar, 3

Whilst strictly speaking these aren’t duplicate results, each pair appears twice, once in each direction. We only really want to see one row per pair, which we’d then use in some follow-on process. If we slightly adjust the query as follows:

MATCH (i1:Ingredient), (i2:Ingredient)
WHERE
     i1 <> i2
     AND apoc.text.sorensenDiceSimilarity(i1.name, i2.name) > 0.8
     // Keep only the direction where the lower internal ID comes first
     AND ID(i1) < ID(i2)
RETURN i1.name, ID(i1), i2.name, ID(i2)

we will get the following:

Tomatoes, 1, Tomato, 2
Cheddar, 3, Cheddar, 4
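
The trick only needs some consistent ordering between the two nodes, so it doesn’t have to be ID() specifically. As a minimal sketch, assuming Neo4j 5 or later (where id() is deprecated in favour of elementId(), which returns a string), the same filter could be written like this. The strict inequality also makes the separate i1 <> i2 check unnecessary:

```cypher
// Sketch only: assumes Neo4j 5+ with APOC installed.
// elementId() returns a string, so this is a string comparison;
// the order itself is arbitrary, it just has to be consistent.
MATCH (i1:Ingredient), (i2:Ingredient)
WHERE
     elementId(i1) < elementId(i2)
     AND apoc.text.sorensenDiceSimilarity(i1.name, i2.name) > 0.8
RETURN i1.name, elementId(i1), i2.name, elementId(i2)
```

Which node ends up on the left of each row may differ from the ID()-based version, but each pair should still appear only once.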

Now we can think about linking these similar entities together!
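
One possible way to do that linking is sketched below. The SIMILAR_TO relationship type and its score property are names made up for this illustration, not something from the data set above, so swap in whatever fits your model:

```cypher
// Sketch only: SIMILAR_TO and r.score are hypothetical names.
MATCH (i1:Ingredient), (i2:Ingredient)
WHERE ID(i1) < ID(i2)
// Compute the similarity once so it can be both filtered on and stored
WITH i1, i2, apoc.text.sorensenDiceSimilarity(i1.name, i2.name) AS score
WHERE score > 0.8
// MERGE avoids creating a second relationship if the query is re-run
MERGE (i1)-[r:SIMILAR_TO]->(i2)
SET r.score = score
RETURN i1.name, i2.name, r.score
```

Because each pair is only matched in one direction, we end up with a single relationship per pair rather than one each way.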

Written by Ljubica Lazarevic