Removing ‘duplicate’ results
Let’s say you are doing entity resolution, and as part of that process you are comparing node names with a string-similarity measure (e.g. Levenshtein or Sørensen-Dice) to work out which nodes are really unique. Our data set could be the following:
Tomatoes <id:1>
Tomato <id:2>
Cheddar <id:3>
Cheddar <id:4>
where <id:x> is the node’s internal ID. Let’s say we want to run the following query, which compares the names of every pair of ingredients for similarity:
MATCH (i1:Ingredient), (i2:Ingredient)
WHERE
// don't compare a node with itself
i1 <> i2
AND apoc.text.sorensenDiceSimilarity(i1.name, i2.name) > 0.8
RETURN i1.name, ID(i1), i2.name, ID(i2)
We’d get an output looking something like this:
Tomato, 2, Tomatoes, 1
Tomatoes, 1, Tomato, 2
Cheddar, 3, Cheddar, 4
Cheddar, 4, Cheddar, 3
Whilst strictly we don’t have any duplicate results, every pair appears twice, once in each direction. We only really want to see each pair once, so that we can then run some follow-on process on it. If we adjust the query slightly as follows:
MATCH (i1:Ingredient), (i2:Ingredient)
WHERE
i1 <> i2
AND apoc.text.sorensenDiceSimilarity(i1.name, i2.name) > 0.8
// keep only one direction of each pair
AND ID(i1) < ID(i2)
RETURN i1.name, ID(i1), i2.name, ID(i2)
the ID comparison can only hold for one ordering of any given pair, so each pair now appears just once:
Tomatoes, 1, Tomato, 2
Cheddar, 3, Cheddar, 4
Now we can think about linking these similar entities together!
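As a rough sketch of what that linking step might look like, the query below writes a relationship between each similar pair. The relationship type SIMILAR_TO and the score property are hypothetical names chosen for illustration and are not part of the original data model. The 0.8 threshold is simply carried over from the queries above.

```cypher
// Sketch only: SIMILAR_TO and r.score are assumed names, adjust to your model.
MATCH (i1:Ingredient), (i2:Ingredient)
// One direction per pair, which also rules out comparing a node with itself
WHERE ID(i1) < ID(i2)
WITH i1, i2, apoc.text.sorensenDiceSimilarity(i1.name, i2.name) AS score
WHERE score > 0.8
// MERGE rather than CREATE, so re-running the query does not add a second relationship
MERGE (i1)-[r:SIMILAR_TO]->(i2)
SET r.score = score
RETURN i1.name, i2.name, r.score
```

Because each pair is matched in one direction only, at most one SIMILAR_TO relationship is written per pair. When querying it later, you would probably want to match it without a direction, e.g. (a)-[:SIMILAR_TO]-(b).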