Let’s model and l̵o̵a̵d̵ refactor some data!

Ljubica Lazarevic
3 min readFeb 22, 2021

--

The write-up of the second part of our modelling and data loading revisit, we explore how asking new questions changes the data model.

Mark Rabe on Unsplash

Overview

Here’s another super quick write-up of the second session we did on Twitch. Again, it’s a leisurely step through for those of you who want to visit this at your own pace. If you want a reminder of the first, part, check out the write up here.

Here’s the second stream, if you want to re-watch it for posterity (and observe how I’ve moved around in my study yet again!):

And here are all the links:

Data:

The Arrows modelling tool

The Sandbox (we used a blank one)

And of course, if you have any questions, come join our friendly forum

Modelling

We discussed what were the new questions we wanted to ask, and as a result why it made sense to update our data model to answer these. We then revised our data model in Arrows. If you’d like to see the one we drew during the session, click here.

Loading the data

We used a blank sandbox for loading the data. A sandbox lives for three days, extendible to a maximum 10 days. As the sandbox I was using had expired, I created a new one. To load all of the previous data, you can use the queries from the previous blog post.

First of all, we need to create two new indexes:

CREATE INDEX ON :County(name);
CREATE INDEX ON :Address(streetName, zip);

Now for the new nodes. In the video I used LOAD CSV again to populate the data, rather than refactor the physical data. Why? Because we never loaded StreetNumber, Zip or County in the first place :).

We discovered in the CSV that Street Address didn’t always have a building number, where as Location did, so here’s the updated query for creating the Address node:

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/misc/main/Retail_Food_Stores.csv' AS row
MATCH (e:Entity {licenseNumber:trim(row.`License Number`)})
MERGE (a:Address {streetName:coalesce(trim(row.Location),trim(row.`Street Name`)), zip:trim(row.`Zip Code`)})
ON CREATE SET a.city = trim(row.City)
MERGE (e)-[:HAS_ADDRESS]->(a)

And then to create the County node:

LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/lju-lazarevic/misc/main/Retail_Food_Stores.csv' AS row
MATCH (a:Address {streetName:coalesce(trim(row.Location),trim(row.`Street Name`)), zip:trim(row.`Zip Code`)})
MERGE (count:County {name:trim(row.County)})
MERGE (a)-[:HAS_COUNTY]->(count)

Querying the data

We have refactored our data model because the question we wanted to ask (how many businesses are at the same address) means it makes sense to extract out an Address node. We could have asked the same question with the first model, but it’s not playing to our graphy advantage.

This query finds out the top 10 addresses with the most businesses associated with it, most likely because they’re all in the same shopping mall or business complex:

MATCH (a:Address)<-[r:HAS_ADDRESS]-(e2:Entity)
WITH a, collect(e2) as entities
RETURN a.streetName, a.city, a.zip, size(entities) as count order by count desc limit 10

What next?

We’re going to leave this example here for now, although I think it would be good to revisit this in the future for the following options:

  • Cleaning up the data (we saw some fun stuff towards the end of the session)
  • Refactoring the data without a load

Watch this space! And of course, you could try the above ahead of time :)

--

--

Ljubica Lazarevic
Ljubica Lazarevic

Written by Ljubica Lazarevic

Technologist — data geek — solver of problems

No responses yet