Playing with Open Data
With a little time on my hands (in between what I should be doing, looking for a job), I thought it might be fun to have a quick look at the data made available by the Brisbane City Council as part of their “Open Data” initiative, to make sure I haven’t forgotten everything I learned about SQL and providing an opportunity for some more Python fun.
The short summary is the data was somewhat inconsistent and incomplete (particularly the Title and Author fields are unreliable). There were still some interesting correlations and visualisations to be made. I’ll dig out what I can.
I will continue to look at this data and dig out what I can. A lot of this page is more in depth than most people will bear. I don’t blame you if you scan through the page and look at the pictures, but for those who wish to know why I drew the conclusions I did, the information is here too.
If you wish to follow along, I’ve made SQL dumps of the two tables I use checkouts and library
After sniffing around the website the most interesting data from my point of view seemed to be the Library data. Available in simple csv files and detailing all checkouts made during three days each month for the last 12 months, it seemed like the most interesting chance for data mining.
I wrote a little Python script to parse the csv and drop it into MySQL (I could have imported it one of a number of existing ways, but as I found later some transformation of the data needed to be done anyway).
The first thing I noticed was that the date wasn’t in a MySQL-able form, but that wasn’t a big deal (YYYYMMDDHHMM becomes YYYYMMDDHHMMSS with the addition of two 0’s to the date string.) And shortly after I had all 11 csv files available pushed into my new MySQL table.
So, let’s look at the data. There are 608,600 checkouts.
What’s the breakdown of checkouts during these days? Note: it’s hard to infer much from this because of the limitations of the data: the BCC libraries only make 3 days worth of data available per month, the 7/8/9 of every month.
I suppose I could check if I have a reasonably even distribution among the days of the week…
So, at least we have a reasonable scattering of days, though the data isn’t really that complete.
Where do the most checkouts happen, and what’s their distance from the centre of Brisbane? (This one required adding the lattitude and longitude for each library alongside the library which came as another set of data that I join into several of the queries.)
Not too many issues there, plenty of the libraries with the most checkouts are further out from the centre. I dropped the distance into the ‘library’ table to make it easier to run queries with later.
(Edit: After using CartoDB to improve a Bubble Map later in the post, I found a far better visualisation for this data:)
Try another tack, how many DVD types are there?
+-----------+ | item_type | +-----------+ | DVD | | BSQ-DVD | | LOTE-DVD | | MBG-DVD | | DVD_R18+ | | DVD_MA15+ | +-----------+
Sure, ok. A little annoying there’s not obviously a standard “I’m a DVD or type of DVD” type string so I need to be a bit inefficient in my queries, but let’s keep moving. Also, it looks suspiciously like “BSQ-DVD” is a “Brisbane Square Library DVD” but a quick check says that though most (2523) of the BSQ-DVD’s are checked out from BSQ, many (487) aren’t, so I’ll give the benefit of the doubt.
+----------+ | COUNT(*) | +----------+ | 90897 | +----------+
Anything interesting about where those DVD’s are rented? I need another temporary table for this, and I’ll dump out the data to use in my graph.
Nothing too exciting. Presumably the libraries that have better DVD sections see more rentals, though it could be related to geography, demographic.
The average(mean) of dvd checkouts over the whole dataset is about 15% (14.94%).
Looking at the dataset I wonder if language might be interesting to look at. Here I find that the language field isn’t populated for English. It appears that most checkouts have a blank, or ‘UNKNOWN’ language type, but looking at the top 10 of what’s left:
Chinese is obviously the standout there. Is that related to library?
SELECT library.name, count(checkouts.language)
So language is an informative target. Several languages apply to only a few libraries as notable outliers.
Try a different tack. Who are the most popular authors?
No-data rears its head again. Incomplete data makes me sad, but it’s still interesting for what is there. Who are these people?
+--------------------------------------------------------------------------------------------------------------------+-------------------+ | title | author | +--------------------------------------------------------------------------------------------------------------------+-------------------+ | Play it again, Mozart! / Geronimo Stilton | Stilton, Geronimo | | The secret of the sphinx / by Geronimo Stilton, illustrated by Gianluigi Fungo ; translation by Nanette McGuinness | Stilton, Geronimo | | Mice to the rescue! / Geronimo Stilton | Stilton, Geronimo | | The search for sunken treasure / Geronimo Stilton | Stilton, Geronimo | | Mouse in space! / Geronimo Stilton | Stilton, Geronimo | | The volcano of fire : the fifth adventure in the Kingdom of Fantasy / Geronimo Stilton | Stilton, Geronimo | | The search for treasure : the sixth adventure in the Kingdom of Fantasy / Geronimo Stilton | Stilton, Geronimo | | The journey through time / Geronimo Stilton | Stilton, Geronimo | | The Hunt for the curious cheese / Geronimo Stilton | Stilton, Geronimo | | Attack of the bandit cats / Geronimo Stilton | Stilton, Geronimo | +--------------------------------------------------------------------------------------------------------------------+-------------------+
Looks like a Children’s book author. Similar results for Daisy Meadows suggest the same thing. Probably children’s books are the kinds of things often borrowed and the authors are prolific. That’s just a guess.
Can I double check that? Looking at the item_types for those two authors…
+------------+ | item_type | +------------+ | JU-PBK | | JU-FICTION | +------------+
+------------+ | item_type | +------------+ | JU-FICTION | | JU-PBK | | LOTE-BOOK | +------------+
JU looks to be juvenile. Note that by this point I’m trying to use generally inefficient queries (like %JU%) just because my queries are executing in under a second and I’m finding the data a bit inconsistent in general (multiple special cases)
+------------+ | item_type | +------------+ | JU-PBK | | JU-FICTION | | JU-MAGS | +------------+
PBK assumedly paperback, mags assumedly magazines, why does fiction fit as the third type? This data has a lot of strange decisions like that.
What’s the total number of juvenile asset checkouts in my dataset?
+-----------------+ | COUNT(title_id) | +-----------------+ | 93147 | +-----------------+
93000 is only about 1/6 of total checkouts I have(600,000).
10 most popular authors in the juvenile section…
+------------------------+-----------------+ | author | COUNT(title_id) | +------------------------+-----------------+ | Stilton, Geronimo | 9507 | | Meadows, Daisy | 5308 | | Blade, Adam | 1887 | | Griffiths, Andy, 1961- | 1540 | | Larry, H. I | 1528 | | Stilton, Thea | 1408 | | Kinney, Jeff | 1392 | | Rippin, Sally | 1189 | | Dahl, Roald, 1916- | 912 | | Goscinny, 1926-1977 | 910 | +------------------------+-----------------+
I really don’t like that some of these authors have birth years and missing death years. What if someone dies, is all my data useless? Presumably the source database for the csv’s I get is somewhat normalised but chunking the date in with the year like this is pretty annoying.
Is Roald Dahl named otherwise as well? …
+--------------------+ | author | +--------------------+ | Dahl, Roald, 1916- | | Dahl, Roald | | Roald Dahl | +--------------------+
Somewhat annoying. Oh well, something to keep in mind.
+-----------------+ | COUNT(title_id) | +-----------------+ | 1071 | +-----------------+
So, just for Roald Dahl, I’m missing (1071 - 912) checkouts if I don’t grab ALL his possible names (hoping he doesn’t appear as R. Dahl as well) and not just the most common one (which inexplicably has the birth date munged in. Data is grand!)
I wonder if I look for the most checkout author+title combination the big children’s authors will still stand out?
+-------------------------------------------------------------------------------------------+------------------------------------------+--------------+ | title | author | COUNT(title) | +-------------------------------------------------------------------------------------------+------------------------------------------+--------------+ | The National geographic magazine | National Geographic Society (U.S.) | 665 | | Lonely Planet Traveller | BBC Magazines (Firm) | 432 | | Choice : independent information for smart consumers | Australasian Consumers' Association | 351 | | Australian geographic | Smith, Dick, 1944- | 330 | | Personal : a Jack Reacher novel / Lee Child | Child, Lee | 302 | | History today | Quennell, Peter, 1905- | 296 | | Jamie magazine | Oliver, Jamie | 290 | | Australian photography + digital | Australian Photographic Society | 288 | | Houses | Royal Australian Institute of Architects | 288 | | The enchanted charms : the seventh adventure in the Kingdom of Fantasy / Geronimo Stilton | Stilton, Geronimo | 276 | | The burning room / Michael Connelly | Connelly, Michael, 1956- | 269 | | Britain | British Tourist Authority | 261 | | Memory man / David Baldacci | Baldacci, David | 231 | | The girl on the train / Paula Hawkins | Hawkins, Paula | 226 | | The treasure of Easter Island / Geronimo Stilton | Stilton, Geronimo | 214 | | Surfing for secrets / Geronimo Stilton | Stilton, Geronimo | 212 | | Your garden (Clayton, Vic.) | Southdown Press | 211 | | The 39-storey treehouse / Andy Griffiths ; illustrated by Terry Denton | Griffiths, Andy, 1961- | 209 | | The Stranger / Harlan Coben | Coben, Harlan, 1962- | 197 | | Make me / Lee Child | Child, Lee | 197 | +-------------------------------------------------------------------------------------------+------------------------------------------+--------------+
This is interesting. Some of the title/author combinations are periodicals or series (Nat Geo) but 300 checkouts for Lee Child’s “Personal”?
+----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+ | title_id | title | author | call_number | item_id | item_type | status | language | age | library | date | +----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+ | 4940 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669453 | AD-FICTION | CHECKEDOUT | | ADULT | IPY | 2015-06-07 14:22:00 | | 12246 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095560066 | AD-PBK | CHECKEDOUT | | ADULT | BRR | 2015-06-09 10:10:00 | | 12935 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669503 | AD-FICTION | CHECKEDOUT | | ADULT | HAM | 2015-06-09 10:19:00 | | 14645 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095559217 | AD-PBK | CHECKEDOUT | | ADULT | TWG | 2015-06-09 10:47:00 | | 15084 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669859 | AD-FICTION | CHECKEDOUT | | ADULT | FAI | 2015-06-09 10:54:00 | | 15565 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095670188 | AD-FICTION | CHECKEDOUT | | ADULT | INA | 2015-06-09 11:02:00 | | 19125 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669693 | AD-FICTION | CHECKEDOUT | | ADULT | IPY | 2015-06-09 11:59:00 | | 19602 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095560330 | AD-PBK | CHECKEDOUT | | ADULT | CDA | 2015-06-09 12:07:00 | | 21955 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095559225 | AD-PBK | CHECKEDOUT | | ADULT | CNL | 2015-06-09 12:54:00 | | 24542 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095559142 | AD-PBK | CHECKEDOUT | | ADULT | KEN | 2015-06-09 13:43:00 | | 27262 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669255 | AD-FICTION | CHECKEDOUT | | ADULT | ASH | 2015-06-09 14:36:00 | | 29119 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000040032021 | AD-FICTION | CHECKEDOUT | UNKNOWN | ADULT | INA | 2015-06-09 15:13:00 | | 29137 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | MTG | 2015-06-09 15:13:00 | | 29937 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095559571 | AD-PBK | CHECKEDOUT | | ADULT | INA | 2015-06-09 15:28:00 | | 30339 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095670220 | AD-FICTION | CHECKEDOUT | | ADULT | CNL | 2015-06-09 15:33:00 | | 30517 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095561114 | AD-PBK | CHECKEDOUT | | ADULT | WYN | 2015-06-09 15:36:00 | | 30582 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-PBK CHI | 34000095559860 | AD-PBK | CHECKEDOUT | | ADULT | CNL | 2015-06-09 15:37:00 | | 34828 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095670410 | AD-FICTION | CHECKEDOUT | | ADULT | BUL | 2015-06-09 16:42:00 | | 35870 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095557641 | AD-FICTION | CHECKEDOUT | UNKNOWN | ADULT | CDE | 2015-06-09 17:10:00 | | 36831 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095670436 | AD-FICTION | CHECKEDOUT | | ADULT | SBK | 2015-06-09 17:47:00 | +----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+
This shows what a mess this data can be. Two different item_types (ADULT FICTION, ADULT PAPERBACK I’m assuming). Maybe they’re hard and soft cover versions. No real clue as to call_number’s significance vs item_type. Two different language types. The item_id is obviously a unique identifier though, so that might lead somewhere interesting.
How many copies of this book are there?
+--------------+ | COUNT(items) | +--------------+ | 132 | +--------------+
Wow, that’s a lot, no wonder it’s up there.
+--------------+ | COUNT(items) | +--------------+ | 189 | +--------------+
Wait, what? There are more records if I try to find distinct records based on book + library. How is this possible? Perhaps the same book can be checked out of different libraries. A quick check of the BCC library site says that you can request a book from another library. Assumedly that’s how this happens.
I confirmed it:
And found many books checked out from more than one library.
I double-check my SQL is right by identifying the records for the top item_id returned in the previous query:
+----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+ | title_id | title | author | call_number | item_id | item_type | status | language | age | library | date | +----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+ | 29137 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | MTG | 2015-06-09 15:13:00 | | 159857 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | GCY | 2015-07-12 08:52:00 | | 250605 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | GCY | 2015-07-12 08:52:00 | | 311702 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | BUL | 2015-09-09 19:12:00 | | 342708 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | BUL | 2015-10-09 14:15:00 | | 441798 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | CRA | 2016-01-08 17:16:00 | | 569824 | Personal : a Jack Reacher novel / Lee Child | Child, Lee | AD-FICTION CHI | 34000095669297 | AD-FICTION | CHECKEDOUT | | ADULT | WND | 2016-04-07 18:43:00 | +----------+---------------------------------------------+------------+----------------+----------------+------------+------------+----------+-------+---------+---------------------+
It confirms what I suspected, but raises more questions: there’s a duplicate record (same book, time, library). Is that my fault during import?
Running grep on the files I have from the open data website tells me it’s not my fault. curl on the source tells me the same.
Library-Checkouts-all-Branches-July-2015.csv:121778:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-FICTION CHI,34000095669297,AD-FICTION,CHECKEDOUT,,ADULT,GCY,201507120852 Library-Checkouts-all-Branches-July-2015.csv:212526:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-FICTION CHI,34000095669297,AD-FICTION,CHECKEDOUT,,ADULT,GCY,201507120852
While I was looking at the csv files again, I noticed another issue in a grep through the files:
Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095560330,AD-PBK,CHECKEDOUT,,ADULT,CDA,201506091207 Library-Checkouts-all-Branches-June-2015.csv:FASTBACK - Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095557765,FASTBACK,CHECKEDOUT,,ADULT,BSQ,201506091223 Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095559225,AD-PBK,CHECKEDOUT,,ADULT,CNL,201506091254 Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095559142,AD-PBK,CHECKEDOUT,,ADULT,KEN,201506091343 Library-Checkouts-all-Branches-June-2015.csv:FASTBACK - Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095557906,FASTBACK,CHECKEDOUT,,ADULT,BNO,201506091425 Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-FICTION CHI,34000095669255,AD-FICTION,CHECKEDOUT,,ADULT,ASH,201506091436 Library-Checkouts-all-Branches-June-2015.csv:FASTBACK - Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-PBK CHI,34000095558490,FASTBACK,CHECKEDOUT,,ADULT,ASH,201506091444 Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-FICTION CHI,34000040032021,AD-FICTION,CHECKEDOUT,UNKNOWN,ADULT,INA,201506091513 Library-Checkouts-all-Branches-June-2015.csv:Personal : a Jack Reacher novel / Lee Child,"Child, Lee",AD-FICTION CHI,34000095669297,AD-FICTION,CHECKEDOUT,,ADULT,MTG,201506091513
What’s this FASTBACK nonsense in the title field?
+----------+ | COUNT(*) | +----------+ | 6495 | +----------+
It gets better. Tell me that they don’t dump FASTBACK into the title field just because the book needs to come back quickly…
Looking at the BCC library’s page I find..
“A FastBack book loan is for seven days and cannot be renewed. This shorter loan period allows a high turnover of popular material. If you don’t think you can return the book within seven days, you can borrow or place a hold on other copies of the book as a standard 28 day loan. FastBack books need to be returned to the same library they were borrowed from.
Ok, interesting. I don’t know why it belongs in the title column instead of it’s own identifying column but at least I can work with it. I can drop it into my ingestion process. Also of note: the item_type is also set to FASTBACK (boggle)
I found a link to the BCC’s online asset search, elibcat, and searched for the Lee Child book
The summary says 24 copies, but the page lists closer to the 100 or so I see in the checkout logs, but the same weird data shows. The copies appear to have something like an ID, but the same ID is associated in multiple places. In those places it often has a differing “status”, so it’s likely that it’s multiple copies of the book with the same ‘Copy’ ID, not a single copy associated with multiple places.
I guess that’s about as far as I’m willing to push this data for now. The big inconsistencies I’ve found in the fields:
The title field: This field is ostensibly just the name of the book, but a sift through the data shows that’s it’s used for a couple of other things too. Sometimes it has a short title, sometimes a long. In the case of the FASTBACK identifier it’s just pushed into the Title field as well as the author for no apparent reason. It’s possible this only happens in this data dump provided in the open data, but it’s still a big inconsistency. Sometimes the author’s name is also dumped into this field, and sometimes not.
The author field: The occasional addition of birth and death years, the arrangement of the author’s name (‘Lastname, Firstname’ or ‘Firstname Lastname’) don’t inspire confidence in this field.
The language Field: Incomplete. If the book is English it gets a blank language. However, some english books get an ‘UNKNOWN’ language (as in the Lee Childs book above)