Playing with Open Data - Part Two

May 29, 2016

Round 2 of the Open Data dissection begins with a suggestion from a friend that I use time of day to pare apart the data. Excellent suggestion, Suba! I’ll try to keep this breakdown to visualisations, because the wall of text in the previous post might have been too much.

Google tells me the opening hours for the BCC libraries are listed on their website.

A breakdown of all checkouts by hour of day:

SELECT HOUR(date), COUNT(date) FROM checkouts GROUP BY HOUR(date) ORDER BY HOUR(date);

That seems about right. I’m not sure it makes sense that checkouts can happen at any hour of the day, but it’s quite possible that special people (staff, or others with access to the library outside normal hours) are able to check out resources outside the normal library opening hours.
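A rough sketch of how the out-of-hours checkouts could be isolated. The `checkouts` table and `date` column come from the query above; everything else is an assumption for illustration: the sample rows are made up, the 9 to 17 opening window is a placeholder rather than the real BCC hours, and SQLite (which has no `HOUR()`, hence `strftime`) stands in for the actual database.

```python
import sqlite3


def checkouts_outside_hours(conn, open_hour, close_hour):
    """Return {hour: count} for checkouts before open_hour or at/after close_hour."""
    rows = conn.execute(
        """
        SELECT CAST(strftime('%H', date) AS INTEGER) AS hour, COUNT(*)
        FROM checkouts
        WHERE CAST(strftime('%H', date) AS INTEGER) < ?
           OR CAST(strftime('%H', date) AS INTEGER) >= ?
        GROUP BY hour
        ORDER BY hour
        """,
        (open_hour, close_hour),
    ).fetchall()
    return dict(rows)


def demo_outside_hours():
    # In-memory table with made-up rows, purely for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE checkouts (item_id TEXT, date TEXT)")
    conn.executemany(
        "INSERT INTO checkouts VALUES (?, ?)",
        [
            ("a1", "2016-05-03 07:15:00"),
            ("a2", "2016-05-03 10:30:00"),
            ("a3", "2016-05-03 19:05:00"),
            ("a4", "2016-05-04 19:45:00"),
        ],
    )
    # 9 and 17 are placeholder opening hours, not the real ones.
    return checkouts_outside_hours(conn, 9, 17)
```

Since opening hours differ by branch and by weekday, a real version would need a per-library lookup rather than one fixed window.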

After using CartoDB to improve one of the bubble maps from the last Open Data post, I took another crack at the language/library breakdown.

Much better. Be sure to look at the Chinese, Vietnamese, French layers if you’re interested.

Summary

Having spent some time looking at this library data, there have been a few questions/comments raised and a few things I’d like to have in the data:

Why are some of the key fields inconsistent and/or incomplete? Does this speak to the lack of consistency in our Library’s systems?

This question is relevant to the state of the systems that govern checkouts. Considering the strange data in the Title fields (which include author names) and the Author fields (which include year of birth and year of death), it seems to me that Title and Author should have strict rules about what belongs in them, if purely from a searching/cataloging point of view. Some of this data could be cleaned up manually with a bit of manpower and without great cost, assuming that the item_id is in fact unique to the asset (hopefully it is the unique identification number to which all other data about the asset is tethered).
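Some of that cleanup could probably be scripted before anyone does it by hand. Here is a minimal sketch that splits trailing birth/death years off an author value. The input format ("Surname, Given, 1900-1980") is an assumption about what the field looks like, and `split_author` is a hypothetical helper, not something from the dataset or the library's systems.

```python
import re

# Matches a trailing ", 1900-1980" or an open-ended ", 1900-" (assumed format).
_TRAILING_YEARS = re.compile(r",?\s*(\d{4})\s*-\s*(\d{4})?\.?\s*$")


def split_author(raw):
    """Split an author string into (name, birth_year, death_year)."""
    match = _TRAILING_YEARS.search(raw)
    if not match:
        # No years found: leave the name alone.
        return raw.strip(), None, None
    name = raw[:match.start()].strip()
    born = int(match.group(1))
    died = int(match.group(2)) if match.group(2) else None
    return name, born, died
```

For example, `split_author("Example, Author, 1900-1980")` gives `("Example, Author", 1900, 1980)`, and a value with no years comes back unchanged. Anything that doesn't fit the assumed pattern would still need a human to look at it.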

Which fields are inconsistent and/or incomplete?

“You don’t know what you don’t know.” Looking at the data is the way that I identified inconsistencies, but it’s hard to say how many other problems there could be without exhaustive analysis. I might not have noticed the duplication of records had I not been looking for something else at the time. Unreliable data causes problems when attempting to make decisions based on that data. Is this similar data used by the libraries when making their own decisions? How might that affect them? Happily for me, I’m doing nothing more than running exercises against the data for my own practice.
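Duplicates, at least, can be looked for deliberately rather than stumbled upon. A minimal sketch, assuming a "duplicate" means the same `item_id` checked out at the same timestamp more than once; that definition, the `item_id` column on the `checkouts` table, and the sample rows are all assumptions, and SQLite stands in for the real database.

```python
import sqlite3


def find_duplicate_checkouts(conn):
    """Return (item_id, date, copies) for every checkout that appears more than once."""
    return conn.execute(
        """
        SELECT item_id, date, COUNT(*) AS copies
        FROM checkouts
        GROUP BY item_id, date
        HAVING COUNT(*) > 1
        ORDER BY item_id, date
        """
    ).fetchall()


def demo_duplicates():
    # In-memory table with made-up rows; the first two are exact repeats.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE checkouts (item_id TEXT, date TEXT)")
    conn.executemany(
        "INSERT INTO checkouts VALUES (?, ?)",
        [
            ("a1", "2016-05-03 10:30:00"),
            ("a1", "2016-05-03 10:30:00"),
            ("a1", "2016-05-17 14:00:00"),
            ("b2", "2016-05-03 10:30:00"),
        ],
    )
    return find_duplicate_checkouts(conn)
```

This only catches exact repeats. It says nothing about the other kinds of inconsistency, which is rather the point of the quote above.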

How is this dataset put together?

Either the dataset is munged together from individual library data, or it’s output from a central library system. The problems that are evident in the data could originate in the main data source (which is more likely for the issues in the Title and Author fields) or in the process of retrieving that data and publishing it onto the Brisbane City Council site.

What are the ramifications of incomplete/inconsistent data?

Either way, when issues are identified it calls into question the reliability of all the data. If care isn’t taken to make sure this data is reliable, can I be confident that a search of other BCC open data will turn up accurate, up-to-date information? What if I provide library opening hours to someone based on this data, it hasn’t been kept up to date, and they find the library closed? We’ve all been in the situation where we checked the opening hours for a store on their website and found that their true hours are different. It undermines confidence in the data to the point that anyone using the data cannot rely on it. If I’m providing details to a user of my application and I give them the wrong information, they will begin to distrust my application’s accuracy.

I’d like to see more data available.

While I appreciate that there are privacy issues (particularly surrounding customer data) that are important when making data like this available, I’d like to have every checkout for each month available this way. Three days a month over a year has translated into only a few days per weekday in this data, and monitoring trends properly really requires all of the data. It might also be interesting to tie the checkouts to an individual customer ID or similar, so that while no customer data is proffered, assets that are checked out together could be analysed. It would also be interesting to see check-ins, to get a sense of how long assets are borrowed for.