Musings of a Data Geek: October 2012

Thursday, 25 October 2012

How Scary Is Your Data?

Soon it will be Halloween. It's a time of ghosts, ghouls and demons. But all of that pales into insignificance when compared to the truly terrifying reality of kids running around the streets pumped up on chocolate, sugar and energy drinks!!!! And to celebrate the witching hour, here is my list of Halloween data horrors... Don't say I didn't warn you... Mouhahahahaaaaaa!!!!

Undead data

This is the ancient data that you did not kill off. It served its purpose years ago, and you archived it, but did not delete it. It lies in its crypt waiting... waiting... for the sun to set. If your regulators find out, it's you who will get it in the neck.

Alien data

It comes from another world (cue '50s B-movie music)... namely that company you have outsourced your data collection to. But you forgot to include data quality and governance standards in the agreement. And now you have data that is eating up all your resources as you try to make sense of it. No-one can agree on the results and your whole organisation is paralysed.

Frankenstein data

They wanted to know diabolical things about your organisation, and they didn't care about how you did it. You could not find any documentation on your data sources, and they would not pay for a profiling tool. So you bolted and stitched huge amounts of unrelated data together to create an abomination. Deep into the night, you worked feverishly until finally you hysterically cried, "It lives, it lives".... All were amazed how you could breathe life into dead data and you reaped rewards. But deep down, you know it's only a matter of time before it either comes apart or brings your whole organisation crashing down around you.

Zombie systems

Those legacy systems died years ago. But someone keeps digging them up and re-animating them. Whoever did it, they certainly seem to have lost their BRRAAAIIIINS!!!

Godzilla data

No-one knows how or why they asked for it, but now it's here, and it's just too big. The scale is massive. All your IT staff run away screaming while it crushes servers and tangles networks. This big, 'Godzilla' data is requiring some other monster called 'Hadoop' to sort it out. They were last seen fighting off the coast of Java.

I hope you enjoyed my tales of data horror. Sleep well, now.. Pleasant dreams.... Mouhahahaaaa!!

Tuesday, 23 October 2012

Data Quality Failure - Apple Style

When Apple launched the iPhone 5, much was made of the new features of iOS 6, one of which was the new maps application. This was lauded as "A beautiful vector based interface" and "Everything's easy to read and you won't get lost".

Although the application functioned well, the data it used was far from effective. Contrary to the hype, people started to 'get lost'. One thing is patently clear: Apple had not conducted any data quality analysis of the databases that the maps application consumes.

All databases are models of reality. The discipline of data quality is to ensure that the database is the best model possible. It is obvious that the maps database was not checked against reality to ascertain whether it was an accurate or complete model.

An independent analysis of a sample of the Apple Maps data (using the Canadian province of Ontario) provided some interesting stats. Of the 2028 place names in Ontario, 400 were correct, 389 were close to correct, 551 were completely incorrect, and 688 were missing.

Apple did not gather this information. It acquired the street and place data from TomTom (the vehicle satellite navigation company) and integrated it with other databases. Despite strenuous denials of culpability by TomTom, the facts show that the location data experienced by the users was missing or incorrect.

To say that this has undermined the reputation of Apple is a large understatement. It prompted a public apology by the CEO, Tim Cook.

So could TomTom and the other suppliers of maps data have knowingly supplied incorrect data to Apple? Probably not. Surely Apple had data quality measurement in place? The results suggest not: only 19.7% of place names were accurate, and 33.9% were missing.
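For anyone who wants to check the percentages, here is a small sketch of the arithmetic. The four counts are the ones quoted from the Ontario sample above; the variable names are just my own labels.

```python
# Counts quoted from the Ontario sample (2028 place names in total).
counts = {
    "correct": 400,
    "close to correct": 389,
    "completely incorrect": 551,
    "missing": 688,
}

total = sum(counts.values())

# Share of the sample in each category, as a percentage to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

The 'correct' and 'missing' shares come out at 19.7% and 33.9%, matching the figures above.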

When entering into agreements with third-party suppliers of information, it is imperative that data quality standards are insisted on as part of the commercial agreement - with penalties for non-compliance. As the results of this little mess between Apple and its suppliers show, you may be able to outsource responsibility, but not accountability.

Thursday, 11 October 2012

5 Steps to Choosing the Right Data

You have a project, and you need data. So you go to your metadata dictionary and search for a data source, and you discover that there are several sources that you could possibly choose. Perhaps you have multiple measures and you need to know which ones to retire. How do you make the right choice? Here are my 5 steps to choosing the best data source for your project.

1. Classify and develop your objectives:

List all of your requirements: the data fields you need, reporting frequency, timings, transaction types, granularity etc. Make sure each one is classified as either a 'must have' or a 'want'. When you have a full list, give each 'want' a weighting score - highest value being most important.

2. Profile the data sources.

Build and run profiles of the data in each data source. Examine the field types, volumes, dates, times, transaction types and granularity. Profiling any creation timestamps will give you an idea of the scheduling that runs on the data.
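If you have no profiling tool to hand, even a few lines of script give you a rough profile. The sketch below is only an illustration: the sample rows, the field names and the two helper functions (`profile` and `creation_hours`) are all made up by me, and a real profile would cover far more than blanks and distinct values.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample rows standing in for an extract from one data source.
rows = [
    {"account_id": "A1", "txn_type": "debit", "amount": "10.50", "created_at": "2012-10-01 02:00:05"},
    {"account_id": "A2", "txn_type": "credit", "amount": "", "created_at": "2012-10-01 02:00:09"},
    {"account_id": "A3", "txn_type": "debit", "amount": "7.25", "created_at": "2012-10-02 02:00:03"},
]


def profile(rows):
    """Return the row count plus blank and distinct-value counts per field."""
    result = {"row_count": len(rows), "fields": {}}
    for field in rows[0].keys():
        values = [row[field] for row in rows]
        result["fields"][field] = {
            "blank": sum(1 for v in values if v == ""),
            "distinct": len(set(values)),
        }
    return result


def creation_hours(rows, field="created_at"):
    """Count rows by hour of creation, which hints at when loads run."""
    hours = Counter()
    for row in rows:
        stamp = datetime.strptime(row[field], "%Y-%m-%d %H:%M:%S")
        hours[stamp.hour] += 1
    return dict(hours)


summary = profile(rows)
load_hours = creation_hours(rows)
print(summary)
print(load_hours)
```

In this made-up sample every row was created in the 02:00 hour, which would suggest an overnight batch load.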

3. Match the attributes and profile results of the data sources against the objectives.

Based on the profiling, how well does each source satisfy the objectives? Consider the timeliness of update and batch windows. Do they match the schedule in your objectives? Are the data sources structurally compatible with your requirements? Does each data source provide the correct level of granularity? If any of the 'must haves' are not met for a data source, reject it outright. For all the other options, total the score based on how well they achieve the 'wants'.
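The reject-then-score logic can be sketched in a few lines. Everything named here is hypothetical: the objectives, the weights, the three data sources and the `score_sources` function are mine, purely for illustration. I have also simplified by treating each 'want' as either met or not met, whereas in practice you may prefer to award partial scores.

```python
# Hypothetical 'must haves': a source missing any of these is rejected.
must_haves = ["has_customer_id", "has_transaction_date"]

# Hypothetical 'wants' with weighting scores (highest value = most important).
want_weights = {
    "daily_refresh": 5,
    "transaction_level_detail": 3,
    "five_years_history": 2,
}

# Hypothetical profiling findings: the objectives each source satisfies.
sources = {
    "warehouse": {"has_customer_id", "has_transaction_date", "daily_refresh", "five_years_history"},
    "ops_db": {"has_customer_id", "has_transaction_date", "daily_refresh", "transaction_level_detail"},
    "legacy_extract": {"has_customer_id", "transaction_level_detail", "five_years_history"},
}


def score_sources(sources, must_haves, want_weights):
    """Reject sources that miss a 'must have', then rank the rest by 'wants'."""
    scores = {}
    for name, met in sources.items():
        if not all(must in met for must in must_haves):
            continue  # fails a 'must have': rejected outright
        scores[name] = sum(w for want, w in want_weights.items() if want in met)
    # Highest total first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = score_sources(sources, must_haves, want_weights)
print(ranking)
```

Here 'legacy_extract' is thrown out because it has no transaction date, and 'ops_db' (score 8) ranks ahead of 'warehouse' (score 7).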

4. Identify the risks

Take your two highest scorers and ask yourself the following questions about them:

What future threats should we consider?
If we choose the data source, what could go wrong?
Is our understanding of this data source good enough?
What are the capacity/system constraints?

5. Choose your preferred data source:

Are you willing to accept the risks carried by the best performer in order to attain the objectives?

If yes, choose it.

If no, consider the next best performer and ask again.
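Written as a loop, the decision looks something like the sketch below. The `choose_source` function and the example source names are hypothetical, and the 'risk is acceptable' answer is a judgement that you supply from step 4, not something the code can work out for you.

```python
def choose_source(ranked_sources, risk_is_acceptable):
    """Return the first source, best performer first, whose risks are accepted."""
    for name in ranked_sources:
        if risk_is_acceptable(name):
            return name
    return None  # no candidate carries acceptable risk


# Hypothetical ranking from step 3, best performer first.
ranked = ["ops_db", "warehouse"]

# Hypothetical outcome of step 4: the risks of 'ops_db' were judged too high.
unacceptable = {"ops_db"}

choice = choose_source(ranked, lambda name: name not in unacceptable)
print(choice)
```

If every candidate is turned down, the function returns None, which is your cue to revisit the objectives or look for another source.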

So there you have it, a rigorous approach to choosing the best data source. How much detail you go to will depend on the rigour that is required for your industry sector.

Sunday, 7 October 2012

The 4 C's of data management

The 4 C's are what I use to map the data journey. Here are the 4 C's:

Created

Data is created. Generally, this is done by people who key in the data manually. Your customers may be data creators if they have to key in online applications. Data creators are responsible for creating data as correctly as possible.

Changed

The data is also changed. It could be as simple as someone keying a change of address in your database, or changing services for your customers. Data changers are responsible for keeping the data in line with changes to reality.

Controlled

This includes data management, regulatory compliance and data quality monitoring and maintenance activities. In the data world, this can also cover parsing, standardising, error correction, de-duplication etc. Controllers are responsible for monitoring and controlling the data. It also covers anyone who has to archive or destroy data to fulfil data regulation.

Consumed

These are people who view or use the information as part of their job. If you share your information with your customers (account statements etc), they are also included. This also covers Data Protection Act requests for information (UK only). They are responsible for understanding the data, challenging it if they find errors, and making the correct business decisions using it.

These are really snappy ways to remind yourself about the kind of questions you need to ask about a process that you are surveying to understand roles and responsibilities.

All are responsible for process change and maintenance in their areas. All should be consulted and informed about change to the data. The one who should be held Accountable is the colleague who has over-arching control over the whole process. Obviously, if the process spans several departments, accountability can be shared across several function leaders.