Thursday, 11 October 2012

5 Steps to Choosing the Right Data

You have a project, and you need data. So you go to your metadata dictionary and search for a data source, and you discover that there are several sources that you could possibly choose. Perhaps you have multiple measures and you need to know which ones to retire. How do you make the right choice? Here are my 5 steps to choosing the best data source for your project.

1.  Classify and develop your objectives:
List all of your requirements: the data fields you need, reporting frequency, timings, transaction types, granularity, etc. Make sure each one is classified as either a 'must have' or a 'want'. When you have a full list, give each 'want' a weighting score, with the highest value being the most important.
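As a rough illustration, the list of objectives could be captured in a small structure like the one below. This is only a sketch: the `Objectives` class, its method names and the example requirements are all hypothetical, and a spreadsheet would do the job just as well.

```python
from dataclasses import dataclass, field


@dataclass
class Objectives:
    """Project requirements, split into must-haves and weighted wants."""
    must_haves: set[str] = field(default_factory=set)
    # Maps each 'want' to its weighting score (higher = more important).
    wants: dict[str, int] = field(default_factory=dict)

    def add_want(self, name: str, weight: int) -> None:
        # A requirement is either a must-have or a want, never both.
        if name in self.must_haves:
            raise ValueError(f"{name!r} is already a must-have")
        if weight <= 0:
            raise ValueError("weights must be positive")
        self.wants[name] = weight

    def max_score(self) -> int:
        # The best total a source could achieve across all wants.
        return sum(self.wants.values())


# Hypothetical requirements, for illustration only.
objectives = Objectives(
    must_haves={"customer_id field", "transaction-level granularity"}
)
objectives.add_want("daily refresh", 5)
objectives.add_want("includes refunds", 3)
objectives.add_want("five years of history", 2)
```

Keeping the must-haves separate from the weighted wants makes the later scoring straightforward: must-haves act as a pass/fail gate, and only the wants contribute to the total.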

2.  Profile the data sources.
Build and run profiles of the data in each data source. Examine the field types, volumes, dates, times, transaction types and granularity. Profiling any creation timestamps will give you an idea of the scheduling that runs on the data. 
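If you do not have a profiling tool to hand, a very basic profile can be put together with a few lines of Python. The sketch below assumes the data has already been loaded as a list of dictionaries and that creation timestamps are ISO-formatted strings; the `profile` function and the sample field names are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def profile(rows, timestamp_field=None):
    """Summarise each field: null count, distinct values and observed types."""
    fields = {}
    for row in rows:
        for name, value in row.items():
            stats = fields.setdefault(
                name, {"nulls": 0, "types": Counter(), "values": set()}
            )
            if value is None or value == "":
                stats["nulls"] += 1
            else:
                stats["types"][type(value).__name__] += 1
                stats["values"].add(value)  # assumes values are hashable

    summary = {
        name: {
            "nulls": s["nulls"],
            "distinct": len(s["values"]),
            "types": dict(s["types"]),
        }
        for name, s in fields.items()
    }

    # An hour-of-day histogram of creation timestamps hints at the
    # scheduling that loads the data.
    load_hours = Counter()
    if timestamp_field:
        for row in rows:
            ts = row.get(timestamp_field)
            if ts:
                load_hours[datetime.fromisoformat(ts).hour] += 1

    return {
        "row_count": len(rows),
        "fields": summary,
        "load_hours": dict(load_hours),
    }


# Made-up sample rows.
rows = [
    {"id": 1, "type": "SALE", "created_at": "2012-10-10T02:15:00"},
    {"id": 2, "type": "REFUND", "created_at": "2012-10-10T02:16:30"},
    {"id": 3, "type": None, "created_at": "2012-10-11T02:14:10"},
]
result = profile(rows, timestamp_field="created_at")
```

In this sample every row was created during the 02:00 hour, which would suggest an overnight batch load.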

3.  Match the attributes and profile results of the data sources against the objectives.
Based on the profiling, how well does each source satisfy the objectives? Consider the timeliness of updates and batch windows. Do they match the schedule in your objectives? Are the data sources structurally compatible with your requirements? Does each data source provide the correct level of granularity? If any of the 'must haves' are not met for a data source, reject it outright. For all the other options, total the score based on how well they achieve the 'wants'.
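The reject-then-score logic can be sketched as follows. I am assuming here that each requirement is rated between 0.0 (not met) and 1.0 (fully met) for each source; that scale, the `score_source` function and the source names are my own choices for illustration, not a fixed method.

```python
def score_source(met, must_haves, wants):
    """Return None if any must-have is unmet, else the weighted total of wants.

    met: mapping of requirement name -> degree satisfied (0.0 to 1.0).
    """
    # Any unmet must-have rejects the source outright.
    if any(met.get(name, 0.0) < 1.0 for name in must_haves):
        return None
    # Otherwise, weight each want by how well the source achieves it.
    return sum(weight * met.get(name, 0.0) for name, weight in wants.items())


# Hypothetical objectives and assessment results.
must_haves = ["customer_id field"]
wants = {"daily refresh": 5, "includes refunds": 3}

sources = {
    "warehouse": {"customer_id field": 1.0, "daily refresh": 1.0, "includes refunds": 0.5},
    "crm_extract": {"customer_id field": 1.0, "daily refresh": 0.0, "includes refunds": 1.0},
    "legacy_feed": {"customer_id field": 0.0, "daily refresh": 1.0, "includes refunds": 1.0},
}

scores = {name: score_source(met, must_haves, wants) for name, met in sources.items()}

# Rank the surviving sources, best first.
ranked = sorted(
    (name for name, score in scores.items() if score is not None),
    key=lambda name: scores[name],
    reverse=True,
)
```

Here `legacy_feed` is rejected because it fails a must-have, even though it would have scored highest on the wants.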

4.  Identify the risks
Take your two highest scorers and ask yourself the following questions about them:
  • What future threats should we consider?
  • If we choose this data source, what could go wrong?
  • Is our understanding of this data source good enough?
  • What are the capacity/system constraints?

5.  Choose your preferred data source:
Are you willing to accept the risks carried by the best performer in order to attain the objectives?
          If yes, choose it.
          If no, consider the next best performer and ask again.
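
The yes/no loop above amounts to walking down the ranked list until you find a source whose risks you can live with. A minimal sketch, assuming the risk review has already produced a yes/no answer per source (the `choose_source` function and the example answers are hypothetical):

```python
def choose_source(ranked, risks_acceptable):
    """Walk candidates from best to worst; return the first with accepted risks."""
    for name in ranked:
        if risks_acceptable(name):
            return name
    # No candidate has acceptable risks: revisit the objectives or the sources.
    return None


ranked = ["warehouse", "crm_extract"]  # best performer first
# Hypothetical outcome of the risk review for each source.
accepted = {"warehouse": False, "crm_extract": True}

choice = choose_source(ranked, lambda name: accepted[name])
```

Note that the function can return `None`. If none of the candidates carries acceptable risk, that is a signal to go back and revisit the objectives rather than force a choice.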

So there you have it, a rigorous approach to choosing the best data source. How much detail you go into will depend on the rigour that is required for your industry sector.


  1. Well thought-out recipe, Rich. What are your thoughts about blending several sources into a better mix? Also, what are your thoughts about mixing internal data and external data?

    1. Good questions. From this perspective, I would consider blending data sources in different ways as being different options in the selection process. But now you mention it, it could also be a decision/recommendation after profiling. As for internal and external data, I think this is - perhaps - a much bigger article. I was conscious of not turning this into "War and Peace".
      Once again, thanks. I really value your comments. This is why I set up this blog - to exchange ideas with the rest of the industry.