Sunday, 24 February 2013

The Politics of Data

People are finding new and interesting ways to use their computers to enrich their lives. Many are choosing to be politically active, forming discussion groups to exchange ideas and challenge existing rhetoric. They form alliances and pressure groups. The internet is doing more to bring grass-roots politics to the public than anything that has come before.

Online petitions have become very popular. There are websites and organisations that promote ideas to their members and collect electronic 'signatures'. It only takes a politician to announce an unpopular piece of legislation on television, and they can expect to be handed a petition signed by tens of thousands of people from one of these websites within a couple of days.

This kind of instant feedback must be very useful to politicians for gauging public reaction and preventing policy mistakes. It is also doing a lot to re-engage politicians with a public who have often been marginalised by intense lobbying from corporate and foreign interests with close access to government officials.

Although politicians must take these petitions seriously, they may have a good reason to challenge their validity. The question is about authentication - the process of making sure the person signing the online petition is who they say they are. 

Authenticating an online petition signature requires very little information: an email address, forename, surname and postcode. Anyone with access to an electoral roll who also possesses a large volume of email addresses could build an automated process to generate thousands of signatures on a petition website.
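To see how weak those checks are, here is a minimal sketch of the kind of validation a typical petition site performs. The function name and the simplified postcode pattern are my own illustrations, not taken from any real site; the point is that everything here checks *format*, not *identity*.

```python
import re

# Simplified UK postcode pattern (the real rules have more special cases).
POSTCODE_RE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def looks_valid(forename, surname, email, postcode):
    """Return True if the fields merely *look* plausible.

    Nothing here proves the signer is a real, unique person:
    any electoral-roll entry plus a throwaway email address passes.
    """
    return (bool(forename.strip()) and bool(surname.strip())
            and EMAIL_RE.match(email) is not None
            and POSTCODE_RE.match(postcode) is not None)

print(looks_valid("Jane", "Doe", "jane@example.com", "M1 1AE"))  # True
```

A script looping over harvested names and addresses would pass this check thousands of times over, which is exactly the weakness described above.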

So politicians cannot be sure that all of the signatures are genuine. They could be falsified by corporate or political interest groups. They could also come from abroad: because the internet is truly global, people living outside your country can pass the checks if they know a valid postcode within it.

Authenticating a signature to a specific individual is certainly within the realms of possibility, using biometric fingerprinting, RFID chips or identity cards. However, there are fundamental human rights and personal freedoms that would need to be addressed first.

Until we can find a solution to authentication that does not compromise our rights to privacy and freedom, online petitions may not be as effective as we hope.

Thursday, 21 February 2013

A morning with Talend's BIG DATA

On the 20th of February, Manchester played host to Talend's Big Data roadshow. It promised to introduce people to the basics of big data and showcase new versions of Talend's open-source software.

The attendees were from varying industries throughout the North West of England. I was quite surprised to see some familiar faces from previous projects. It's a small world!

Ben Bryant was our technical presenter who ably took us through the usual introductions, and explained Talend's history as a leading supplier of open-source Data Integration, Data Quality and Master Data Management software. I was quite familiar with their products, but had not seen their big data solutions in action.

Once the basics were out of the way, Ben went on to explain the fundamentals of big data and how Talend's tools integrate into the current big data infrastructure. We were treated to a demonstration of an active set of slave and master servers, with Talend extracting, loading and analysing the data. An example of data profiling was also shown. We were introduced to a Hadoop infrastructure, Pig Latin and Hive, all in plain English, refreshingly free of jargon and acronym abuse.

My impression from the presentation is that Talend have a clear vision of the future, and that Big Data has a part to play in it. Their goal appears to be to make their tools as simple as possible to use with this new way of structuring and processing data. Education of customers has to be key, for it is the users who will find new and exciting ways to innovate using their technology. These innovations will find their way into future releases of Talend through their community of developers.

What I saw did not inspire me with an immediate need to build a big data solution, for this is still an emerging technology. However, it did say to me, "Hey, Rich. Big data is simpler than you think... Next time someone talks to you about unstructured data like emails, social media, images etc... Why not have a go?"

Monday, 18 February 2013

A weekend with Talend

I am constantly surprised by the wealth of open-source software that is available for use. So when flu struck myself and my family this weekend, which put paid to our plans, I decided to do some software evaluation. Talend have been occupying the 'visionaries' side of Gartner's Magic Quadrant for Data Quality software for some time now.

Last year, I evaluated their Open Profiler, and found it useful, if a little clunky. But that is the tip of the iceberg. They also provide a Data Integration tool, which I decided to have a go with. 

The version I used was 5.2.1. Installation was simple: I merely extracted the zip file downloaded from Talend's website and selected the file that runs the software. There are Linux, OS X, Solaris and Windows options, all packaged and ready to run in 32- or 64-bit flavours. The program runs on the well-known Eclipse graphical user interface, so it depends upon having Java installed on your machine.

Once opened, the program took time to fully load all of the tools. But when selecting them from the panel on the right, I could see why. There is just about everything you need to be a one-man data integration specialist. It contains enough JDBC connectors to enable you to connect to just about any database. 

The database I chose was an old instance of MySQL that had been sitting on my computer for some time. I set up some dummy data and dived right in.

The whole package strikes just the right balance between simplicity and configurability. Extracting, parsing, joining and transforming data is very straightforward. The way the program deals with type 1, type 2 and type 3 slowly-changing dimensions is fantastic; that function alone makes it an outstanding piece of work that should save you a huge amount of development time. All of the modular jobs can export their results, and the details of any errors, back into the database of your choice, or into files. This means you can produce comprehensive management information on the efficiency of your processes.
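For readers unfamiliar with the term, a type 2 slowly-changing dimension keeps history by expiring the old row and inserting a new one whenever an attribute changes. Here is a minimal, in-memory Python sketch of that logic; the function and column names are my own illustrations, not Talend's, which automates all of this behind a component:

```python
from datetime import date

def scd2_upsert(dimension, business_key, attributes, today):
    """Type 2 SCD update on `dimension` (a list of row dicts):
    if the attributes for `business_key` changed, close the current
    row and append a new current row; otherwise do nothing."""
    current = next((r for r in dimension
                    if r["key"] == business_key and r["end_date"] is None), None)
    if current is not None:
        if current["attrs"] == attributes:
            return dimension          # no change, keep the current row
        current["end_date"] = today   # expire the old version
    dimension.append({"key": business_key, "attrs": attributes,
                      "start_date": today, "end_date": None})
    return dimension

dim = []
scd2_upsert(dim, 42, {"city": "Manchester"}, date(2013, 1, 1))
scd2_upsert(dim, 42, {"city": "Leeds"}, date(2013, 2, 1))
# dim now holds two rows: the Manchester row closed on 2013-02-01,
# and a current Leeds row with end_date of None.
```

Type 1 would simply overwrite the attributes in place, and type 3 would keep the previous value in an extra column; type 2, as above, is the one that preserves full history.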

Once you have built your DI jobs, you can export them as self-contained programs that can be deployed within the package or platform of your choice. As long as Java is installed, the jobs will run. Before the weekend was out, I had a fully functioning, scheduled data warehouse, with a comprehensive detail layer and a presentation layer of summarised MI ready to plug an OLAP portal into.

There are some limitations to this package. If you are to work with a group of analysts, a shared repository is vital; however, you have to get the enterprise version for that, and it doesn't come cheap. But leaving that aside, Talend have to be congratulated for putting together quite an impressive piece of Data Integration software. Honestly, I just can't believe it's free.

Next week I will be attending Talend's roadshow to see their new developments in the data science discipline of 'Big Data'. I will let you know how it goes.

Saturday, 9 February 2013

Data lineage - a cautionary tale

Recently, laboratories in Ireland discovered that some processed beef contained horse meat. The public were outraged, and the Food Standards Agency insisted that all processed beef sold at retail in the UK be DNA-tested for horse meat. Some high-profile branded processed beef products have been found to contain 100% horse meat.

Now the general public are very concerned that they have been eating food that could have been contaminated with chemicals that are used in the rearing of horses.

A senior politician was quoted as saying, "We need to know the farmer and the meat processor."

It seems that the modern processing of meat has become a very complicated business, with different parts of animals being moved from one company to another. No-one truly knows where their processed meat comes from. 

You may be surprised to hear that there are many companies who deal with data in a similar way to the meat processing industry. They may know where the data is manufactured, and where the results appear in reports, but it is the processing in the middle that they don't understand.

Data may arrive in a database, then get extracted, parsed, standardised and moved from one mart to another. It may be summarised and moved into multiple spreadsheets, where adjustments are manually made, and then re-extracted into other systems before finally finding its way into a report. The full map of systems, processes and departments involved in the processing chain may not be known in full by any one person in the organisation. It is also unlikely that any of it is written down!

The understanding of how data is manufactured, processed, stored and used is called 'data lineage'. The financial industry has already begun to address the problem of companies not knowing their data lineage through the EU's Solvency II directive. Although it is not yet in force, understanding your data lineage is on its way to becoming a legal requirement.

If your company is large, tracking your lineage may be an expensive business. Certainly, the software is very expensive. Such costs may be hard to justify in the present financial climate, but doing it now, on your own terms, is far cheaper than waiting for an angry public and government legislation to force you to do it.

Sunday, 3 February 2013

BI Centralisation - The Challenges

Perhaps your company has acquired a number of other businesses. It could still be suffering from the remnants of older ways of working. But a quick look around many companies will show that management information is being generated in many different areas, with varying levels of accuracy. 

This can cause you a great deal of problems with multiple, conflicting versions of the same measure being manufactured throughout the organisation.

Many organisations are looking to build Business Intelligence competency centres, by pooling resources, systems and processes into one area. This has many advantages. But you may encounter some fierce opposition to your plans. Here are the top reasons why people will oppose your plans:

1.  Exclusivity
People like to manufacture their own MI because it gives them a first look at the figures before everyone else. So if you're in the sales department and you manufacture the sales figures, you see them first. You can start thinking up excuses as to why you haven't hit your targets way before anyone else knows about the results.

2.  The illusion of control
I'm not sure why this happens, but departments like to control their MI, because somewhere in their heads, it implies that they can control the business itself. Manufacturing your own MI only brings benefits if you do not have a data quality department.

3.  Analysts have become too powerful
Very often the MI analysts know more, and make more business decisions, than the managers. If those analysts were sucked into a centralised department, or made redundant, the manager would lose his or her competitive advantage.

4.  You can bury bad news
Once you control your MI, there is a great temptation to publish only the data that supports the story you want to tell. If any of your data contradicts the narrative, then it's just not important and often left out. Part of the climate sceptics' argument is that climate scientists stand accused of omitting results that do not fit their hypotheses.

Centralising all management information functions brings a lot of important synergies to medium and large companies. More importantly, it takes the figures out of the control of the departments who have a vested interest in the results. As a consequence, conflicts of interest occur less often, and the data is queried and presented fairly.

To ensure this happens, it is vital that a Business Intelligence Competency Centre runs almost as a separate entity, free of the political control of vested interests in other parts of the organisation.