Musings of a Data Geek

Monday, 18 February 2013

A weekend with Talend

I am constantly surprised by the wealth of open-source software that is available for use. So when flu struck me and my family this weekend and put paid to our plans, I decided to do some software evaluation. Talend have been occupying the 'visionary' side of Gartner's Magic Quadrant for data quality software for some time now.

Last year, I evaluated their Open Profiler, and found it useful, if a little clunky. But that is the tip of the iceberg. They also provide a Data Integration tool, which I decided to have a go with.

The version I was using is 5.2.1. Installing was simple: I merely extracted the zip file that you download from Talend's website and selected the file that would run the software. There are Linux, OS X, Solaris and Windows options that all come packaged and ready to run in 32-bit or 64-bit versions. The program runs on the well-known Eclipse platform, so it depends upon having Java installed on your machine.

Once opened, the program took time to fully load all of the tools. But when selecting them from the panel on the right, I could see why. There is just about everything you need to be a one-man data integration specialist. It contains enough JDBC connectors to enable you to connect to just about any database.

The database I chose was an old instance of MySQL that I had on my computer for some time. I set up some dummy data and dived right in.

The whole package is just the right mix between simplicity and configuration. Extracting, parsing, joining, transforming data is very straightforward. The way the program deals with type 1, type 2 and type 3 aspects of slowly-changing-dimensions is fantastic. That function alone makes it an outstanding piece of work that should save you a huge amount of development time. All of the modular jobs have the ability to export their results and the details of any errors back into the database of your choice, or perhaps into files. This means you can produce comprehensive management information on the efficiency of your processes.
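For anyone who has not met slowly changing dimensions before, here is a minimal Python sketch of the three types. It is not how Talend implements them. The record layout (customer_id, city, valid_from, valid_to) and the function names are my own assumptions, made up purely for illustration.

```python
from copy import deepcopy
from datetime import date


def apply_type1(rows, key, attr, new_value):
    """Type 1: overwrite the value in place, so no history is kept."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == key:
            row[attr] = new_value
    return rows


def apply_type2(rows, key, attr, new_value, change_date):
    """Type 2: close the current row and append a new version of it."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == key and row["valid_to"] is None:
            if row[attr] == new_value:
                return rows  # nothing has changed
            row["valid_to"] = change_date
            rows.append(dict(row, **{attr: new_value,
                                     "valid_from": change_date,
                                     "valid_to": None}))
            return rows
    return rows


def apply_type3(rows, key, attr, new_value):
    """Type 3: keep only the previous value, in an extra column."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value
    return rows


# Hypothetical one-row customer dimension.
dimension = [{"customer_id": 1, "city": "Leeds",
              "valid_from": date(2012, 1, 1), "valid_to": None}]

moved = apply_type2(dimension, 1, "city", "York", date(2013, 2, 18))
print(len(moved))  # one closed row for Leeds plus one open row for York
```

Type 1 is the cheapest but forgets the past, type 2 keeps the full history at the cost of extra rows, and type 3 sits in between by remembering just one previous value.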

Once you have built your DI jobs, you can export them as self-contained programs that can be deployed within the package or platform of your choice. As long as Java is installed, the jobs will run. Before the weekend was out, I had a fully functioning, scheduled data warehouse, with a comprehensive detail layer and a presentation layer of summarised MI ready to plug an OLAP portal into.

There are some limitations to this package. If you are to work with a group of analysts, a shared repository is vital. However, you have to get the enterprise version for that, and that doesn't come cheap. But leaving that aside, Talend have to be congratulated for putting together quite an impressive piece of Data Integration software. Honestly, I just can't believe it's free.

Next week I will be attending Talend's roadshow to see their new developments in the data science discipline of 'Big Data'. I will let you know how it goes.

Saturday, 9 February 2013

Data lineage - a cautionary tale

Recently, laboratories in Ireland discovered that some of their processed beef contained horse meat. The public were outraged, and the Food Standards Agency insisted that all processed beef sold at retail in the UK be DNA tested for horse meat. Some high-profile branded processed beef products have been found to contain 100% horse meat.

Now the general public are very concerned that they have been eating food that could have been contaminated with chemicals that are used in the rearing of horses.

A senior politician was quoted as saying, "We need to know the farmer and the meat processor."

It seems that the modern processing of meat has become a very complicated business, with different parts of animals being moved from one company to another. No-one truly knows where their processed meat comes from.

You may be surprised to hear that there are many companies who deal with data in a similar way to the meat processing industry. They may know where the data is manufactured, and where the results appear in reports, but it is the processing in the middle that they don't understand.

Data may arrive in a database, then get extracted, parsed, standardised and moved from one mart to another. It may be summarised and moved into multiple spreadsheets, where adjustments are made manually, and then the data is re-extracted into other systems before finally finding its way into a report. The full map of systems, processes and departments involved in the processing chain may not be known by any one person in the organisation. It is also unlikely that any of it is written down!

The understanding of how data is manufactured, processed, stored and used is called 'data lineage'. In the financial industry, the EU directive Solvency II already addresses the problem of companies not knowing their data lineage. Although it is not in force yet, it means that understanding data lineage is becoming a legal requirement.
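To make the idea a little more concrete, here is a rough Python sketch of lineage recorded as a simple map from each dataset to the datasets it is built from. The dataset names are invented for illustration, and real lineage tools record far more than this (processes, owners, transformations), so treat it as a toy under those assumptions.

```python
# Hypothetical lineage map: each dataset -> the datasets it is built from.
LINEAGE = {
    "board_report": ["finance_spreadsheet"],
    "finance_spreadsheet": ["sales_mart"],
    "sales_mart": ["warehouse_detail"],
    "warehouse_detail": ["crm_database", "billing_database"],
}


def upstream_sources(dataset, lineage):
    """Return every dataset that feeds `dataset`, directly or indirectly."""
    seen = []
    stack = list(lineage.get(dataset, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, which also guards against loops
        seen.append(current)
        stack.extend(lineage.get(current, []))
    return seen


def original_sources(dataset, lineage):
    """Return the upstream datasets that have no recorded source themselves."""
    return sorted(s for s in upstream_sources(dataset, lineage)
                  if s not in lineage)


print(original_sources("board_report", LINEAGE))
# ['billing_database', 'crm_database']
```

Even a map this crude answers the horse meat question for data: given a figure in a report, which systems did it originally come from, and what did it pass through on the way?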

If your company is large, tracking your lineage may be an expensive business. Certainly, the software is very expensive. Such costs may be hard to justify in the present financial climate, but doing it now, on your own terms, is far cheaper than waiting for an angry public and government legislation to force you to do it.

Sunday, 3 February 2013

BI Centralisation - The Challenges

Perhaps your company has acquired a number of other businesses. It could still be suffering from the remnants of older ways of working. But a quick look around many companies will show that management information is being generated in many different areas, with varying levels of accuracy.

This can cause you a great many problems, with multiple, conflicting versions of the same measure being manufactured throughout the organisation.

Many organisations are looking to build Business Intelligence competency centres, by pooling resources, systems and processes into one area. This has many advantages. But you may encounter some fierce opposition to your plans. Here are the top reasons why people will oppose your plans:

1. Exclusivity
People like to manufacture their own MI because it gives them a first look at the figures before everyone else. So if you're in the sales department and you manufacture the sales figures, you see them first. You can start thinking up excuses as to why you haven't hit your targets way before anyone else knows about the results.

2. The illusion of control
I'm not sure why this happens, but departments like to control their MI, because somewhere in their heads, it implies that they can control the business itself. Manufacturing your own MI only brings benefits if you do not have a data quality department.

3. Analysts have become too powerful
Very often the MI analysts know more, and make more business decisions than the managers. If their analysts were sucked into a centralised department, or made redundant, the manager would lose his/her competitive advantage.

4. You can bury bad news
Once you control your MI, there is a great temptation to only publish the data that supports the story that you want to tell. If any of your data contradicts the narrative, then it's just not important and is often left out. Part of the climate sceptics' argument is that climate scientists omit the results that do not fit with their hypotheses.

Centralising all management information functions brings a lot of important synergies to medium and large companies. But more importantly, it takes the figures out of the control of the departments who have a vested interest in their results. As a result, conflicts of interest occur less, and the data is queried and presented fairly.

To ensure this happens, it is vital that a Business Intelligence Competency Centre runs almost as a separate entity from the rest of the organisation, free of political control by vested interests in other parts of the organisation.

Tuesday, 15 January 2013

Data management lessons from kindergarten

My youngest daughter, Mia, will soon be 4 years old. She is such a little chatterbox at the moment, and is into everything - just like a typical little girl of her age.

Just before Christmas, I had some time off, so I picked her up from school. While I was waiting with the other parents, I cast my eye around the classroom, and some of the posters reminded me that some things learned in kindergarten can be applied to modern data management.

1. Hold hands before you cross the road

There are risks in every part of society. Business is a careful balance of risk and opportunity. It is important that everyone plays their part in looking after each other to ensure no-one is exposed to unnecessary risk. To do this, we all have to work together and look out for each other.

2. Put things away when you've finished playing with them

Data is like any other tool in business. When you have finished using it, ensure it is stored in a secure area, where no-one can steal it. When your data is no longer required, ensure it is deleted securely and safely.

3. Sharing is caring

Re-using the same measures, sharing data sources and not re-extracting the same data over and over again is not only practical, but extremely time efficient.

4. Don't forget to say 'please' and 'thank you'

Manners are a minimum standard of behaviour. Just think what would happen if you insisted on minimum standards for your data and enforced them through every process throughout your organisation.
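As a small illustration of what "minimum standards" might look like in practice, here is a Python sketch that checks a record against a handful of rules. The field names and the rules themselves are assumptions I have invented for the example. Your own standards would come from your business, not from me.

```python
import re

# Hypothetical minimum standards: field name -> check that must pass.
MINIMUM_STANDARDS = {
    "customer_id": lambda v: isinstance(v, int) and v > 0,
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "postcode": lambda v: isinstance(v, str) and v.strip() != "",
}


def failed_standards(record, standards=MINIMUM_STANDARDS):
    """Return the names of the fields in `record` that miss the standard."""
    return [field for field, check in standards.items()
            if field not in record or not check(record[field])]


good = {"customer_id": 7, "email": "mia@example.com", "postcode": "LS1 4AP"}
bad = {"customer_id": 0, "email": "not-an-email"}

print(failed_standards(good))  # []
print(failed_standards(bad))   # ['customer_id', 'email', 'postcode']
```

The point is less the rules than the habit: the same checks, run at every step where data changes hands.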

OK, perhaps I've been stretching some metaphors here. But I wonder how much better our world would be if we just followed some simple principles, universally.

Saturday, 5 January 2013

The Technology Trap

There is no doubt that technology has been a great enabler for mankind in general. It has allowed us to do things today that would have been impossible only a decade ago. But there is one fundamental issue that can cause problems for a business: the unnecessary adherence to a particular technology.

The most keenly observed competition over technology is in the retail smartphone sector, where internet forums are full of eager users, gleefully insulting each other about which smartphone is the best.

As long as your technology enables you to achieve your goals while being competitive, you should be happy. But what happens when your technology is actively holding your business back? What are the warning signs?

1. Updates stop happening

When updates to your existing tech stop being rolled out to you, that should be a pretty big clue that things are going to change. At this point, you are in a very good position to do something about it. Now would be a good time to start looking for other ways of doing things. You have time to plan and raise the necessary capital. Your tech colleagues who are in touch with the latest developments should be telling you that changes need to be made. However, your colleagues throughout the rest of the business may struggle to get behind such foresight.

2. Parallel technologies remove their support

Let's say, for instance, that the technology in question integrates with an Oracle database. When this technology starts dropping off the list of products that Oracle publishes as compatible, alarm bells should really start to ring. This is the clear signal that the rest of the world is starting to diverge from your technology. However, things will still run correctly, as long as nothing else changes on your network. So galvanising interest from the rest of your business may still be difficult. But time is starting to run out.

3. New technology is no longer compatible

A department will want to implement a new piece of technology. It will rely on your Oracle database infrastructure, just like your tech, but this new system requires the latest version. Upgrading will mean that the Oracle database will no longer work with your technology. A painful decision will have to be taken. Do you replace your old system and update everything else, or do you keep your old tech and update everything else to accommodate the new system? It is at this point that the simple addition of some new technology starts to become extremely expensive. It is too late. You missed your window of opportunity, so unless you have real buy-in from the rest of the business, your old tech will be effectively sandboxed by updating technologies that it relies on.

4. Technicians who can make important changes become very few and far between

At this point, your tech may be limping along with limited capabilities. As you have been stubborn about not changing your outdated system, you are unlikely to have any staff who can help you now. Unfortunately, the best analysts and developers will have spotted that your tech is no longer relevant, and left to pick up skills with a higher job-market value elsewhere. So what remains are colleagues desperately clinging to irrelevant skills, who may have a vested interest in avoiding change. This is where change becomes even more expensive, with a high likelihood of failure.

So how can we avoid such a problem from happening? Here's my view:

Promote a culture where capability and agility are prized over technology. Make sure that everyone understands that all systems have a built-in obsolescence, based on the pace of change.
Make sure your systems are kept up to date. Your IT hardware procurement and software licensing colleagues should be key to this. They should be watching all your infrastructure and predicting where software releases and/or hardware upgrades diverge in compatibility.
Keep a close eye on the errors raised by colleagues throughout the business. Watch for evidence of 'shaky' behaviour by systems that may be on the limits of compatibility.
When purchasing for new projects, make sure any new software and hardware is in line with your company's anticipated upgrade schedule for your infrastructure. (e.g. it is no good buying software that runs on Oracle 11g now, when your company is running on 10g and won't be upgrading for another 2 years.)

Most of this is common sense. But the clear message is that the longer you wait to change, the more expensive it becomes. There is a sweet spot to avoiding the "hyper-expensive computer fads" that may not give good value, while recognising when to change quickly to avoid system constraints and institutional inertia.

Friday, 4 January 2013

Viruses, Worms and Trojan Horses

Ask anyone about them, and they will tell you that computer viruses can be a problem. I conducted a quick straw-poll amongst some of my friends (ok, they're geeks too, so it might not be the most representative sample) and almost everyone told me they had had a virus at least once on their home computer. However, only a couple could tell me how they caught them, or the difference between the 3 types of malicious programs - Viruses, Worms and Trojan Horses.

Viruses are malicious programs that are attached or embedded within files or programs (hence their name). Almost all viruses are attached to an executable file, which means the virus may exist on your computer but it cannot infect your computer unless you run or open the malicious program. Viruses cannot spread without people running them or passing them on. They most commonly exist as email attachments.

Worms are slightly different. Although very similar to viruses in their ability to cause damage to computers and their files, they do not require people to exchange them, as they exploit any system that allows computers to exchange information. Once infected, a worm may take your email address contact list and send thousands of emails - all with the worm attached. Worms can grow exponentially, very quickly. Read about the most famous worm here and here.

Trojan Horses (more commonly known as 'Trojans') are designed to mimic useful programs - like file-sharing or even anti-virus programs. The user downloads, installs and runs them under the belief that they are getting legitimate software. However, this is far from the truth. Trojans can be used for many purposes. Some make you believe that your computer is infected with viruses and put you in contact with a help desk that charges you a lot of money to "fix" the "problem". They are also likely to be used to spy on the user, extract information from the computer, or gain remote control over it.

As these programs have got more sophisticated, the difference between Viruses, Worms and Trojans has been blurred. Programs have been designed to use multiple modes of transport. A worm may travel and spread through many routes including e-mail, IRC and file-sharing networks. It may also do a number of things - like damage files, then install a back-door that allows remote control of the computer or access to the data. The most advanced versions create botnets.

Botnets are worm programs that spread and work together to build a network of computers that can be controlled by an individual. These networks can be used to attack internet based services.

So, how do you avoid these problems? You take a layered approach:

Educate your users about risky behaviour
Block malicious websites
Do not mix business and personal use on computers
Keep operating systems up to date
Firewall your systems and networks
Regularly scan your computers using antivirus software
Keep your list of virus definitions up to date

Tuesday, 1 January 2013

Device dilemma

When the first mobile phones arrived, few could have predicted that they would evolve into small personal computers with highly interactive user interfaces. Little did we guess the amount of data they can now hold!

Then along came tablet computers - just as capable as the smartphones, but with larger screens. Coupled with a convenient cloud computing service, the sky is the limit!

For the first time, computers are becoming truly user-friendly. My 3-year-old daughter is a whizz on her iPad, and routinely does things that I would not have dreamed of doing when I first programmed a computer in the early 1980s! My father - a confirmed computer luddite - really enjoys using his Android tablet. But for every innocent person who envisages wonderful new applications for these devices, there is someone else exploring less altruistic ideas. Does your head of IT security have trouble sleeping at night? With the rise of these new machines, I can see why he/she might become a little restless!

There are many companies who take the stealing of data very seriously and enforce a zero tolerance policy for such devices, choosing to ban them from secure areas within their organisation.

Others have chosen to take a more relaxed approach. They have decided to allow people to use their own gadgets for work. There is a third option where the company may let users use business devices for personal use as well.

I am in favour of clear distance between business and personal use - never the twain should meet. This effectively means making sure business systems and processes cannot under any circumstances be accessed using personal devices. Some businesses may be less inclined to protect their data because the risk to themselves is small, but they could be putting their customers in danger.

Personal use is also more prone to the kind of social engineering that viruses and trojan horses exploit to infect your machines and/or steal your data. Mixing business and personal use on the same computer is a very risky thing to do.