Thursday, 26 December 2013

2014 Tech and Data Predictions

What a year 2013 has been. The smartphone culture is fully mature in the first world, with this year seeing the advent of stripped-down versions for developing countries and budget markets. Edward Snowden has sensationally brought privacy into the public spotlight, and it turns out that the conspiracy theorists were right! It is at this time - between Christmas and New Year - that I like to think about what will happen in 2014. Hold onto your hats, it's going to be a bumpy ride!

1. The mobile revolution continues
Although mobile telephone usage is peaking, 2014 will be the year when the wearable device becomes mainstream. Google Glass (smart glasses) and Samsung's Galaxy Gear (smart watch) have been the innovators, but Apple is bound to break into this market with a brand new smart watch. With this new device (and Apple's brilliant NLP marketing), I expect an explosion in wearable devices.

2. The internet of everything - not yet
While a few cars now come with internet connections, smart ovens, smart fridges and other connected devices in the home are still a long way from becoming mainstream. This is largely because existing appliances have a long operational life before they need replacing.

3. The year of the personal cloud
As mobile and wearable devices become smaller and more personal, expect network hard drives and home-built personal clouds to become the central repository for all of your data. People will then access their data over the internet by connecting to a VPN based in their home. Cloud service providers will suffer, as the Snowden revelations of 2013 have driven cynicism about letting other people manage your data for you.

4. In business, the analytics explosion continues
Business leaders will want to integrate more and more disparate data sources. This will drive the need for big data solutions. Wearable and mobile devices will transmit more and more useful location data. Visualisation techniques to overlay location data (and location movement information) with other metrics and measurements will need to be developed.

5. Big data moving out of IT and into business areas (almost)
Analysts will be able to utilise their SQL skills to analyse "big data" using arrays of computers. However, the preparation of the data so that it can be used in this environment will still have to be done by IT. Expect teething problems and delays with this approach. 
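
Tools like Apache Hive already hint at what this will look like: familiar SQL compiled into jobs that run across the whole array. A minimal sketch, assuming a Hive table of web logs (table and column names are illustrative):

  -- Hive turns this familiar SQL into distributed jobs across the cluster
  SELECT page, COUNT(*) AS visits
  FROM web_logs                    -- hypothetical table spread over many machines
  WHERE log_date >= '2014-01-01'
  GROUP BY page
  ORDER BY visits DESC
  LIMIT 10;

The query is the easy part; loading, partitioning and cleansing the data into web_logs is the IT preparation work mentioned above.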

6. Analytical skills command an even higher premium
As demand for data analysis expanded beyond the job market's ability to deliver qualified people, 2013 saw a feeding frenzy, with recruitment consultants poaching experienced staff across industries. With more and more analysts deciding to call themselves "Data Scientists", expect businesses to pay high prices for analysts in 2014, and even higher prices for the genuinely talented ones. As a result, 2015 will see solution providers come under extreme pressure from business owners to simplify their solutions and so drive down future wage costs.

7. Data management principles will continue to be ignored
Business owners will continue to ignore the fact that large proportions of their data are incorrect, and blame the data consumers and analysts for deriving 'incorrect' results. They will expect management information analysts to 'code around' data errors, rather than managing and fixing the data as an asset that is separate from the systems. In the end, this will cause problems when migrating to newer and better technology (e.g. big data).

8. Governments - new privacy laws and more internet censorship (Addendum)
2014 predictions would not be complete without the fallout from the Snowden revelations. Governments will rush in draconian privacy regulations. Their security services will largely ignore these regulations and continue their surveillance programmes. However, private corporations will have to comply, choosing to pass any cost onto their customers. Expect more 'internet' related scare stories as governments seek more excuses to further restrict the flow of information between people, now that they have filter technology in place with all of the internet service providers.

Best wishes for peaceful, prosperous 2014. 

Richard
The Data Geek

Wednesday, 25 December 2013

Merry Christmas

Here's wishing you all a happy, safe and peaceful Christmas this year. I hope all of your Christmas cards arrived at the right address and on time ;)

All the best,

Richard

Saturday, 21 December 2013

Who is driving your data management?

There was a time when all businesses really cared about was profit and process. We adhered to process, ticked the right boxes, kept the costs down, and collected our pay cheques at the end of the month. The customer was largely ignored and left to their own devices. Then as customers started selecting the best customer service, and as business management techniques improved, the customer became king. Delighting the customer was a mantra that pushed us on. But now something truly interesting is happening.

Our customers are becoming tech savvy.

First, it was businesses who insisted that we communicate electronically. But slowly the general public have become more and more knowledgeable about what computers are truly capable of. Now almost everyone has a computer. The common user makes daily decisions about managing the applications and data on their personal smartphone, and about whether they need cloud storage and backup strategies.

I have an application on my iPhone that allows me to match and merge duplicate contacts in the contacts application - all simply at the touch of a button.

We have all become data managers.

So these new tech-savvy customers will naturally expect high standards of data management from the organisations they do business with. We need to be ready. We need to know our data lineage, so we can plan and execute change successfully and swiftly. We need to optimise our data in every aspect of our business, so we can wring the last drop of value and opportunity from it. We need to know that the "Jon Smith" who bought one product from us is the same person as the "John Smith" who holds several more products.
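
As a sketch of what that matching involves in practice, here is a minimal candidate-duplicates query, assuming a MySQL customers table (all names are illustrative). SOUNDEX gives 'Jon' and 'John' the same code, so they surface as a candidate pair:

  -- Find pairs of customers whose names sound alike
  SELECT a.customer_id, a.forename, a.surname,
         b.customer_id, b.forename, b.surname
  FROM customers a
  JOIN customers b
    ON  a.customer_id < b.customer_id          -- avoid self and mirror matches
    AND SOUNDEX(a.forename) = SOUNDEX(b.forename)
    AND SOUNDEX(a.surname)  = SOUNDEX(b.surname)
    AND a.postcode = b.postcode;               -- tighten with address data

Candidate pairs still need rule-based or human review before merging - phonetic matching casts a deliberately wide net.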

Customer ignorance is no protection to businesses any more. Data accuracy and value needs to be the new mantra - because now our customers expect it.

Saturday, 14 December 2013

Big data - a conceptual example

I recently took my children on a trip to Jodrell Bank. For those who do not know, it is a very old radio telescope observatory in Cheshire that was built by Sir Bernard Lovell.

Although Jodrell Bank is getting very old, and is being overtaken by space-based telescopes like Hubble, it is still a shining example of ingenuity and science in action, and it still has a vital role, gathering information about pulsars around the galaxy.

If we consider that science is the collection of data, it is surely best to acquire the greatest amount of data possible. This is why telescopes got larger and larger - to catch more data. 

But science wants to collect far more data than any one conventional telescope can manage. So astronomers have stopped building single large telescopes like the Lovell Telescope at Jodrell Bank, and instead build arrays of smaller dishes.

An array is a set of radio telescopes that are controlled together and pointed at the same part of the sky. The information is collected from all of them and put together. Modern Big Data works like an array - because one large computer is just not good enough any more. Data has become too large, varied and complicated for it.

Big data solutions are arrays of computers that are joined together to process very large and complex data sets. They employ special hardware and software to ensure all of the work is shared across the computers in the array. The end result is that huge, complex data sets can be processed much quicker than before.

Saturday, 7 December 2013

Tips on how to get on in Tech

You may be surprised to hear that I did not originally want to work in data or technology. In fact the idea of sitting at a desk for any length of time used to fill me with terror. When I did find myself working in an office, I gravitated towards technical roles because they interested me. Some people are not so lucky, and may find themselves having to do something they find difficult. Here are some tips to get you more proficient in the technical aspects of your role:

1.  The internet is your friend.
Believe it or not, whatever piece of technical equipment you need to operate, there are forums somewhere on the internet dedicated to people sharing knowledge about it. Join up and share your problems. There is a wealth of support out there.

2.  Cultivate a strong sense of curiosity
The great thing about tech people is they like to share their knowledge. If you have colleagues who are technically proficient, swallow your pride and ask them. Don't forget - there is no such thing as a stupid question. Ask, then shut up and let them tell you.

3.  Apply practically
An idea or concept is useless unless it can be put into practice. So if you learn something, look for ways in which it can be applied and implement them - before you forget it altogether.

4.  Don't be precious about your methods
You've learned a programming language or a package, and it's a great sense of achievement. But technology is always on the move. New things come along all the time. Sometimes that means we need to discard old ways of doing things in order to improve. Tying yourself to one package may give you problems when the industry changes direction.

Sunday, 3 November 2013

When moving is a problem

Mobility of staff within an organisation can be perceived as a good thing - the result of an efficient HR department and a company that likes to remain agile. However, it exposes an often-neglected flaw in how many companies administer their employees' system access and the authorisations that accumulate as a result.

Many organisations have processes that involve multiple people. This is often for good reason: to spread risk and introduce checks and balances that prevent fraud or human error. Imagine that an employee starts off in a front-office role, answering phone calls. Let's call him Bob....

Bob has a set of screens that allow him to request functions for the back office to complete. Perhaps one of these functions is requesting refunds for defective products for the back office to process.

Then Bob gets a promotion that allows him to work in the back office, printing the same refund cheques that he used to request while he was in the front office. Most organisations will add the new functions, but they will rarely take away the old redundant authorisation.

Bob now has access to both front and back office functions that enable him to request and print cheques without scrutiny from the rest of the organisation.

So I would ask - how many of your colleagues have been put in this position? Are they even aware that they are exposed to fraud and excessive operational risk? Conduct an audit. See for yourself. You may be surprised by the results.
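
If your entitlements are held in a database, the audit can start with a single query. A minimal sketch, assuming a table mapping users to function codes (all names are illustrative):

  -- Users who can both request a refund and print the cheque
  SELECT req.user_id
  FROM user_entitlements req
  JOIN user_entitlements ful
    ON req.user_id = ful.user_id
  WHERE req.function_code = 'REFUND_REQUEST'
    AND ful.function_code = 'REFUND_PRINT';

Every user this returns is a Bob.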

Saturday, 26 October 2013

How secure are your passwords?

In this non-stop, always-on digital world, it's not unusual for the general public to have a large number of user IDs and passwords for various websites. Many sites like Amazon, eBay and PayPal also hold your credit card details for convenience.

Take this into the business world, and a user can have access to many technical areas on servers all over the place:
  • Databases
  • Platforms/servers
  • Web applications
  • Secure FTP areas
  • Personal computers
  • Mainframe applications

User authentication is extremely important to make sure the right people can conveniently access their systems and services, while preventing unauthorised exploitation.

Companies may choose to give their users a generic user ID that can be used for all of their systems. Public websites often ask users to use their email addresses as user IDs. This puts increased importance on the security of the password for each system. We can use entropy (my favourite nerd term) to measure the effectiveness of a password.

Entropy is a measure of the disorder or unpredictability within a system. For passwords, entropy measures how unpredictable each character in the sequence is. The higher the entropy, the more secure your password.
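
A rough rule of thumb, assuming each character is chosen independently and uniformly from an alphabet of N symbols, with L characters in the password:

  E = L \times \log_2(N) \text{ bits}

  % Worked examples:
  % 8 lowercase letters:             E = 8 \times \log_2(26) \approx 38 \text{ bits}
  % 8 mixed-case letters and digits: E = 8 \times \log_2(62) \approx 48 \text{ bits}

Every extra bit doubles the number of guesses a brute-force attack needs, so length and a varied character set both pay off quickly.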

There are a few different ways to find out the correct password:
  • Stealing
  • Social engineering (misleading you into divulging your password)
  • Guesswork
  • Brute force
A high-entropy sequence of characters will make your password impervious to guesswork and much harder to crack by brute force.

Guesswork involves using knowledge of popular passwords, like '1234', 'admin', '9999' etc.

Brute force involves using a piece of software that bombards the application with password after password until it finally hits the correct one. The higher the entropy of your password, the longer such a program will take to discover it.

There are many precautions we can take to secure our information from hackers, governments and thieves. This is the first in a number of articles in which I intend to raise awareness of information security for the ordinary user, and of why we all need to be vigilant in the workplace.

Saturday, 5 October 2013

Corporate laughs

Is business intelligence a contradiction in terms? How come so many smart people can often come together to mess things up so badly? Applying cold, hard logic to spontaneous communication can be hilarious. Here are some of my favourite corporate faux pas:

As of tomorrow, employees will only be able to access the building using individual security cards. Pictures will be taken next Wednesday, and employees will receive their cards in two weeks.
(Microsoft Corp. in Redmond WA)

What I need is an exact list of specific unknown problems we might encounter.
(Lykes Lines Shipping)

E-mail is not to be used to pass on information or data. It should be used only for company business.
(Accounting manager, Electric Boat Company)

This project is so important we can’t let things that are more important interfere with it.
(Advertising/Marketing manager, United Parcel Service)

Doing it right is no excuse for not meeting the schedule.
(Plant Manager, Delco Corporation)

No one will believe you solved this problem in one day! We’ve been working on it for months. Now go act busy for a few weeks and I’ll let you know when it’s time to tell them.
(R&D supervisor, Minnesota Mining and Manufacturing/3M Corp.)

Quote from the Boss: “Teamwork is a lot of people doing what I say.”
(Marketing executive, Citrix Corporation)

My sister passed away and her funeral was scheduled for Monday. When I told my Boss, he said she died on purpose so that I would have to miss work on the busiest day of the year. He then asked if we could change her burial to Friday. He said, “That would be better for me.”
(Shipping executive, FTD Florists)

We know that communication is a problem, but the company is not going to discuss it with the employees.
(Switching supervisor, AT&T Long Lines Division)

Have a great day.

Sunday, 29 September 2013

Retention policy - a fundamental principle

There are a lot of annoying things in life you naturally expect - like taxes, bad television and crime. But I did not expect such vehemence in response to a recent article I wrote about the value of data and the conflict of interest it creates for cloud providers.

It seems there is a groundswell of opinion in the technology world that all data should be retained, as valuing and prioritising it is too hard to do. Furthermore, these people (often cloud providers) could not see the harm in keeping everything indefinitely.

There are some very good reasons why you should be scrapping data on a regular basis, and here they are (with a small example of what that looks like in practice after the list):

1.  It will save you money
2.  It will make you compliant with regulation
3.  It will improve the trust with your suppliers and the general public
4.  As reality changes, your data will degrade in quality over time until it is little more than useless.
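
In practice, regular scrapping can be as simple as a scheduled job enforcing an agreed retention period. A minimal sketch, assuming a MySQL table of customer events and a seven-year policy (the table name and period are illustrative):

  -- Run monthly: remove anything past the agreed retention period
  DELETE FROM customer_events
  WHERE event_date < CURRENT_DATE - INTERVAL 7 YEAR;

The hard part is not the code - it is agreeing the retention period and sticking to it.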

But the most important reason is something a little more fundamental. People have a right to make mistakes and have them forgiven and forgotten. People also have the right to change their opinions and image. They have the right to reinvent themselves and move on. If we cannot discard what did not work, how can we move on in a healthy and positive manner?

Friday, 20 September 2013

Questions to save you from data quality meltdown

In a large, busy organisation there are, inevitably, a lot of problems. Many issues can be assigned to data quality. But one of the biggest pitfalls for a data quality team is taking on too many assignments.

Colleagues with over-simplistic viewpoints may use the data quality department as 'long grass' to conveniently kick their problems into. Here are some questions to ask yourself to prevent a data quality team meltdown.

Is it really a data quality problem?

A popular mistake is to assume that operational or functional problems are 'data quality'. For example - if a telecommunications company keeps debiting a customer's monthly charges, even though they cancelled their mobile phone connection, it is not a data quality problem. It is an operational problem. The data correctly reflects what money the customer has paid. Spot these kinds of problems early and remove them from your inbox.

Is there a conflict of purpose?

Systems, databases and data marts get built for specific purposes. It can be tempting for analysts to try to use them for other purposes. When the data doesn't work as expected, they may declare that the data needs 'fixing' so they can use it. These are not data quality issues. They are issues for developers to solve.

Is it a nomenclature issue?

Naming terms can be a problem. Departments may have different names for the same thing - or even worse - use the same name for two completely different things. Push the problem back to them until they can articulate their technical terms in plain English. Don't be afraid of sounding stupid by asking for this. It can uncover a lot of underlying problems.

Is there a migration issue?

When data gets migrated from one system to another, not all of the data for the new system will be contained in the old one. Very often these missing data items will either be blank or have default values. Although they are data quality issues, they cannot be fixed because the data was never collected in the first place. Get to know the start dates of each system, and the limitations of any migrated data.  It could save you a great deal of running around.
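
A quick profiling query will show you the scale of the problem before anyone raises it as an 'issue'. A minimal sketch, assuming MySQL and illustrative table, column and default values:

  -- How much of each source system's data is blank or defaulted?
  SELECT source_system,
         COUNT(*) AS total_rows,
         SUM(date_of_birth IS NULL) AS missing_dob,
         SUM(postcode IN ('', 'XX0 0XX')) AS defaulted_postcode
  FROM customer
  GROUP BY source_system;

Publish the results alongside each system's start date, and the 'unfixable' gaps explain themselves.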

Are you trying to boil the ocean?

There are some issues that require large-scale intervention. Be honest about your capabilities, and get the correct resources assigned. If this means rejecting issues that are too large, reject them until the right resources become available.

These are all common-sense questions to ask yourself before accepting data quality issues. If you have any others, please use the comments below.

Saturday, 24 August 2013

3 sure signs you need a data warehouse

In these challenging financial times, it is easy for those controlling the purse strings to ask, "Why do we need a data warehouse?" This is particularly true among small and medium-sized enterprises, which may believe they are running quite nicely without one.

So what should a manager be observing that gives them clues they need to think about a data warehouse?

1. Your legacy data is expensive to access
You may acquire segments of your business through mergers or acquisitions. You could have historic databases containing old products or services. If it's taking a lot of time (and therefore costing a lot of money) to access this data and make sense of it, then it could be time to migrate it into a data warehouse.

2. Management information is not joined up
You have to get reports from different parts of your business, and they don't seem to agree - even though you operate from the same customer base. The reason for this is that your organisation is not integrating the data correctly and consistently. The best way to integrate data for a single version of the truth is to build a data warehouse.

3.  Ad-hoc reporting is slow
All you wanted was one extra data item added to an existing report - so how come it's taking so long? Quite simply, your measures are 'point and click' solutions that have their own ETL functions, so whenever a change is required, the whole solution needs to change and be re-run. A data warehouse will already have the data in its most usable form. All of the processing will have been done overnight, and your analyst just needs to add the new field, query it and present the results.

Data warehouses are a practical and common-sense way of consolidating important information from all over the enterprise. They are expensive to set up, but once implemented, they will give your organisation a high level of reliability and stability.

Friday, 28 June 2013

5 things to consider when joining data

Warning - we are in major geek territory here. This assumes basic understanding of SQL. You have been warned! 

OK, you have been asked to get some data together for a project. The data is scattered across different tables and needs to be joined together before you can get any insight from it. Here are 5 things to help you avoid some of the pitfalls that can happen.

1.  Consider location carefully
Your analysis software is very powerful. Many business intelligence packages will let you join a table on one server to a table on another server, without ever considering whether it is a good idea to do so. I once heard of a user in Cheshire who tried to join a couple of million records on a server in London to half a million records in Hong Kong, and was surprised when he brought his entire network down. Don't do it! Extract your data from one server, place it in a temporary area on the other server, and join the data there. Or better still, import both tables to a local server before joining them. It may be a little more inconvenient to code, but don't cross your network administrators, else ye will pay a terrible price!

2. Keep your joins simple
Yes, it's flash and kind of impressive if you join 5 tables together within one SQL statement. But one day, someone else is going to have to examine your data when it is challenged. They are going to get to your large SQL query and realise that the problem is somewhere in the tenuous joining you have done. It's much better to join one table to another, then join the product of that join to the next table, and so on - as sketched below. It takes longer to code, but means that gaps in referential integrity are much easier to spot.
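
A minimal sketch of the stepwise approach, assuming MySQL temporary tables (all table and column names are illustrative):

  -- Step 1: join the first pair and keep the intermediate result
  CREATE TEMPORARY TABLE cust_orders AS
  SELECT c.customer_id, c.surname, o.order_id, o.product_id
  FROM customers c
  LEFT JOIN orders o ON o.customer_id = c.customer_id;

  -- Step 2: join the intermediate result to the next table;
  -- comparing row counts between steps exposes referential integrity gaps
  SELECT co.*, p.product_name
  FROM cust_orders co
  LEFT JOIN products p ON p.product_id = co.product_id;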

3.  De-duplicate the keys before joining
Just do it. Make it a habit. Even when the data looks right. One day it might be wrong.
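
A minimal pre-flight check, assuming the join key is customer_id (illustrative):

  -- Any key appearing more than once will multiply rows in the join
  SELECT customer_id, COUNT(*) AS occurrences
  FROM orders
  GROUP BY customer_id
  HAVING COUNT(*) > 1;

If this returns rows, de-duplicate (or aggregate) before you join.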

4.  Use indexed fields
Ideally, you should be using indexed fields for the joins and also any other selection criteria. Indexes vastly improve processing times. If the fields are not indexed, you can add an index when you import the data. 

5.  Outer join and coalesce
When joining tables, consider what you want to happen to the records that don't match. It is a lot simpler to 'left outer join' the tables so that the missing records still appear in the query results. You can then use the coalesce function to add in an identifier for your missing data. It makes your reports more transparent if bad data can be categorised. It keeps your organisation honest and makes your results easier to check.
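
A minimal sketch of the pattern, with illustrative names:

  -- Keep unmatched customers and label the gap explicitly
  SELECT c.customer_id,
         COALESCE(o.order_status, 'NO ORDER FOUND') AS order_status
  FROM customers c
  LEFT OUTER JOIN orders o ON o.customer_id = c.customer_id;

Anyone checking the report can now count the 'NO ORDER FOUND' rows instead of wondering where the missing customers went.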

Saturday, 22 June 2013

Big Data - Keep calm and carry on?

It is an uncomfortable truth that oil has been a contributing factor in nearly all recent international conflicts. So when our Big Data evangelists tell us that "Data is the new oil," it may be more of a problem than we currently understand.

Julian Assange recently stated that the internet has become a militarised zone. He was responding to the recently leaked revelations about the surveillance system 'Prism'. But is this something new?

Whether we like it or not, the internet and criminality have always been uneasy partners. Everyone gets phishing emails. We all know not to follow links for generic Viagra or bank password resets. We also know that those polite, badly worded emails telling us we have inherited millions of dollars are a bit too good to be true! You only need to see the number of virus definitions that your anti-virus software downloads each week to realise that computer spying and infiltration have been an accepted way of life for many years.

We have become so accustomed to implementing our own defence systems (firewalls) and counter-measures (anti-virus) that we have become blind to the reality. There is a war on for our data. The Americans haven't just invented spying. It is a burgeoning international business. How could we possibly forget the Leveson revelations of computer hacking by the red-top journalists of Fleet Street? How about the Stuxnet worm that was allegedly developed by an alliance between Israeli and US intelligence? China and Russia have also been implicated at other times.

It is clear that Stuxnet was the first reported government-sponsored worm to achieve successful military sabotage. The worm targeted Iran's uranium enrichment programme, causing major failures in the specialist control systems manufactured by Siemens.

So as data becomes increasingly valuable, it becomes an even greater target. With spying comes other activities of warfare - destruction of property and the disabling of capability.

The new "Big data" warehouses being built are of such value and importance that they may be too big and important to fail. We are becoming increasingly dependent on our data. Distributing computer operations over large arrays of nodes decreases risk by spreading operations over a collection of cores, but this complexity also increases the possible points of failure and the ease with which sabotage can happen.

With the rapidly growing rewards of big data, we must take great care to understand the risks. We are now building ubiquitous systems of such importance that they become targets on a military-industrial scale.

Saturday, 15 June 2013

The end of social media part 2

In a previous blog article, I mentioned how social media was being increasingly leveraged by corporate interests, as users faced complex privacy options and more intrusive advertising.

I now want to address the effect of America's Prism surveillance system and its possible implications for the social media business.

For Facebook, this is particularly damaging, as it undermines the credibility of their already unpopular privacy settings. There are still many fundamental questions that need to be answered. The Prism documents refer to data being directly acquired from internet companies. Facebook and Microsoft both deny culpability, quoting their stats about formal information requests. While it is possible to intercept information between computers, it becomes far easier if they are complicit in the operation.

Facebook has been troubled by pro-child-abuse pages popping up, and at times it has appeared that Facebook does not have the resources to bring them down fast enough. Cyber bullying is on the rise, too. Children across America are being shot in school. These are problems that Prism would be excellent at confronting. Yet it is clear from Facebook's continuing problems and the recent escalation of school shootings that the NSA is not protecting children in the USA.

This suggests to me that they are either not getting the results they wanted, or they are only picking projects that cannot be traced back to Prism.

So if Prism isn't being used to improve the social media experience, suspicions will run high as to the intentions of the NSA. This does not bode well for the reputations of the social media providers that fall within Prism's reach. It also raises questions about the increasing militarisation of the internet.

Will people turn away from social media? Do people want to be monitored 24/7? What are the guarantees that Prism won't be used for 'special political interests'? Two things are clear: 1. It is a tough time to be a social media provider. 2. The full truth has yet to come out.

Friday, 14 June 2013

5 steps to valuable data quality measurement

Kaplan and Norton's Balanced Scorecard is a way of defining and measuring strategic performance. It is probably the most used tool in management today. A data quality department may wish to have its progress measured on one or more of the quadrants.

But the full corporate scorecard itself can provide great guidance as to the strategic direction of data quality measurement and remediation for the whole of your organisation.

1.  Identify key data items
For each goal in your balanced scorecard, identify the fields, tables and databases that are critical to the delivery of the scorecard objectives. If you can, also include the data items that are being used for the scorecard measures.

2. Prioritise each data item
The key to this part of the exercise is to ensure that the most important data items for each measure get the full attention they deserve. You will probably have a lengthy list of data items for each goal within your balanced scorecard. Cut out the items that are not important. If you still have a large number of fields, try to assign priorities and weightings to them.

3.  Agree business validation rules
Once you have a comprehensive list of fields and tables, go to your business and agree the business rules that you will use to validate each field. 

4.  Measure the quality of your data
Apply the agreed business rules to all of the fields and tables identified above. Roll the scores up into the items that you originally started with. You now have a scorecard of data quality in relation to your corporate Balanced Scorecard.
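
A minimal sketch of a business rule expressed as a measurable query, assuming MySQL and an illustrative table and rule (policy holders must have a plausible, adult date of birth):

  -- Rule: date_of_birth present and between 1900 and 18 years ago
  SELECT COUNT(*) AS total_rows,
         SUM(date_of_birth BETWEEN '1900-01-01'
             AND CURRENT_DATE - INTERVAL 18 YEAR) AS passing_rows
  FROM policy_holders;

passing_rows divided by total_rows is the score for that rule; weight and roll the scores up as described in step 2.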

5.  Take it onwards with actions
What you have developed is a powerful baseline that informs your colleagues exactly how well the quality of data underpins your strategic corporate values and goals. Next steps are to prioritise any remedial action that is required, and agree all targets for improvements.

Saturday, 8 June 2013

The best form of defence

Recently the Guardian newspaper broke a story that the National Security Agency (NSA) in the United States had implemented an extensive programme to acquire and monitor all online communications. The project has been active for the past 6 years, and acquires data from Google, Apple, Microsoft, Facebook, Paltalk, AOL and YouTube.

Put all of these services together and you will realise that they also cover FaceTime, Google Talk and Skype - all video and internet telephone-conferencing facilities - as well as email and social media. Couple this with the recent discovery that the NSA also acquired access to all of Verizon's telephone communications, and you have the largest, most insidious and far-reaching national and international communications surveillance programme of all time.

This level of surveillance displays paranoia on such an industrial scale as to make the cold-war McCarthy witch-hunts seem like a storm in a teacup. It is quite right for everyone to be extremely concerned. As immediate allies of the United States, the UK government is quite rightly under extreme pressure to disclose its involvement. The internet service providers are at present denying culpability.

Which brings me neatly to my conclusion. The best defence is trust. If your customers know you are doing the right thing with their data, they will stick with you. If your data is wrong, or you are doing unethical things with it, expect trouble. Good data governance ensures that you honour your obligations to your customers, and prompts you to question when your government asks for too much.

Thursday, 6 June 2013

Size vs value

It is a widely quoted stat that business data is expected to more than double every year. This is due to a perfect storm of the increasing proliferation of data generating devices, and the rapidly decreasing costs of collection and storage. 

The rise of cloud technology offers enormous advantages to businesses everywhere. This cannot be overstated. Simply put, using third parties to store information means that companies no longer need to pay large amounts for server space that they may never use. Instead, they pay their cloud provider just for the space they use.

While this is a fantastic idea, we can easily get sucked into storing data just because we can. Because data is measured - and, more importantly, charged - by size, it creates a market demand for data: not because it is useful, but because it is an asset that provides an income.

A recent Digital Universe study found that only 0.5% of all data is actually analysed. So is the other 99.5% useful, wrong, or just waiting for technology to catch up?

The much-promised 'Big Data' solutions have not achieved critical mass within the IT industry, with people talking about them more than implementing anything. So what is happening to all this excess data? The truth is, we are all paying for it in one way or another - in the price of our goods and services, or in the taxes we pay to our governments.

Data size is only important to cloud providers, as that is how they choose to charge people. The challenge is for everyone to find a better way to assign value to their data.  Only then can we keep the data that can take our lives forward and reject the waste that is clogging servers all over the world.

Friday, 31 May 2013

Brave new world... are we there yet?

Following recent high-profile terrorist activities, one of my friends asked me whether it was possible to leverage the latest IT solutions to monitor every email, chat room and internet message for terrorist-related activity.

Setting the legal, privacy and personal freedom issues aside, let's take a pragmatic approach. It is easy to look at the new big data solutions and the amazing hardware that is available and say, "in theory, it can be done".

In theory......

Firstly, have a good honest look at the data on your systems. How good is it? Perhaps 80-90% correct?

Let's assume that there are about 1,000 terrorists in the UK, which has approximately 64 million inhabitants. Even if we have access to every piece of information that is exchanged, and even if the data is 99.9% accurate, that remaining 0.1% error rate means flagging tens of thousands of innocent people. We would probably need to arrest and question some 64,000 people just to capture 1,000 terrorists.
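
The arithmetic behind that claim is a simple base-rate calculation, assuming the 0.1% error rate applies to every innocent person monitored:

  \text{innocent people wrongly flagged} = 64{,}000{,}000 \times (1 - 0.999) = 64{,}000

So even with implausibly good data, the innocent would outnumber the guilty roughly 64 to 1 among those flagged.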

That would annoy a lot of people!

Now given that your computer has perhaps 90% correct data and that others are far worse than you... How much data do you think is actually correct on the internet? How much information that people exchange through the internet is truth? It's a darn sight less than 90%.

Then you have people using code words, different languages and encryption.

We can build the most powerful computers in the world, and the most sophisticated solutions, but unless the data is correct, we don't stand a cat in hell's chance of getting anything right. A system is only as good as the data that it holds.

So don't expect Minority Report-style analysis of our data, with predictions of where bad things will happen next, any time soon. For if the truth be known, these amazing new systems will probably fall apart when we put real-world, inconsistent data into them.

Tuesday, 28 May 2013

Defection to Linux

It's official. I have changed my home computer's operating system over to Linux. I have had an iMac for the last 6 years. 

It has been a great piece of kit, and has delivered stable performance throughout that time. There is no doubt that when I bought my iMac, it delivered the very best computing experience that was available at that time.

But things change.

The new version of OS X - Mountain Lion - cannot be installed on my machine, because its specification is too low. This was to be expected, as my iMac only just runs Lion well, and has become extremely slow and more than a little buggy. I decided to check out the new range of Apple computers. Frankly, I was disappointed.

None of the machines in the new range has an optical drive any more. For machines costing over a thousand pounds, you would expect them to be bristling with features.

I have watched over the years as OS X features have become increasingly restrictive. The way iTunes' digital rights management locks you into the Apple revenue stream is the clearest example of this. With the advent of the software centre and the removal of the optical CD/DVD drive, they are moving ever closer to dictating what users can and cannot do with their own computers.

For me, removing the CD/DVD drive from the whole new range of home computers was the last straw. I decided that I wasn't going to cave in to Apple's planned obsolescence of my machine and the locking of me into their revenue stream.

I first experimented with Linux operating systems running on virtual machines (Mint, Ubuntu, CrunchBang, Mageia, Debian and Fedora, to name a few). When I found the one I liked, I partitioned my iMac and dual-booted into it. The performance is outstanding compared to Lion, and it is rock-solid. I have TOTAL control over my computer, and if I don't like what is on it, I can make any changes I want.

Linux has breathed new life into a computer that I thought I would need to trade in for a newer model. I am back to being happy with my iMac, and I predict that I have extended the useful life of my computer by another 3-5 years. 

Saturday, 25 May 2013

Data Quality - when the goalposts move

Abbreviating, shortening or simplifying language is not new. In England, the shorthand writing system was introduced in 1837; designed to record meetings and dictation, it quickly found favour amongst secretaries all over the world. Businesses are always trying to find ways of speeding up communication with acronyms. And who could forget the CB radio craze of the 1970s and 80s?

The internet and text speak have introduced some real changes in grammar, spelling and syntax. Text speak shortens words - "U" instead of "you". Joining words into one string delimited by capital letters and then adding a popular file extension to make it look like a file name is an internet forum trick - "can't believe what's happening" becomes CantBelieveWhatsHappening.jpg

Twitter's limiting of communication to 140 characters has forced the use of hashtags to precede search terms - e.g. #DataQuality. And to refer to people's user names, you precede them with an @, like @TheDataGeek (which is my user name - follow me).

What makes text speak stand out is the sheer speed and scale upon which it has been taken up by the international community. Integrate this with the internet, and you have one of the most significant changes in international communications in modern times.

While the greater corporate interests are well known, some interesting social trends are beginning to emerge. People are starting to use text speak while filling in formal documents like CVs and business letters. Presently, this is frowned upon, but soon we will have to amend our algorithms to allow for them.

How long will it be before some people will want to put a delimiting character before their name? The possibilities are endless:

My name could become @RichardNorthwood or @richardnorthwood or even #RichardNorthwood

Could the &, @ or # become new gender neutral salutations? Will people start using other symbols in their names? Could we see the removal of spaces between words?  Could we see a further simplification of the spelling of words? It's possible. Whatever happens, our information systems must evolve to cope with the biggest change in the way we communicate since the Gutenberg printing press.

Friday, 12 April 2013

New incentives for data quality

Few would argue that the current international banking crisis is the biggest example of corporations working in an unsustainable manner. Barclays Bank's Salz review placed the blame squarely on a culture of short-term gain.

But a careful appraisal of nearly every sector of our society will find organisations locked into short-term, mechanistic behaviour patterns, over-emphasising the attraction of new business while neglecting the customers and services they already have. What is more fundamental is that millions of ordinary workers across the globe are currently incentivised and paid based on these values.

But what has not been clear is how such cultures should be replaced, and how we are to keep our workforces correctly motivated. A clue can be found in recent legislation aimed at financial services (Basel III, Solvency II).

More emphasis has to be placed on getting things right.... first time.... every time.

How do we do that? We measure the quality of the data that our businesses manufacture, and we reward ourselves based on how well we are doing. Data quality is easily measurable, and the results are unambiguous and far clearer than a customer satisfaction survey or industry awards. 

Saturday, 6 April 2013

5 signs that you are running a clumsy business

In these challenging times, agility is key to every organisation's survival. A business that cannot change will be swiftly overtaken by its competitors. How do you know your agility is a problem? Here are some major warning signs:

1.  Disparate departments cannot agree over basic facts
If your sales teams and your treasurers cannot agree on how many sales you did last month, you have a serious problem. It is acceptable for areas to have different measures, as long as the differences are understood and that everyone accepts them. If you cannot reconcile, you are well on the road to decision paralysis.

2.  Process becomes more important than competence and training
When times become hard it is easy for businesses to cut their expenditure on training and focus on process engineering. Locking customers and services into engineered, mechanistic processes is easy to organise. But when a customer's requirements fall outside that process, or agility is required to make changes, it is the training and skill of your staff that will pull you through and delight your customers.

3.  No-one owns the data
In medium to large organisations, it is rare for the manufacturers of information to also be the consumers of it. Without data governance to assign a responsible and accountable owner, there is little desire within the organisation to spend time, effort and money on fixing incorrect data.

4.  There is a gap in knowledge between your technical and business areas
Your technical areas know how things work and keep the computer processes running without error; your business areas focus on the processes and the customer. But in between, the content and structure of tables and databases are not understood, and very few people know how to use the data or whether it is right or wrong.

5.  The impact of change on data is not understood
If someone wants to add a new product, service or feature to your organisation, they may not realise that a change in their area means changes need to happen in other areas. Typically, technical systems may need rows added to reference tables so that the data appears correctly in management information and other services. New changes may be fine in one system, but may damage referential integrity with other related systems.

This list is not exhaustive, but these are the main agility risks that good data management can address. Do you know any more? Leave your comments below.

Saturday, 9 March 2013

Equality in Technology

Friday the 8th of March was International Women's Day. Companies like The Co-operative Bank, EON and African Development Bank spent the day raising awareness and celebrating the success of women across the world. 

I am really keen on equality in the workplace. A mix of personalities and genders makes for a more vibrant and interesting work experience. It made me think about the role of women in the technology industries. I did some internet searching, and found two websites relating to women in IT in the UK. One had crashed, and the other had moved onto a social media platform (which tells a story of its own!). So please excuse the fact that the stats I have found relate to the United States only.

In the US Fortune 500, 15 percent of companies have women on their board. Move into Silicon Valley, however, and the number drops to 7 percent. All of Apple's senior management team are men.

Looking further down the pecking order, the proportion of women entering IT peaked in 1984, when 37.1% of computer science qualifications were awarded to women. By 1998 this had dropped to 26.7%, and it has since held roughly steady, with women making up 27-29% of the US IT workforce in 2006.

So how are women doing when they get an IT job? In 2012, the top three adverse influences on women's career advancement were: Work/Life balance (35%); Confidence/Self-belief (30%); and culture of the organisation (30%). 

From my own observations, I suggest that the lack of confidence and self belief is a lot higher, and may explain the reticence of women to flex their IT muscles. There have been many companies I have worked in where there was a culture of learned helplessness. One company in particular had a clinical archive document scanning system that was operated by a team of women. But when one of the scanners crashed, they rang a man from IT, who would have to walk down a long corridor and two flights of stairs - just to reboot their desktop computer and restart the application!

If more women were to choose a career in IT, I am sure they would set a positive example and raise expectations of resolving some of the more entrenched problems in the IT world - poor communication skills, groupthink mentality and weak stakeholder management.

Tuesday, 5 March 2013

Book Review - "Getting Started with Talend Open Studio for Data Integration"

Companies can spend thousands of pounds sending their technicians on training courses. Most are money well spent. But there are many people who like to discover new techniques for themselves and prefer home study to the formal office or classroom setting. Also, freelance developers cannot always afford the four-figure costs that modern software houses charge to take their courses.

To fill this gap, there is a burgeoning industry of self-help manuals that introduce you to the software of your choice. Following my recent review of the open source data integration tool - Talend's Open Studio, I thought it might be useful to follow up with a review on an interesting publication that can help you get started using it. 

The book is called "Getting Started with Talend Open Studio for Data Integration" and is written by Jonathan Bowen. It is available as an e-book ($22.94) and in printed form with free e-book ($44.99). There is also a Kindle version at $19.47. The book can be purchased from amazon.com, amazon.co.uk, Barnes and Noble and Safari Books Online.

Following a brief and important introduction, the instruction starts by showing you how to download Open Studio from Talend's website and guides you through installing the software on your PC. The Talend software comes with example data and jobs, and there is an appendix in the book that shows you how to install the sample data. The book effectively uses the sample data to walk you through the basics of file transformation, then moves swiftly on to working with databases.

For the database examples, you will need to download and install MySQL, an open-source database that can run almost anywhere, along with the tools to administer it. The book gives you the trusted links to download MySQL, but you will need to refer to the MySQL documentation to install it and get it running. This is really worth doing: MySQL is free and very easy to use.

Once you have MySQL running, the rest of the book really flies. You start to learn the really useful stuff for ETL, like connecting to databases, creating and amending tables, filtering, sorting, enriching, normalising and aggregating data. Once you are proficient, the book turns to automation, orchestration, file transferring and the generation of variables. You can then join individual jobs together to make flow processes that can make decisions and take different actions based on many different outcomes. 

If you think that all this sounds complicated, you are going to be shocked. It is not. If I can do it, most people can. You can read about what I did in a weekend with the Open Studio here. One caveat: if you are installing Talend on Mac OS X, or on a PC running Ubuntu or any other unix-like operating system, note that the book's file paths are for Windows computers (i.e. C:\My Documents etc...) - but I'm sure you can work that out.

The whole layout of the book is very straight-forward, with plenty of pictures on how your work should look. The language is simple and free of jargon. Explanations are just the right balance of detail and simplicity. It is obvious that a lot of care and consideration has gone into making each chapter informative yet succinct.

Maybe you wish to deploy and develop Talend for data integration at work. Perhaps you want to be a freelance DI developer.  You may just want to run a little computer project at home. Whatever your requirements, "Getting Started with Talend Open Studio for Data Integration" by Jonathan Bowen is a valuable reference that you will use again and again.

Friday, 1 March 2013

The end of social media?

Are you on Twitter? I am. I like using it to promote these articles. It is a very good way of contacting a large audience, quickly. If you search on twitter "#SocialMedia", you will find endless lists of tweets and articles about the importance of social media for the future of modern businesses.

What is even more interesting, is if you put "#SocialMedia" into any of your tweets, expect to be automatically followed by about a dozen social media gurus! It's an easy way to build followers - if you are into that kind of thing.

It used to be that the only people who were making money out of social media, were the people who were selling social media get rich quick schemes. But the valuation of Facebook at $104 bn last year alluded to the increasing interest of large corporations in what you had for lunch!

When I first used Facebook, it was a large page with all of the content relating to me and the people I knew. As the years have gone by, a bar full of adverts has appeared on the right-hand side of my page. The Facebook privacy rules keep changing, too, and I keep having to make sure that they do not use my content to sell other products. Every now and then a note appears saying "Jenny liked McDonalds. Do you like McDonalds?"

I predict that it will not be social media for much longer. It will be corporate media. And all of the fun will have gone from the medium, as every little interaction will prompt manic selling on behalf of interested businesses. The smart people will have moved on to another way of expressing themselves, and this bubble will slowly deflate into the shallow and drab procession that television currently is. 

Sunday, 24 February 2013

The Politics of Data

People are finding new and interesting ways to use their computers to enrich their lives. Many are choosing to be politically active, forming discussion groups to exchange ideas and challenge existing rhetoric. They form alliances and pressure groups. The internet is doing more to bring grass-roots politics to the public than anything that has gone before.

Online petitions have become very popular. There are websites and organisations like avaaz.org and change.org that promote ideas to their members and collect electronic 'signatures'. It only takes a politician announcing an unpopular piece of legislation on television for them to be handed a petition signed by tens of thousands of people from one of these websites within a couple of days.

This kind of instant feedback must be very useful for gauging public reaction and preventing policy mistakes. It is also doing a lot to re-engage politicians with a public who have often been marginalised by intense political lobbying from corporate and foreign interests with close access to government officials.

Although politicians must take these petitions seriously, they may have a good reason to challenge their validity. The question is about authentication - the process of making sure the person signing the online petition is who they say they are. 

Authentication of online petitions requires very little information - an email address, forename, surname and postcode. Anyone with access to an electoral roll and a large volume of email addresses could plausibly build an automated process to generate thousands of signatures on a petition website.

So politicians cannot be sure that all of the signatures are genuine. They could be falsified by corporate or political interest groups. They could also come from abroad: the internet is truly global, and people living outside your country can authenticate themselves if they know a valid postcode within it.

Personal authentication to an individual is certainly within the realms of possibility, with the use of biometric fingerprinting, RFID chips and identity cards. However, there are fundamental human rights and personal freedoms that need to be addressed.

Until we can find a solution to authentication that does not compromise our rights to privacy and freedom, online petitions may not be as effective as we hope.

Thursday, 21 February 2013

A morning with Talend's BIG DATA

On the 20th of February, Manchester played host to Talend's Big Data roadshow. It promised to introduce people to the basics of big data and showcase Talend's new versions of their open source software. 

The attendees were from varying industries throughout the North West of England. I was quite surprised to see some familiar faces from previous projects. It's a small world!

Ben Bryant was our technical presenter who ably took us through the usual introductions, and explained Talend's history as a leading supplier of open-source Data Integration, Data Quality and Master Data Management software. I was quite familiar with their products, but had not seen their big data solutions in action.

Once the basics were out of the way, Ben went on to explain the fundamentals of big data and how Talend's tools integrate into the current big data infrastructure. We were treated to a demonstration of an active set of slave and master servers, with Talend extracting, loading and analysing the data. An example of data profiling was also shown. We were introduced to a Hadoop infrastructure, Pig Latin and Hive - all in plain English, refreshingly free of jargon and acronym abuse.

My impression from the presentation is that Talend have a clear vision of the future, and that Big Data has a part to play in it. Their goal appears to be to make their tools as simple as possible to use with this new way of structuring and processing data. Education of customers has to be key, for it is the users who will find new and exciting ways to innovate using the technology. These innovations will find their way into future releases of Talend through their community of developers.

What I saw did not inspire me with an immediate need to build a big data solution, for this is still an emerging technology. However, it did say to me: "Hey, Rich. Big data is simpler than you think... Next time someone talks to you about unstructured data like emails, social media, images etc., why not have a go?"

Monday, 18 February 2013

A weekend with Talend

I am constantly surprised by the wealth of open-source software that is available for use. So when flu struck my family and me this weekend, putting paid to our plans, I decided to do some software evaluation. Talend have been occupying the 'visionary' side of Gartner's Magic Quadrant for Data Quality software for some time now.

Last year, I evaluated their Open Profiler, and found it useful, if a little clunky. But that is the tip of the iceberg. They also provide a Data Integration tool, which I decided to have a go with. 

The version I used was 5.2.1. Installation was simple: I merely extracted the zip file downloaded from Talend's website and selected the file that runs the software. There are Linux, OS X, Solaris and Windows options, all packaged and ready to run in 32-bit or 64-bit versions. The program is built on the well-known Eclipse graphical user interface, so it depends on having Java installed on your machine.

Once opened, the program took some time to fully load all of the tools, but when I selected them from the panel on the right, I could see why. There is just about everything you need to be a one-man data integration specialist, including enough JDBC connectors to let you connect to just about any database.
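
If you have never used JDBC directly, here is a minimal sketch of the sort of plumbing these connectors wrap up for you. Everything in it is illustrative: it assumes a local MySQL instance, placeholder credentials and a made-up "customers" table, and it needs the MySQL Connector/J driver on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder connection details - substitute your own instance
            String url = "jdbc:mysql://localhost:3306/dummydb";
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT id, name FROM customers")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + ": " + rs.getString("name"));
                }
            }
        }
    }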

The database I chose was an old instance of MySQL that had been on my computer for some time. I set up some dummy data and dived right in.

The whole package strikes just the right balance between simplicity and configurability. Extracting, parsing, joining and transforming data is very straightforward. The way the program deals with type 1, type 2 and type 3 slowly changing dimensions is fantastic; that function alone makes it an outstanding piece of work that should save you a huge amount of development time. All of the modular jobs can export their results, and the details of any errors, back into the database of your choice, or into files. This means you can produce comprehensive management information on the efficiency of your processes.
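
To see why that matters, consider what a type 2 change involves: history is kept by closing off the current dimension row and inserting a new version. Hand-coded over JDBC, it looks something like the sketch below - the dim_customer table and its columns are my own invention, not Talend's.

    import java.sql.Connection;
    import java.sql.Date;
    import java.sql.PreparedStatement;

    public class Scd2Sketch {
        // Type 2 change: close the current row, then insert the new version.
        // Table and column names are illustrative only.
        public static void changeAddress(Connection conn, int customerId,
                                         String newAddress, Date today) throws Exception {
            try (PreparedStatement close = conn.prepareStatement(
                    "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                    + "WHERE customer_id = ? AND is_current = 1")) {
                close.setDate(1, today);
                close.setInt(2, customerId);
                close.executeUpdate();
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO dim_customer (customer_id, address, valid_from, is_current) "
                    + "VALUES (?, ?, ?, 1)")) {
                insert.setInt(1, customerId);
                insert.setString(2, newAddress);
                insert.setDate(3, today);
                insert.executeUpdate();
            }
        }
    }

A type 1 change would simply overwrite the old value in place, and a type 3 change would keep the previous value in an extra column. Talend handles all of this fiddly, error-prone logic for you.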

Once you have built your DI jobs, you can export them as self-contained programs that can be deployed within the package or platform of your choice. As long as Java is installed, the jobs will run. Before the weekend was out, I had a fully functioning, scheduled data warehouse, with a comprehensive detail layer and a presentation layer of summarised MI, ready to plug an OLAP portal into.

There are some limitations to this package. If you are to work with a group of analysts, a shared repository is vital. However, you have to get the enterprise version for that, and it doesn't come cheap. But leaving that aside, Talend have to be congratulated for putting together quite an impressive piece of Data Integration software. Honestly, I just can't believe it's free.

Next week I will be attending Talend's roadshow to see their new developments in the data science discipline of 'Big Data'. I will let you know how it goes.

Saturday, 9 February 2013

Data lineage - a cautionary tale

Recently, laboratories in Ireland discovered that some processed beef contained horse meat. The public were outraged, and the Food Standards Agency insisted that all processed beef sold at retail in the UK be DNA-tested for horse meat. Some high-profile branded processed beef products have been found to contain 100% horse meat.

Now the general public are very concerned that they have been eating food that could have been contaminated with chemicals that are used in the rearing of horses.

A senior politician was quoted as saying, "We need to know the farmer and the meat processor."

It seems that the modern processing of meat has become a very complicated business, with different parts of animals being moved from one company to another. No-one truly knows where their processed meat comes from. 

You may be surprised to hear that there are many companies who deal with data in a similar way to the meat processing industry. They may know where the data is manufactured, and where the results appear in reports, but it is the processing in the middle that they don't understand.

Data may arrive in a database, then get extracted, parsed, standardised and moved from one mart to another. It may be summarised and moved into multiple spreadsheets, where adjustments are made manually, and then re-extracted into other systems before finally finding its way into a report. The full map of systems, processes and departments involved in the processing chain may not be known by any one person in the organisation. It is also unlikely that any of it is written down!
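
Writing it down need not be complicated. As a purely illustrative sketch (every system and department name below is invented), even a simple structure recording each hop - where the data sits, what happens to it, and who owns that step - would be a vast improvement on nothing at all:

    import java.util.ArrayList;
    import java.util.List;

    public class LineageSketch {
        // One hop in the processing chain. All names are illustrative only.
        static class Step {
            final String system, process, owner;
            Step(String system, String process, String owner) {
                this.system = system;
                this.process = process;
                this.owner = owner;
            }
        }

        public static void main(String[] args) {
            List<Step> lineage = new ArrayList<Step>();
            lineage.add(new Step("policy_admin_db", "nightly extract", "IT"));
            lineage.add(new Step("staging_mart", "standardise and join", "MI team"));
            lineage.add(new Step("finance_spreadsheet", "manual adjustments", "Finance"));
            lineage.add(new Step("board_report", "summarise", "MI team"));
            for (Step s : lineage) {
                System.out.println(s.system + " -> " + s.process + " (owner: " + s.owner + ")");
            }
        }
    }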

The understanding of how data is manufactured, processed, stored and used is called 'data lineage'. In the financial industry, the problem of companies not knowing their data lineage is already being addressed by the EU's Solvency II directive. Although it is not yet in force, understanding your data lineage is well on its way to becoming a legal requirement.

If your company is large, tracking your lineage may be an expensive business. Certainly, the software is very expensive. Such costs may be hard to justify in the present financial climate, but doing it now, on your own terms, is far cheaper than waiting for an angry public and government legislation to force you to do it.

Sunday, 3 February 2013

BI Centralisation - The Challenges

Perhaps your company has acquired a number of other businesses, or is still suffering from the remnants of older ways of working. Either way, a quick look around many companies will show that management information is being generated in many different areas, with varying levels of accuracy.

This can cause a great many problems, with multiple, conflicting versions of the same measure being manufactured throughout the organisation.

Many organisations are looking to build Business Intelligence competency centres by pooling resources, systems and processes into one area. This has many advantages, but you may encounter some fierce opposition. Here are the top reasons why people will oppose your plans:

1.  Exclusivity
People like to manufacture their own MI because it gives them a first look at the figures before everyone else. So if you're in the sales department and you manufacture the sales figures, you see them first. You can start thinking up excuses as to why you haven't hit your targets way before anyone else knows about the results.

2.  The illusion of control
I'm not sure why this happens, but departments like to control their MI because, somewhere in their heads, it implies that they can control the business itself. In truth, manufacturing your own MI only brings real benefits if you do not have a data quality department to do the job properly.

3.  Analysts have become too powerful
Very often the MI analysts know more, and make more of the business decisions, than the managers do. If those analysts were sucked into a centralised department, or made redundant, the manager would lose his or her competitive advantage.

4.  You can bury bad news
Once you control your MI, there is a great temptation to publish only the data that supports the story you want to tell. If any of your data contradicts the narrative, then it's deemed unimportant and often left out. Part of the climate sceptics' argument, for example, is that climate scientists omit the results that do not fit with their hypotheses.

Centralising all management information functions brings a lot of important synergies to medium and large companies. But more importantly, it takes the figures out of the control of the departments who have a vested interest in their results. As a result, conflicts of interest occur less, and the data is queried and presented fairly.

To ensure this happens, it is vital that a Business Intelligence Competency Centre runs almost as a separate entity, free of the political control of vested interests elsewhere in the organisation.