05 May

The Beginning of the Big Data Journey

I think everyone would agree that big data is the new silver bullet. There are a lot of promises out there about what it will do for you. As I start on this journey myself, I would like to share my thoughts on it. I reserve the right to be wrong and to change my mind as more information becomes available.

First, a little history. I started down this path about two years ago, when I wanted to consolidate some spatial information to help our sales teams drive business more accurately within certain areas. This means that the bulk of my big data exposure thus far is around spatial information. I also think that, going forward, spatial information is the most powerful part of big data. My reason is simple: a data correlation that works in one region may not work in another, for any number of reasons (income, geopolitics, taboos and culture). I did a lot of research into GIS (Geographic Information Systems) and how to aggregate and consolidate data. I looked at PostGIS, ArcGIS, GeoServer and, of course, Google Earth. Along this road I came to two conclusions:

  1. Everyone has data and most of it is big and dirty
  2. Big data doesn’t miraculously give you answers

So let’s have a look at number 1. I say that everyone has big data? Yip. I ran a business for 7 years. I accumulated a paltry 1.5 MB of financial data, 2.6 GB of mail and about 200 MB of documentation. Not bad for a mostly one-man show. Take an SME of, say, 10 people and take their call history metadata: who called whom and when (you know, the stuff the NSA has been keeping on everyone). That could give you 10,000 Call Data Records (CDRs). Take an ISP and the picture gets messy really quickly: VoIP metadata, network performance stats, outage stats, throughput stats. There are two levels where this gets powerful. The first is analysis across disparate systems in a single organization. The second is where it gets really cool: aggregation of data from different organizations in different fields. What does the agricultural distribution in the Northern Cape do to the colour of second-hand cars?
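To make that first level a bit more concrete, here is a minimal Python sketch of combining two systems inside one organization. Everything in it is an assumption for illustration: the call records, the phone numbers, the region lookup and the function name `calls_per_region` are all made up, and a real CDR export would have many more fields.

```python
from collections import Counter

# Made-up call data records from a phone system: (caller, callee, timestamp).
cdrs = [
    ("0210000001", "0530000001", "2014-05-05T09:15"),
    ("0210000001", "0110000001", "2014-05-05T10:02"),
    ("0210000002", "0530000001", "2014-05-05T11:40"),
    ("0210000002", "0310000009", "2014-05-05T14:20"),
]

# Made-up export from a separate customer system: phone number -> region.
customer_region = {
    "0530000001": "Northern Cape",
    "0110000001": "Gauteng",
}


def calls_per_region(records, region_by_number):
    """Count calls by the region of the number that was called."""
    counts = Counter()
    for _caller, callee, _when in records:
        # Numbers missing from the customer system are grouped as "unknown".
        counts[region_by_number.get(callee, "unknown")] += 1
    return counts


print(calls_per_region(cdrs, customer_region))
```

Neither data set says much on its own. It is only once the phone system is tied to the customer system that "where are we calling?" becomes something you can ask, and the "unknown" bucket is a small taste of the dirty part.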

The second item, I think, flies in the face of what most data scientists I have chatted to and read about believe. Douglas Adams is the father of a lot of pop science, but interestingly he saw this coming: the answer to life is 42, but what does it mean? Maybe we didn’t know the question. Big data needs to be asked a question before it can give you any answer. The question you ask can be wrong, and you must fully expect it to change as you refine it, but you will be asking an initial question. This is your first pass at the data. It will begin with boring information. Santam did a really interesting thing with a recent radio ad. “On Wednesday our stats show that you are more likely to . . ” really looks like big data