I think everyone would agree that big data is the new silver bullet. There are a lot of promises out there about what it will do for you. As I start on this journey myself, I would like to share my thoughts. I reserve the right to be wrong and to change my mind as more information becomes available.
First, a little history. I started down this path about two years ago, when I wanted to consolidate some spatial information to help our sales teams drive business more accurately within certain areas. This means that the bulk of my big data exposure so far has been around spatial information. I also think that, going forward, spatial information is the most powerful part of big data. My reason is simple: a data correlation that works in one region may not work in another for any number of reasons (income, geopolitics, taboos, culture). I did a lot of research into GIS (Geographic Information Systems) and how to aggregate and consolidate data. I looked at PostGIS, ArcGIS, GeoServer and, of course, Google Earth. Along this road I came to two conclusions:
- Everyone has data and most of it is big and dirty
- Big data doesn’t miraculously give you answers
So let's have a look at number 1. Does everyone really have big data? Yip. I ran a business for seven years and accumulated a paltry 1.5 MB of financial data, 2.6 GB of mail and about 200 MB of documentation. Not bad for a mostly one-man show. Take an SME of, say, 10 people, and look at their call history metadata, who called whom and when (you know, the stuff the NSA has been keeping on everyone): that could easily be 10,000 Call Data Records (CDRs). Take an ISP and the picture gets messy very quickly: VoIP metadata, network performance stats, outage stats, throughput stats. This gets powerful at two levels. The first is analysis across disparate systems in a single organization. The second, where it gets really cool, is aggregation of data from different organizations in different fields. What does the agricultural distribution in the Northern Cape do to the colour of second-hand cars?
The second item, I think, flies in the face of most data scientists I have chatted to and read about. Douglas Adams is the father of a lot of pop science, and interestingly he saw this coming: the answer to life is 42, but what does it mean? Maybe we didn't know the question. Big data needs to be asked a question before it can give you any answer. The question you ask may be wrong, and you should fully expect it to change as you refine it, but you must start with an initial question. This is your first pass at the data, and it will begin with boring information. Santam did a really interesting thing with a recent radio ad: "On Wednesday our stats show that you are more likely to..." really looks like big data at work.
So my journey has started in earnest. I am currently attending the Gigaom Predictive Analytics webinar. I think the biggest takeaway is that the market in general expects structured historical data mining to already be happening. The scary thing is that it is not happening in South Africa yet, or more specifically, not at any meaningful scale.
The other main revelation is that the most successful companies in big data rely on end-user generated content. Your end user may be the Facebooker or YouTuber, but more interestingly could be the car dealership updating live vehicle pricing. This means that while the big names like Google and Amazon are really good at the single-person interaction, the potential leaders will be a lot more low key: Sage/Pastel, SAP and the like. I think one of the big untapped markets here is the move to low-cost cloud solutions for SMMEs. Many ERP systems are doing this, but the key is the cross-industry analytics that can be done on these platforms.
Predictive analytics is not about getting it right, but about getting it a little more right than the next guy. You don't have to run faster than the bear, just faster than the next guy. The growth here is going to be huge, and anyone not already playing in this area will be unable to compete. Big data is the next disruptive technology, and analytics, predictive or otherwise, is the accelerator.
One of the examples in the webinar was cell tower placement as an operational efficiency. This ties back to spatial analytics, which I think is the core of operational efficiency in business. It always has been; remember: "location, location, location".
I've written before about the key value aspects of data being spatial and temporal, and I have been doing a lot of reading on the spatial side of structured data.
To start with I wanted a basic understanding of GIS. This is the core of most of the cutting-edge spatial analytics available, and probably well out of reach of most big data work. It has been interesting because a lot of the analytics are discrete (point information forming a grid), while some are interpolated into smooth surfaces. The other aspect of this is kernelling and clustering, which is key to big data.
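To make the discrete-versus-smooth distinction concrete, here is a toy sketch in plain Python. The sale coordinates, grid size and bandwidth are all invented for illustration; real work would use a proper GIS or stats library.

```python
import math
from collections import Counter

# Fake sale locations as (x, y) points, clustered in two spots
points = [(1.1, 1.0), (1.2, 0.9), (0.9, 1.1), (4.0, 4.1), (4.2, 3.9)]

# Discrete view: snap each point to a 1x1 grid cell and count per cell
grid = Counter((int(x), int(y)) for x, y in points)

# Smooth view: a Gaussian kernel density estimate at any query point
def kde(query, pts, bandwidth=0.5):
    """Average of Gaussian kernels centred on each observed point."""
    qx, qy = query
    total = 0.0
    for x, y in pts:
        d2 = (qx - x) ** 2 + (qy - y) ** 2
        total += math.exp(-d2 / (2 * bandwidth ** 2))
    return total / (len(pts) * 2 * math.pi * bandwidth ** 2)

# Density near the first cluster is higher than in the empty middle
print(kde((1.0, 1.0), points) > kde((2.5, 2.5), points))  # True
```

The grid counts are the "point information forming a grid"; the kernel estimate is the interpolated smooth surface over the same raw data.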
Map Step – emit the location of each widget sold: (widget, (location, 1))
Reduce Step – sum widget sales per area
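The two steps above can be sketched in plain Python (the sales records and area names are invented for illustration):

```python
from collections import defaultdict

# Raw sales records: (widget, location) pairs
sales = [
    ("widget-a", "Gauteng"),
    ("widget-a", "Gauteng"),
    ("widget-b", "Western Cape"),
    ("widget-a", "Western Cape"),
]

# Map step: emit ((widget, location), 1) for every sale
mapped = (((widget, location), 1) for widget, location in sales)

# Reduce step: sum the 1s per (widget, location) key
totals = defaultdict(int)
for key, count in mapped:
    totals[key] += count

print(totals[("widget-a", "Gauteng")])  # 2
```

On a real cluster the map and reduce steps run in parallel across many machines, but the logic per key is exactly this.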
Now I know where each widget was sold. This allows many aspects of the business to be optimised:
- Logistics – how many widgets should I be moving to each of these areas?
- Sales Planning – if widgets are not selling well can we run specials?
- Resourcing – do certain widgets sell better in certain areas?
- Bundling – are people buying two different widgets frequently in one area and not in another? Can this be exploited?
- Layout – are there sales gaps (zero widget sales) between two high use areas?
One of the fundamentals of spatial analytics is the ability to visualize it effectively. I have some faith in computers, but they need to be watched; while I'm on this journey I will need to see things to believe them. That gap in the last point is the core of big data; the rest is pretty much analysis of structured data you already have. You know where your stores are and what your sales are. Even the smallest enterprise has this data. The clincher, to take this to the next level, is to map my data against a few more datasets:
- SARS demographics – what is the disposable income (higher tax brackets) in the areas where widgets are selling well or badly?
- Census data – are there gaps where there is population?
- Climate data – do certain widgets sell better in warmer areas?
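A hedged sketch of the cross-dataset idea: join per-area widget sales to an external per-area figure (a made-up income column standing in for the SARS or census data above) and check the correlation. All numbers here are invented.

```python
# Two datasets keyed on the same area codes
sales = {"A": 120, "B": 45, "C": 200, "D": 30}
income = {"A": 35000, "B": 18000, "C": 52000, "D": 15000}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# "Join" on the shared area key, then correlate
areas = sorted(sales)
corr = pearson([sales[a] for a in areas], [income[a] for a in areas])
print(corr > 0.9)  # True: strong positive correlation in this toy data
```

Correlation is not causation, of course, but a strong signal like this is exactly the kind of first-pass question worth asking of the joined data.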
Data is king, the question is queen, but correlation is the ace.
Most companies are sitting on veritable gold mines of data. A lot of it is a waste by-product of daily operations; in fact, the bulk of the data is waste to that particular organization. The big trick is to find who needs that data, package it and resell it, even if you resell it back to your own organization.
I was chatting to a bank about forex trading recently, and we got onto my move into big data. The usual questions came up: "What is this big data thing?" and "How will it help us?". I decided to approach it from the structured data side. Here in South Africa we have the FICA regulations (Financial Intelligence Centre Act), which dictate that in order for a bank to do business with an individual it must hold certified proof of residence (an electricity bill, rental agreement, etc.). This bank was collecting the information under legal requirement but was not doing anything with it. My suggestion was to spatially tag the data (convert each address into rough GPS coordinates; accuracy is not critical) and correlate it with the branches, for which they already have coordinates. Using some basic techniques they could then decide more efficiently where to open and close branches based on client density. There is a caveat: you get the end user's home address and not necessarily their work address, which may be more convenient for them. But it starts giving you general information about your geographic spread.
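A minimal sketch of the branch-density idea. It assumes the client addresses have already been geocoded to rough coordinates (the geocoding step itself is omitted); all branch names and coordinates below are invented.

```python
import math
from collections import Counter

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Branch coordinates the bank already has (invented)
branches = {"Sandton": (-26.107, 28.056), "Cape Town CBD": (-33.924, 18.424)}

# Roughly geocoded client home addresses (invented)
clients = [(-26.1, 28.05), (-26.2, 28.0), (-33.9, 18.4)]

# Assign each client to the nearest branch and count clients per branch
clients_per_branch = Counter(
    min(branches, key=lambda b: haversine_km(*c, *branches[b]))
    for c in clients
)
print(clients_per_branch["Sandton"])  # 2
```

The counts per branch are exactly the client-density signal: a branch with few nearby clients is a candidate for closure, and a dense cluster far from any branch is a candidate for a new one.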
I have already built a basic system to do this type of analysis and will be setting up some tests to demo to clients. If you would like a demo, please drop me a mail at info_at_idix.co.za.
I have just been playing around with RapidMiner (www.rapidminer.com), looking at how efficient it is at pulling useful stats from some TMS data I had lying around. I must say: wow. This tool can be applied to any little piece of analytics you need. RapidMiner also purchased Radoop last month for an undisclosed amount, which means you can develop your analysis in RapidMiner Studio and deploy it onto Cloudera (or the Hadoop flavour of your choice) via Radoop. The mind boggles. This is really cool tech.