06 Dec

Data Science and the day of the week

In South Africa we have a lot of public holidays. We like celebrating things. This leads to very interesting behaviour around these public holidays. For instance, the dates for Easter are hugely important for retail. If Easter is late, it falls far enough away from the March holidays that the leave is dispersed over a two-month period. If Easter is early, people take the March and Easter holidays as one and can be away for 20-25 days with minimal impact on leave days taken. Also, if your financial year ends in March, an early Easter one year followed by a late Easter the next may leave you with no Easter in a financial year. A late Easter followed by an early Easter may give you two in one year. South Africans also like the sea. This causes a very interesting problem: people may invade the coastal towns for an extended period or not go at all, depending on how these holidays fall. If you are a retailer, then an early Easter kills trade in Johannesburg and boosts it in the coastal regions.
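The zero-or-two Easters effect is easy to check with a few lines of Python. This is a minimal sketch: it computes Easter Sunday with the widely published anonymous Gregorian (Meeus/Jones/Butcher) algorithm and counts how many fall inside a financial year. The function names, and the assumption that the financial year runs from 1 April to 31 March, are made up for this illustration.

```python
from datetime import date


def easter_sunday(year):
    """Gregorian Easter Sunday (anonymous Meeus/Jones/Butcher algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month = (h + l - 7 * m + 114) // 31
    day = (h + l - 7 * m + 114) % 31 + 1
    return date(year, month, day)


def easters_in_financial_year(start_year):
    """Easter Sundays in the year 1 April start_year to 31 March start_year + 1.

    The April-to-March financial year is an assumption for this sketch.
    """
    fy_start = date(start_year, 4, 1)
    fy_end = date(start_year + 1, 3, 31)
    candidates = (easter_sunday(start_year), easter_sunday(start_year + 1))
    return [d for d in candidates if fy_start <= d <= fy_end]


for fy in (2015, 2016, 2017):
    print(fy, [d.isoformat() for d in easters_in_financial_year(fy)])
```

On this definition the financial year starting April 2015 contains two Easter Sundays (5 April 2015 and 27 March 2016) and the one starting April 2016 contains none.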

In machine learning the concept of categorizing days is simple. Is it Sunday? Yes? Great, then Sunday is 1 and every other day is zero. Is it Monday? Then Monday is 1 and everything else is zero. This approach has a problem: what is the difference between a Sunday and a Monday? Well, you work on one and not on the other. Right? Sort of. What if the Monday is a public holiday? Even Google Maps says it doesn’t take into account public holidays. Whoops. Here we have another issue: what if the Tuesday is the public holiday? Well, then the Monday may be flying at half mast and is thus not a normal Monday but not quite a public holiday. The Wednesday, Thursday and Friday are also not normal.

I had a quick pass at this problem a while back and came up with a reasonable (and I’ll agree not perfect) way to measure the day rather than categorise it. This means I can infer traffic and sales based on what type of day it is rather than the day of the week. The model works for the Monday to Friday 8-5 workers, so you’d need to check if that works for you.

So, firstly we calculate whether it is a work day or not. It is a work day if it is Monday to Friday and not a public holiday. It is a public holiday if it is a gazetted public holiday on a fixed date (New Year’s Day, Youth Day etc.) or it is the Friday before Easter Sunday or the Monday after Easter Sunday as allocated by the Catholic Church. This gives you a good distribution of ones and zeros. We’re already better than Google because we can say public holiday or not. Next, people are more likely to take a Monday off if the preceding Friday is a public holiday, and almost guaranteed to if the Tuesday is a public holiday. It’s like we’ll forget all our work if we work on those days. If the Wednesday is a public holiday some people will take the first two days (Monday and Tuesday) and some will take the last two days (Thursday and Friday). So what we can now do is score a day as follows (and the values are arbitrary but consistent):

If the day is a non-work day, add 1. If the day directly preceding or succeeding it is a non-work day, add 1/1.25 (arbitrary divisor). Continue this boundary calculation to 5 days either side. Saturday will always be next to Sunday and vice versa, so the week in the middle of a normal set of three weeks will have a Saturday and Sunday score of 1.8 and a weekday score below 1.8. If there are public holidays around, the weekday score creeps up, as does the weekend score. This means that you can start predicting performance against “nobody is going to be around Monday” and “everyone is at work Wednesday” by using a continuous measure for the kind of day rather than a weekday. Here is a screenshot of the Easter period 2015-2035 to show how this can be done (2018 and 2027 are going to have long early breaks) and why April is almost always a write-off in South Africa.
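The scoring can be sketched in Python as follows. The post does not say how the contribution shrinks beyond the adjacent day, so this sketch assumes a non-work day at distance d adds 1/(1.25 × d). That is only one reading, chosen because it reproduces the 1.8 weekend score and keeps ordinary weekdays below 1.8. The function names are hypothetical, and the holiday set holds only Good Friday and Easter Monday 2018 rather than a full gazetted list.

```python
from datetime import date, timedelta

DIVISOR = 1.25  # the arbitrary divisor from the text
WINDOW = 5      # days considered on either side

# Example holidays only: Good Friday and Easter Monday 2018.
EASTER_2018 = {date(2018, 3, 30), date(2018, 4, 2)}


def is_work_day(day, public_holidays):
    """Monday to Friday and not a public holiday."""
    return day.weekday() < 5 and day not in public_holidays


def day_score(day, public_holidays):
    """Continuous 'kind of day' score; higher means more people are away."""
    score = 0.0 if is_work_day(day, public_holidays) else 1.0
    for distance in range(1, WINDOW + 1):
        for neighbour in (day - timedelta(days=distance),
                          day + timedelta(days=distance)):
            if not is_work_day(neighbour, public_holidays):
                # Assumed weighting: contribution shrinks with distance.
                score += 1.0 / (DIVISOR * distance)
    return score


# Score the days around Easter 2018, with and without the holidays.
day = date(2018, 3, 28)
while day <= date(2018, 4, 3):
    print(day.isoformat(),
          round(day_score(day, set()), 2),
          round(day_score(day, EASTER_2018), 2))
    day += timedelta(days=1)
```

Swapping the weighting for something else (a geometric decay, say) changes the numbers but not the idea: the score is a continuous feature that can replace the seven one-hot weekday columns.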

Part 2:

So I accidentally put this in BigQuery and Tableau (well, because I could) and started slicing the data around. I noticed the best way to explain to the English sales folk who hound us in April that April is a throwaway month. Here is the same data aggregated for 40 years showing the worst case leave scenario. Keep in mind these can be used for sales (FMCG for the coastal regions will spike over leave periods, FMCG for inland will tend to dip) and various other scenarios.

23 Sep

Markov Chains

So I have been doing a fair amount of work on Markov Chains. They are really cool for modelling random processes. Earlier this year I wrote a program that would take a book (it was “Starship Titanic” by Douglas Adams) and convert it into a JSON stored Markov chain. All it did was say: if the current word is this, the next word will probably be that. The really interesting component here is that if you fed it enough text and then ignored/hashed the words, you would be left with a data structure that would be a fingerprint of the author/language/time period which could then be matched to other bodies of work/languages/time periods. Think about it . . . you would be able to identify a text’s author, the date it was written or the language purely based on the structure and not necessarily the words.
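A minimal Python sketch of that kind of word-to-next-word chain, stored as JSON, might look like this. It is not the original program: the function name is hypothetical and the sample sentence is a placeholder, not text from the book.

```python
import json
from collections import Counter, defaultdict


def build_chain(text):
    """Map each word to the probabilities of the words that follow it."""
    words = text.split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    chain = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        chain[current] = {w: n / total for w, n in followers.items()}
    return chain


# Placeholder text; a real run would read a whole book from a file.
sample = "the cat sat on the mat and the cat slept"
chain = build_chain(sample)

# Only transitions that were actually seen are stored, so the JSON stays sparse.
chain_json = json.dumps(chain, indent=2, sort_keys=True)
print(chain_json)
```

Replacing each word with a hash before building the chain would keep the transition structure while discarding the vocabulary, which is the fingerprint idea.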

I quickly realized that there are caveats and limitations to standard Markov chains.

Firstly, as languages go, the next word that appears may be very different based on context. I managed to get readable text between 5 and 10% of the time with what I will now call a first order Markov chain. What happens if you increase the order? Readability goes to 30 to 50%. The trick is that the map becomes an order of magnitude more complicated because now your chain requires the current word AND the previous word to know the probability of the next word. This is extensible to 3rd, 4th and higher orders as well. This could make your Markov chain into a neural net with the addition of the next part.
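As a rough sketch of what a higher order means in code, the state becomes a tuple of the last `order` words instead of a single word. The names and the sample text here are hypothetical, and the generator simply stops when it reaches a state it has never seen.

```python
import random
from collections import Counter, defaultdict


def build_chain(words, order=2):
    """Count which word follows each run of `order` consecutive words."""
    chain = defaultdict(Counter)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state][words[i + order]] += 1
    return chain


def generate(chain, start, length, rng):
    """Extend `start` by up to `length` words sampled from the chain."""
    out = list(start)
    order = len(start)
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break  # unseen state: nothing to sample from
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights)[0])
    return out


# Placeholder text, not from the book.
words = "the cat sat on the mat and the cat slept on the mat".split()
chain = build_chain(words, order=2)
print(" ".join(generate(chain, ("the", "cat"), 8, random.Random(0))))
```

With `order=1` the state `("the",)` can be followed by either "cat" or "mat"; with `order=2` the state `("on", "the")` is only ever followed by "mat", which is why higher orders read better and need a bigger map.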

Secondly, if you are using Markov chains as neural nets, you need to vary your probabilities based on either time decay or some other feedback function. This allows the chains to learn positive feedback behaviours and start to ignore negative feedback. There is a catch to this as well: your network has to consider 0 probability links as being linked, whereas standard Markov chains allow 0 probability links to be ignored (or effectively not there). This increases the storage space required as it becomes a Pn problem. n nodes have n x n connections (remember that a Markov chain has direction and a node can loop to itself). This scenario does not lend itself to JSON storage, which is essentially an efficient sparse matrix, but rather to a dense matrix storage method. A second order chain would require (n x n) x n connections, so this would not lend itself to higher order chains: each order adds a dimension. 1st order would be 2 dimensions, 2nd order 3, 3rd order 4 etc.
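To put numbers on the storage argument, here is a small Python sketch comparing the dense count (every possible link, including the zero-probability ones) with the number of links actually observed in a text. The function names and the sample text are hypothetical.

```python
def dense_entries(vocabulary_size, order):
    """Cells in a dense table: one dimension per word of state, plus the next word."""
    return vocabulary_size ** (order + 1)


def sparse_entries(words, order):
    """Distinct (state, next word) links that actually occur in the text."""
    return len({tuple(words[i:i + order + 1])
                for i in range(len(words) - order)})


# Placeholder text, not from the book.
words = "the cat sat on the mat and the cat slept on the mat".split()
n = len(set(words))

for order in (1, 2, 3):
    print(order, dense_entries(n, order), sparse_entries(words, order))
```

Even for this toy text with 7 distinct words, the dense table needs 49 cells at first order and 343 at second order, while only a handful of links are ever seen. With a real vocabulary of tens of thousands of words the dense form gets out of hand very quickly.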

The second case I have not yet done work on as I have no use-case at this time, but I’m sure I’ll get around to running a few tests in the next few months.

15 Sep

Internet of things and the Raspberry Pi

I had an interesting discussion today that I got really impassioned about. It happens sometimes. It was around the Raspberry Pi. Was it supposed to just be an educational tool? Maybe. Maybe not.

So there is a little project that I have built most of the components for and will probably put together over Christmas, but first a bit of background.

At the end of May my phone was stolen through my bedroom window. Really not a great experience. It was a good learning curve and I lost nothing save for one voice recording and the phone. Everything else was in the cloud. It also got me thinking about how I would build an alarm system from the ground up. I did some work with D-Latches and Shift Registers a while back so I was confident I could build the electronic side that would need to read 32 inputs or control 32 outputs using a serial bus. This would have turned my Arduino UNO into a PLC, which is nifty in itself. How would I control it? What transport mechanism would be reliable enough to deliver messages through it? It had to be standard stuff and have good encryption. What messaging service supports reliable delivery of short messages anywhere in the world over the internet and supports bi-directional communication? Any guesses?

Well, as time passed, we upgraded the electric fence, body corporate have installed passive sensors along the perimeter so my alarm idea went out the window.

Then we got some tropical fish. How do we monitor the temperature? Could we control it over the internet?

Each of the components I am about to describe I have built. I have not put all the building blocks together yet, but that is a matter of time.

  1. I built an interface from a PT100 temperature probe to an Arduino Uno, really easy and really accurate. One LM324 (a bit of overkill but I couldn’t get the LM741 to behave) and 4 resistors. Check the Art of Electronics for a differential Op-Amp circuit. This output a conditioned signal that allowed me to read the temperature within a few tenths of a degree. There is a trick I learned long ago: you can build very expensive analogue electronics to get the right signal or you can do all the processing once digitized. Guess which one is cheaper these days?
  2. You can use the I²C interface between a Raspberry Pi and an Arduino Uno. The Arduino is nice because of the built-in 10 bit Analog to Digital converter. The Pi is nice because of the next few pieces.
  3. The Pi supports Wifi and 3G. Mine is Wifi enabled using a R89 dongle. So my Pi connects over the internet regularly for updates etc.
  4. A really great piece I built was a Raspberry Pi Twitter bot that could send and respond to direct messages. You can send and receive commands to and from the Pi over Twitter.

Now the really cool reason I settled on Twitter as the message mechanism is twofold:

  1. Twitter allows searching, automatically archives data (think logging), time stamps and geo-codes (if enabled).
  2. Twitter integrates into If This Then That which supports anything else web based. If you don’t know about this, IFTTT will change your world, you don’t even need to know how to program.

The second item is where this gets really interesting. You can start to time base your commands, tweets can trigger SMS messages or emails. You can geofence your phone and get your Pi to do things.

The essence of this project was to solve a simple problem but it illustrates the power of the internet of things. Imagine every house being able to tweet. Could you stop a crime wave if you had enough location information? Could you include the police on tweets for faster response? If you used this to measure a fish tank could you provide live data sets for ichthyologists the same way FitBit does for humans? This is the source of the real-world data avalanche. This is the data that describes the world we live in. Why should financial institutions have all the fun?

One thing is certain, we live in an interesting time where you can follow spacecraft that have landed on comets on twitter and drones deliver ice-creams. Could your pet fish order their next meal by drone? It is no longer science fiction.

16 Jul

Big Data and Banking

I went to a very interesting talk by Dr Usama Fayyad last night at GIBS. I was a bit out of place as a technologist in a room full of actuaries and MBAs but the talk was interesting.  Aside from the big data stuff which was fairly generic and if you have worked a bit with Hadoop, RapidMiner, R and the like not very cutting edge, it was great to see comparisons in industry uptake. The online cows presentation was awesome, AI meets AI. And you have to hear the presso to get that joke.


The key takeaways I got from the meeting were that most companies have no idea what they are sitting on and most companies are behind in their implementations.

My favourite quote for the day was “we are providing painkillers, not vitamins, we fix broken things, not make things better.” Fix a problem to make the business work before you make things better. Caveat: there will always be something to fix.

05 May

The Beginning of the Big Data Journey

I think everyone would agree that big data is the new silver bullet. There are a lot of promises out there on what this will do for you. As I start on this journey myself I would like to share my thoughts on this. I reserve the right to be wrong and change my mind as more information becomes available.

First, a little history. I started down this path about two years ago when I wanted to consolidate some spatial information to assist our sales teams in more accurately driving business within certain areas. This means that the bulk of my big data exposure thus far is around spatial information. I also think that going forward spatial information is the most powerful part of big data. My reason is simple, what works with my data correlation in one region may not in another region for any number of reasons (income, geo-political, taboos and culture). I did a lot of research into GIS (Geographic Information Systems) and how to aggregate and consolidate data. I checked on PostGIS, ArcGIS, GeoServer and Google Earth of course. Along this road I came to two conclusions:

  1. Everyone has data and most of it is big and dirty
  2. Big data doesn’t miraculously give you answers

So let’s have a look at number 1. I say that everyone has big data? Yip. I ran a business for 7 years. I accumulated a paltry 1.5 MB of financial data, 2.6 GB of mail and about 200 MB of documentation. Not bad for a mostly one man show. Take an SME, 10 people say, and take their call history metadata, who called whom and when (you know, the stuff the NSA has been keeping on everyone), and that could give you 10000 Call Data Records (CDRs). Take an ISP and the picture gets messy really quickly: VoIP metadata, network performance stats, outage stats, throughput stats. There are two levels where this gets powerful. The first is analysis across disparate systems in a single organization. The second is where it gets really cool: aggregation of data from different organizations in different fields. What does the agriculture distribution in the Northern Cape do to the colour of second hand cars?

The second item I think flies in the face of most data scientists I have chatted to and read about. Douglas Adams is the father of a lot of pop science but interestingly he saw this coming: the answer to life is 42, but what does it mean? Maybe we didn’t know the question. Big data needs to be asked a question before you can get any answer. The question you ask can be wrong and you must fully expect it to change as you refine it, but you will be asking an initial question. This is your first pass at the data. It will begin with boring information. Santam did a really interesting thing with a recent radio ad. “On Wednesday our stats show that you are more likely to . . ” really looks like big data.

05 May

Predictive Analytics

So my journey has started out in earnest. I am currently attending the Gigaom Predictive Analytics webinar. I think the biggest takeaway is that the market in general expects that structured historical data mining is already happening. The scary thing is that this is not happening in South Africa yet, or more specifically, not at any meaningful scale.

The other main revelation is that the most successful companies in big data rely on end user generated content. Your end user may be the facebooker or youtuber but more interestingly could be the car dealership updating live vehicle pricing. This means that the big names like Google and Amazon are really good at the single person interaction, but the potential leaders will be a lot more low key, like Sage/Pastel, SAP and the like. I think one of the big untapped markets here is the move to low cost cloud solutions for SMMEs. Many ERP systems are doing this but the key is the cross industry analytics that can be done on these platforms.

Predictive analytics is not about getting it right, but getting it a little more right than the next guy. You don’t have to run faster than the bear, just faster than the next guy. The growth in this is going to be huge and anyone not already playing in this area will be unable to compete. Big data is the next disruptive technology and the analytics, predictive or otherwise, is the accelerator.

One of the examples in the webinar was cell tower placement under operational efficiencies. This ties back to spatial analytics which I think is the core around operational efficiencies in business. It always has been, remember: “location location location”.

Chat soon

05 May

Spatial Analytics

I’ve written before around the key value aspects of data being spatial and temporal. I have been doing a lot of reading on some of the spatial side of structured data.

To start with I wanted a basic understanding of GIS. This is the core of most of the cutting edge spatial analytics available and probably way out of the reach of big data. It has been interesting because a lot of the analytics are discrete (point information forming a grid) but some is interpolated smooth information. The other aspect of this is kernelling and clustering which is key to big data.
Follow this:
Map Step – give the location of each widget sold (widget, (location,1))
Reduce Step – Sum widget sales per area
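In plain Python, with no Hadoop involved, the two steps could be sketched like this. The sales records, widget names and function names are all hypothetical; the point is only the shape of the map output and the reduce.

```python
from collections import defaultdict

# Hypothetical sales records: (widget, area where it was sold).
sales = [
    ("widget_a", "Johannesburg"),
    ("widget_a", "Durban"),
    ("widget_b", "Durban"),
    ("widget_a", "Johannesburg"),
]


def map_step(sale):
    """Emit (widget, (location, 1)) for one sale."""
    widget, location = sale
    return widget, (location, 1)


def reduce_step(mapped):
    """Sum the emitted counts per widget per area."""
    totals = defaultdict(lambda: defaultdict(int))
    for widget, (location, count) in mapped:
        totals[widget][location] += count
    return {widget: dict(areas) for widget, areas in totals.items()}


totals = reduce_step(map(map_step, sales))
print(totals)
```

On a real cluster the framework would shuffle the mapped pairs by key between the two steps; here the reduce simply walks the whole list.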

Now I know where each widget was sold. This allows many aspects of the business to be optimised:

  1. Logistics – how many widgets should I be moving to each of these areas?
  2. Sales Planning – if widgets are not selling well can we run specials?
  3. Resourcing – do certain widgets sell better in certain areas?
  4. Bundling – are people buying two different widgets frequently in one area and not in another? Can this be exploited?
  5. Layout – are there sales gaps (zero widget sales) between two high use areas?

One of the fundamentals around spatial analytics revolves around the ability to visualize this effectively. I have some faith in computers but they need to be watched. While I’m on this journey I will need to see something to believe them. So this gap in the last point is the core of big data, the rest is pretty much the analysis of structured data you already have. You know where your stores are and what your sales are. Even the smallest enterprise has this data. The clincher to take this to the next level is to map my data against a few more datasets:

  1. SARS demographics – What is my disposable income (higher tax brackets) in the area where widgets are selling well or badly?
  2. Census data – are there gaps where there is population?
  3. Climate data – do certain widgets sell better in warmer areas?

Data is king, questions the Queen but correlation is the Ace

05 May

Structured Data (That which you already have)

Most companies are sitting on veritable gold mines of data. A lot of this data is a waste by-product of daily operations. In fact, the bulk of the data is waste data to that particular organization. The big trick is to find who needs that data, package it and resell it, even if you resell it back to your organization.

I was chatting to a bank around forex trading recently. We were discussing my move into big data. The usual questions came up like “What is this big data thing?” “How will it help us?”. I decided to approach this from the structured data side. Here in South Africa we have FICA regulations (Financial Intelligence Centre Act) which dictate that in order for a bank to do business with an individual they must have certified proof of residence (electricity bill, rental agreements etc.). This bank was collecting this information under legal requirement but was not doing anything with it. My suggestion was to take this data and spatially tag it (convert it into rough GPS coordinates, accuracy is not critical) and correlate that information with the branches, for which they already have the coordinates. Using some basic techniques they could more efficiently decide where to open and close branches based on client density. There is a caveat that you get the end user’s home address and not necessarily their work address, which may be more convenient. It starts giving you general information about your geographic spread.
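One of the basic techniques could be as simple as assigning each tagged client to the nearest branch and counting. The Python sketch below does that with the haversine great-circle distance. The branch list uses approximate city-centre coordinates and the client points are invented, so treat every name and number as a placeholder.

```python
from collections import Counter
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Hypothetical branches at approximate city-centre coordinates.
branches = {
    "Johannesburg": (-26.2041, 28.0473),
    "Pretoria": (-25.7479, 28.2293),
    "Cape Town": (-33.9249, 18.4241),
}

# Invented client home locations, as rough (lat, lon) tags.
clients = [
    (-26.10, 28.05),
    (-26.15, 28.00),
    (-25.80, 28.20),
    (-33.95, 18.50),
]


def nearest_branch(client):
    """Name of the branch closest to a client location."""
    return min(branches, key=lambda name: haversine_km(client, branches[name]))


clients_per_branch = Counter(nearest_branch(c) for c in clients)
print(clients_per_branch)
```

A branch with very few nearby clients, or a cluster of clients a long way from any branch, is the kind of thing this surfaces. Rough coordinates are good enough because only the relative distances matter.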

I have already built a basic system to do this type of analysis on data and will be setting up some tests to demo to clients. If you would like a demo, please drop me a mail at info_at_idix.co.za


05 May

Big Data and the Ease of use

I have just been playing around with RapidMiner (www.rapidminer.com) and was looking at how efficient it is at pulling useful stats from some TMS data I had lying around. I must say, wow. This tool can be applied to any little piece of analytics you need. RapidMiner also purchased Radoop last month for an undisclosed amount. This means that you can take RapidMiner Studio and develop your analysis to deploy onto Cloudera (or the Hadoop flavour of your choice) via Radoop. The mind boggles. This is really cool tech.