06 Dec

Data Science and the day of the week

In South Africa we have a lot of public holidays. We like celebrating things. This leads to very interesting behaviour around these public holidays. For instance, the dates for Easter are hugely important for retail. If Easter is late it falls far enough away from the March holidays then the amount of leave is dispersed over a two month period. If Easter is early then people take both the March and Easter holidays as one and can be away for 20-25 days with minimal impact to leave days taken. Also, if your year end is March then if you have an early easter one year and late the next you may have no easters in a financial year. If you have a late Easter and an early easter than you may have two in one year. South Africans also like the sea. This causes a very interesting problem: all the people may invade the coastal towns for an extended period or not go at all depending on how these holidays fall. If you are a  retailer in then the early easter kills trade in Johannesburg and boosts it in the coastal regions.

In machine learning the concept of categorizing days is simple, is it Sunday? Yes, great then you’re 1 and everyone else is zero. Is it Monday? Then monday is 1 and everything else is zero. This approach has a problem: what is the difference between a Sunday and a Monday? Well, you work on one and not on the other. Right? Sort of. What if the Monday is a public holiday? Even Google Maps says it doesn’t take into account public holidays. Whoops. Here we have another issue: what if the Tuesday is the public holiday? Well, then the Monday may be flying at half mast and is thus not a normal Monday but not quite a public holiday. Also, the Wednesday, Thursday and Friday are also not normal.

I had a quick pass at this problem a while back and came up with a reasonable (and I’ll agree not perfect) way to measure the day rather than categorise it. This means I can infer traffic and sales based on what type of day it is rather than the day of the week. The model works for the Monday to Friday 8-5 workers, so you’d need to check if that works for you.

So, firstly we calculate whether it is a work day or not. It is a work day if it is monday to friday and not a public holiday. It is a public holiday if it is a gazetted public holiday on a fixed date (New Years day, Youth Day etc.) or it is the Friday before Easter Sunday or the Monday after Easter Sunday as allocated by the Catholic Church. This gives you a good distribution of ones and zeros. We’re already better than Google because we can say public holiday or not. Next, people are more likely to take a Monday off if the preceding Friday is a public holiday and almost guaranteed to the Tuesday is a public holiday. It’s like we’ll forget all our work if we work on those days. If the Wednesday is a public holiday some people will take the first two days (Monday and Tuesday) and some will take the last two days (Thursday and Friday). So what we can now do is score a day as follows (and the values are arbitrary but consistent):

If the day is a non-work day, add 1. If day day directly preceding it or succeeding it is a non-work day add 1/1.25 (arbitrary divisor). Continue this boundary calculation to 5 days either side (Saturday will always be next to Sunday and vice versa so the week in the middle of a normal set of three weeks will have a Saturday and Sunday score of 1.8 and a weekday score below 1.8. If there are public holidays around the weekday score creeps up, as does the weekend score. This means that you can start predicting performance against nobody is going to be around Monday and everyone is at work Wednesday by using a continuous measure for the kind of day rather than a week day. Here is a screen shot of the easter period 2015-2035 to show how this can be done, (2018 and 2027 are going to have long early breaks) and why April is almost always a write-off in South Africa.

Part 2:

So I accidentally put this in BigQuery and Tableau (well, because I could) and started slicing the data around. I noticed the best way to explain to the English sales folk who hound us in April that April is a throwaway month. Here is the same data aggregated for 40 years showing the worst case leave scenario. Keep in mind these can be used for sales (FMCG for the coastal regions will spike over leave periods, FMCG for inland will tend to dip) and various other scenarios.