Big Data: the miracle cure myth
In 2001 a Gartner paper (1) defined three challenges to data analytics: the amount of data available, the variety of data being collected, and the turn-round speed required by industry. The growing use of web and electronic communication in particular meant that the amount and range of data were increasing exponentially, and markets were changing faster, which meant that businesses wanted to make decisions more frequently. In short, the advent of the digital age was going to provide a bonanza of analytic opportunities, as long as the increasing size and width of data could be handled and the analysis completed quickly enough.
Out of this has been born the phrase Big Data, which refers to any data that cannot easily be housed and manipulated in conventional relational databases. Several erroneously inferred extensions to this definition are currently influencing the analytic strategy of companies of all sizes, with trade journals being particularly culpable:
1) Big Data is always good (‘The Future of Big Data’, Elon Uni, 2012; ‘Big Data- the good, the bad and the savvy’, Soprabanking, 2012; ‘Big Data- making complex things simple’, MIT, 2013 etc.)
2) If you don’t invest in Big Data yourself you will be Left Behind (Gartner, numerous articles; ‘Non Adopters in Big Data’, Big Data Weekly, 2013; ‘Why Big Data is the New Competitive Advantage’, Ivey Business Journal, 2013 etc.)
3) The huge size of data now means you can no longer visualise this new territory, and as such understanding it becomes irrelevant: ‘why’ gives way to ‘what’, and correlations, not causality, become the source of actionable insight. The world of analytics is changing radically (7).
4) There will quickly be a shortage of analysts with the talent and skills needed to extract actionable insight from this new area (McKinsey 2013 (2), SiliconAngle.com 2013 (3), etc.)
The inference is simple: Big Data will allow you to make better-informed decisions (without the need to actually understand the data) more quickly (though it will cost you more to do it, since hardware, software and analysts are all required), and if you don’t adopt a Big Data strategy you are automatically a dinosaur, and we know what happened to them.
Fortunately business leaders are not so easily swayed by the bombardment of rhetoric, despite comments to the contrary by the likes of Gartner (4), who admittedly, as a leading IT and analytics consultancy, have a vested interest in pushing a rose-tinted view. They have stated that “42% of IT leaders have either invested in Big Data or intend to invest this year”. Compare this with the stark warnings from SAS (5) (another IT and analytics consultancy), who report that just 12% are likely to adopt. It must be questioned why these figures are so different. Possibly Gartner and SAS are using different ‘Big Data’?
Some companies have reportedly benefited from investing in Big Data: Google, the Otto Group (which owns the online catalogue companies Freemans and Grattan) and the Nationwide banking group have all touted success through adopting Big Data analysis. It is also fair to say that Big Data is not a bad concept, but it has to be the right one for each subscriber rather than a ‘one-size-fits-all’ approach. What we don’t get to see is the cost/benefit trade-off. Big Data is a very complicated business, and that means it is also expensive.
The problem for a lot of companies is very much a chicken-and-egg one: how do you know how much value you can extract by examining and manipulating all your available data sources until you have done so? Forbes published an article entitled “3 Steps To Incorporate Big Data Into Your Small Business” (2013) which expressly advises installing all the kit first before looking at what benefits you may get out of it! I suggest a more considered approach should be taken before making the required investment.
Let’s look at the Big Data challenges again.
1) “Big Data will allow you to work with much more data, and therefore get much more actionable insight”. This assumption is unproven as a universal truth. The rationale behind sampling is that by taking a small, random proportion of the whole you can extrapolate the results to be indicative of the mean of everything in the universal group. This means you need less data to get the same results. Big Data carries an inherent assumption that N = All (6), which is untrue: it can only deal with items for which data exists, and for which data can be quantified in some manner. Social media activity, web browsing activity, demographic information and meteorological information can all be ‘datafied’, so they can all be counted. But clearly this ‘datafication’ may not describe everything in a universal set. For example, by tracking social media you automatically exclude the behaviour of anyone who does not leave a digital footprint.
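The sampling rationale above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ‘universal group’ is a made-up set of 100,000 skewed values, and the function name and sample size are hypothetical, not taken from any of the cited sources.

```python
import random
import statistics

def sample_vs_population(population_size=100_000, sample_size=1_000, seed=42):
    """Compare the mean of a full (hypothetical) population with a random 1% sample."""
    rng = random.Random(seed)
    # Assumed population: skewed values with a true mean of about 50.
    population = [rng.expovariate(1 / 50.0) for _ in range(population_size)]
    # Simple random sample without replacement.
    sample = rng.sample(population, sample_size)
    return statistics.mean(population), statistics.mean(sample)

pop_mean, sample_mean = sample_vs_population()
relative_error = abs(sample_mean - pop_mean) / pop_mean
print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
print(f"relative error:  {relative_error:.1%}")
```

The sample uses one hundredth of the data yet should land within a few percent of the full mean. Note that this only works because the sample is random: it says nothing about members of the universal group who never made it into the population list in the first place.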
The danger is that people will lose sight of the fact that data sets will always be incomplete, and will therefore have some degree of bias. At least with conscious sampling this limitation is understood. To illustrate the point: during the Second World War Abraham Wald was tasked with reviewing damaged planes coming back from sorties over Germany, to see which areas needed additional protection.
Wald found that the fuselage and fuel system of returned planes were much more likely to be damaged by bullets or flak than the engines. If this were a Big Data exercise, all the data would point to a requirement to beef up the protection on the fuselage and fuel systems. Wald’s actual response: protect the engines! The realisation was that he was only looking at partial data, namely planes that had successfully returned. Planes that got hit in the engine did not make it back to be ‘datafied’.
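The Wald story can be reproduced with a small simulation. It is a sketch only: the loss probabilities below are invented for illustration (they are not historical figures), and it assumes each plane takes exactly one hit in a randomly chosen section.

```python
import random
from collections import Counter

# Hypothetical chance that a hit in each section brings the plane down.
LOSS_PROBABILITY = {"engine": 0.8, "fuselage": 0.1, "fuel system": 0.2}

def simulate_sorties(n_planes=10_000, seed=1):
    """Return hit counts for all planes and for the planes that made it back."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    all_hits = Counter()
    returned_hits = Counter()
    for _ in range(n_planes):
        section = rng.choice(sections)  # hits are spread evenly across sections
        all_hits[section] += 1
        if rng.random() >= LOSS_PROBABILITY[section]:
            returned_hits[section] += 1  # only survivors can be inspected
    return all_hits, returned_hits

all_hits, returned_hits = simulate_sorties()
for section in LOSS_PROBABILITY:
    print(f"{section:12s} hit: {all_hits[section]:5d}  seen on returned planes: {returned_hits[section]:5d}")
```

Although every section is hit about equally often, engine damage is rare among the returned planes, so an analysis restricted to the available data would point at exactly the wrong area.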
2) “Companies want to make decisions in ever reducing time frames”. The resolution of this challenge sits not so much with crunching data as with how quickly a company can effect any changes. Large, significant changes take time, weeks or months, because sign-off, alignment and engagement tend to require multiple conversations with multiple areas of the business, and frequently force work streams to re-prioritise existing work stacks. There is a direct relationship between the size of a change and the time required to implement it, and as the size of the potential change reduces, so too does the amount of data required to make the decision.
In the world of Call Centres, for example, implementing an optimisation program takes months because you need to negotiate contractual shift patterns, training requirements, hierarchy organisation etc., regardless of what changes you want to make. The amount of data needed to make the optimisation decisions might be wide-ranging, taking in everything from individual agent skill sets to the reaction latency of support systems to the projected calling behaviour of customers, but instantaneous delivery of data analysis is not going to shorten the time required to complete the changes.
Looking at what changes can be made very quickly shows how little data is required at this level. For example, changes which can be actioned within a few minutes of a decision being made might be aligning call queues or agent skill sets to deal with a surprise influx of calls. You can’t bring more agents in or send agents home in that time; all you can do is re-arrange the resource you have available. The amount of data and analysis required to support this level of decision making is also significantly reduced. You would want to know how many calls are coming in and roughly what skill sets are required to answer them, and you would want to know how many agents you have available to move around and what skill sets they have. Two snapshot views from which to construct a quick ‘best fit’ set-up: this is not Big Data!
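To show how little data the quick ‘best fit’ needs, here is a minimal sketch that works from the two snapshots alone. The queue names, agent names and the greedy rule (place the least flexible agents first, one waiting call per agent) are all assumptions made for illustration; a real call centre would have its own routing rules.

```python
def best_fit(calls_waiting, agents):
    """Greedy allocation of available agents to call queues.

    calls_waiting: dict mapping skill -> number of calls waiting (snapshot one)
    agents: dict mapping agent name -> set of skills (snapshot two)
    """
    remaining = dict(calls_waiting)
    assignment = {}
    # Place single-skilled agents first so multi-skilled agents stay flexible.
    ordered = sorted(agents.items(), key=lambda item: (len(item[1]), item[0]))
    for name, skills in ordered:
        candidates = [s for s in skills if remaining.get(s, 0) > 0]
        if not candidates:
            continue  # nothing this agent can usefully pick up
        # Send the agent to the busiest queue they are skilled for.
        queue = max(candidates, key=lambda s: (remaining[s], s))
        assignment[name] = queue
        remaining[queue] -= 1
    return assignment, remaining

# Hypothetical snapshots taken during a surprise influx of calls.
calls_waiting = {"billing": 2, "technical": 1}
agents = {"Asha": {"billing"}, "Ben": {"technical", "billing"}, "Cal": {"billing"}}

assignment, remaining = best_fit(calls_waiting, agents)
print(assignment)
print(remaining)
```

Two small dictionaries are enough to drive the decision, which is the point being made: at this speed of change the limiting factor is the resource on the floor, not the volume of data.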
3) “You no longer need to see causality- correlation alone will provide the actionable insight”. This is presented as a natural by-product of Big Data: there is now too much data to analyse in depth, but because there is so much of it you supposedly don’t need that deep-dive investigation, as you won’t learn anything extra from it. This is a dangerous, and extremely lazy, way of proceeding, inciting growth without development. Correlation might be great at identifying areas for further investigation; it is not an end point. As discussed in point 1, Big Data does not hold all the data, just all the available data. Investigating causality will identify areas of missing data, which will enhance the next stage of Big Data crunching, and so on in a progressive loop.
An illustration of this point is the development of science leading up to the Large Hadron Collider. George Box (1987) (8) stated that “all models are wrong, but some are useful”, referring to the scientific methodology of formulating a theory, testing it against all known data and holding it as dominant until new data knocks it down and a new model is formulated. Newtonian physics is fine at the macro level but doesn’t hold up at the atomic level. Quantum physics seemed to be the answer as new data showed up the flaws in Newtonian physics, until enhanced measurement tools suggested that this too is just an approximation to the truth, albeit a far more refined one. The LHC is currently trying to obtain physical measurements for hypothetical models, to prove or disprove a degree of accuracy before moving on to the next best-fit model. Theories have moved into as yet untestable areas: highly hypothetical mathematical suppositions involving energies we cannot yet mechanically generate, in dimensions we have difficulty even visualising. But because the relationships are understood we continue to push both the reasoning and the development of physical testing or observation.
If we relied purely on correlations from existing data sets we would not know where to develop our investigations; we would either develop in an unstructured, chaotic way or stagnate. We know which areas we want to measure because we know what data we currently don’t have, and we know that because we know our data.
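The gap between correlation and causality can be made concrete with a toy simulation. Everything here is assumed for illustration: two measured series are both driven by a third factor that is not in the data set, and the variable names and noise levels are invented.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(7)
# Hypothetical driver that was never 'datafied'.
hidden = [rng.gauss(0, 1) for _ in range(5_000)]
# Two recorded measures that each depend on the hidden driver, not on each other.
x = [h + rng.gauss(0, 0.5) for h in hidden]
y = [h + rng.gauss(0, 0.5) for h in hidden]
observed = pearson(x, y)

# Acting on the correlation: change x directly, leaving the hidden driver alone.
x_forced = [rng.gauss(0, 1) for _ in hidden]
after_intervention = pearson(x_forced, y)

print(f"correlation in the collected data: {observed:.2f}")
print(f"correlation once x is set by us:   {after_intervention:.2f}")
```

The collected data shows a strong correlation, yet changing x does nothing to y. Only by asking why the two move together would an analyst discover that the driving factor is missing from the data set and worth measuring next.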
4) “There will be a shortage of appropriately skilled analysts”. This is chronic scaremongering. For companies that adopt Big Data strategies there will be a need to retrain existing analysts and possibly recruit new ones, but that would be the case with any new analytic strategy. Competition for resource will be set by the overall market desire to adopt increased analytical guidance, and if we take the SAS (5) report over Gartner (4), the advent of Big Data is unlikely to have a significant impact on it.
The advance in data storage and manipulation is clearly a good thing: we can do more with it than we could without it. But it is not a single solution, despite the constant evangelising and cajoling in the trade media. In conjunction with other analytics it is a very powerful tool, when it is needed. In most industries it is difficult to see where there is a requirement, because most industries simply do not handle or process enough data (9).
The confusion between Big Data and All Data is apparent in the constant placement of web adverts selected to appeal to my particular demographics. The problem is that these sites do not know everything about me, and the assumptions in play constantly demonstrate it. I am a frequent web user but I will never respond to a web advert. NEVER. I do not appreciate their invasive manner; rather than attracting me, they repel. I do not like their content. I am an iPhone adopter, so trying to sell me Android smartphones is useless. I am not interested in the constant variation of dating sites available for my age group. I recently bought some DVDs online as a present for a friend, and I am now bombarded with film choices which might appeal to my friend but are not to my taste. The correlations in play do not produce relevant promotions because they do not have understanding behind them; they do not understand me. A 2010 paper (10) found that targeted ads were no more effective in eliciting positive reactions than standard untargeted banner ads.
In case this is sounding very negative and Luddite, let me be clear: Big Data technology is a wonderful development. We are developing ways of storing and handling significantly more data, and we are able to process more data in shorter timescales. Big Data techniques also let us homogenise disparate data types, and people are looking at the value stored in non-quantifiable data types too. It is also encouraging companies to put their trust in data-led decision making far more than has previously been the case. Big Data encourages progress in an appropriate and valuable direction. Without Big Data capability the CERN Large Hadron Collider (12) would not have isolated the evidence for the existence of Higgs particles, a handful of observations out of the hundreds of millions of observable opportunities measured each second. Avis Car Hire (13) increased revenue through the hardest part of the recent US recession by applying Big Data analytics to customer activity data to spot new market opportunities.
However, Big Data is not yet for everyone and it is certainly not a miracle cure. It still needs appropriate understanding of the outputs to make it effective, just like any other analysis. It also requires businesses to be ready to react to the results by changing policies, processes, strategies and direction quickly. This is a massive operational shift for most enterprises. The reality is that in adopting Big Data a company needs to understand it is typically adopting a new culture, not just improving the technical architecture of its Intelligence department. Companies need to realise just how much they need to take on to get Big Data working effectively for them, and only then can they determine if it is a worthwhile exercise at this time.
Give us a call or e-mail to discuss how we at Prosperity 24.7 can help your business, on +44 (0) 1534 877247 or firstname.lastname@example.org.
1 Laney, Douglas. “3D Data Management: Controlling Data Volume, Velocity and Variety”. Gartner. Retrieved 6 February 2001.
8 Box, George E. P.; Draper, Norman R. (1987). Empirical Model-Building and Response Surfaces, p. 424. Wiley.