An intro of sorts to the concept of Big Data, based on the book “Big Data: A Revolution That Will Transform How We Live, Work, and Think” by Viktor Mayer-Schönberger and Kenneth Cukier.
Do you remember your stats classes?
Statistics involves a few basic steps. Form a hypothesis about a population. Take a random sample, run statistical tests, and draw a conclusion about the population, at a certain confidence level, from this small quantity of data. (Samples need to be random to give an accurate, unbiased picture.)
An example: I have a hypothesis that for Indians, time spent on the internet increases with salary. I interview a random set of people across the country and note down their salaries and the hours they spend on the internet. I calculate the correlation coefficient, and if it is more than 0.9, I take my hypothesis to be supported. Based on these results, I consider the hypothesis true for all of India.
This sampling methodology has been followed because it is extremely difficult to do three things with large amounts of data: collect, store and analyze it.
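The check in the salary example can be sketched in a few lines of Python. This is only an illustration: it assumes the measure meant is the Pearson correlation coefficient, the salary and hours figures are made up, and the 0.9 cut-off is simply the one used in the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up sample: monthly salary (thousands of rupees) and daily hours online
salaries = [20, 35, 50, 65, 80, 95]
hours_online = [1.0, 1.5, 2.5, 3.0, 3.5, 4.5]

r = pearson_r(salaries, hours_online)
# Threshold taken from the example in the text
supports_hypothesis = r > 0.9
print(f"r = {r:.3f}, hypothesis supported: {supports_hypothesis}")
```

A high r on the sample only says that the two numbers move together in that sample; how far the result carries over to the whole country depends on how random and how large the sample is.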
But what if, with recent progress in technology, we can do all of these on the entire population?
And that is Big Data.
We now have both the infrastructure and the technological prowess to handle huge amounts of data.
Big Data is less concerned with a few missing or inaccurate data points, since computers now deal with millions of points.
Also, with so many data sets available, instead of forming a hypothesis and testing it, why not just look for trends by matching seemingly unrelated data sets? Correlation might soon become as important as causation, or even more so. (Correlation means two variables tend to change together; causation means a change in one variable causes a change in the other.)
We are encountering more and more examples of Big Data. One given in the book is how Google used search data to track the spread of flu in the US during the 2009 H1N1 outbreak. Though the two data sets, search terms and people with the illness, are not obviously related, the results were surprisingly accurate.
In its translation software, Google uses statistical matching instead of direct translation. That is, for a given word, Google picks the translation it has seen paired with that word the most times in the text it crawls on the web.
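To make the idea of "picking the most frequent match" concrete, here is a toy sketch. It is not Google's actual system: the word pairs are invented, and the names `aligned_pairs`, `build_table` and `translate` are hypothetical. It only shows the counting idea.

```python
from collections import Counter, defaultdict

# Invented (source word, target word) pairs, standing in for matches
# that would be observed across a very large body of translated text.
aligned_pairs = [
    ("chien", "dog"), ("chien", "dog"), ("chien", "hound"),
    ("chat", "cat"), ("chat", "cat"), ("chat", "chat"),
]

def build_table(pairs):
    """Count how often each target word is seen paired with each source word."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table

def translate(word, table):
    """Return the most frequently seen match, or the word itself if unseen."""
    if word not in table:
        return word
    return table[word].most_common(1)[0][0]

table = build_table(aligned_pairs)
print(translate("chien", table))
```

With more text, the counts grow and the rare, wrong matches get outvoted, which is why sheer volume matters more here than clean data.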
Amazon recommends books by inferring trends from what users browse (again, millions of data points). Each time you click on a suggested book, the trend is strengthened, and if you don’t, the algorithm learns and makes changes.
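A rough sketch of how such a feedback loop could work is below. It is an assumption for illustration, not Amazon's algorithm: the browsing sessions and book names are made up, and `build_counts`, `recommend` and `record_click` are hypothetical helpers. Books browsed together are counted, the most frequent companion is suggested, and a click on a suggestion adds to the count.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Made-up browsing sessions: each list is the books one user looked at
sessions = [
    ["stats", "databases"],
    ["stats", "databases", "privacy"],
    ["stats", "privacy"],
    ["databases", "privacy"],
    ["stats", "databases"],
]

def build_counts(sessions):
    """Count how often each pair of books appears in the same session."""
    counts = defaultdict(Counter)
    for session in sessions:
        for a, b in permutations(set(session), 2):
            counts[a][b] += 1
    return counts

def recommend(book, counts, k=1):
    """Suggest the k books most often browsed together with `book`."""
    return [title for title, _ in counts[book].most_common(k)]

def record_click(book, suggested, counts):
    """A click on a suggestion strengthens the link in both directions."""
    counts[book][suggested] += 1
    counts[suggested][book] += 1

counts = build_counts(sessions)
print(recommend("stats", counts))
```

In this toy version, suggestions that never get clicked simply fall behind the ones that do, which is one simple way the "learning" described above could happen.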
But there are concerns:
A lot of data is currently being tracked, like which websites you visit, what you buy, etc., which brings up issues of privacy. (Every time we go on the internet, we leave behind such details, known as “data exhaust”.)
Also, in the example above, looking at what people search for on the internet, I might even conclude that Indians with higher salaries suffer greatly from hypertension. But are all such conclusions correct?
The authors suggest a new class of auditors called “algorithmists”, who will review whether data has been collected as per set guidelines and also validate the predictions.
Going forward, the data we process will more closely represent how the world actually is: messy, non-linear, huge. And the way we deal with it, i.e. the field of statistics, will be entirely different from what it was in the information-starved days.
The authors expect this to revolutionize not only the way things work, but also how we view and think about our world in the coming times.