In a world of new ideas, new processes and digital information, Big Data has emerged as a discussion topic, a control mechanism and a way to make more money by collecting information on a massive scale. The information (data) collected can be structured or unstructured and may include location, browsing patterns, text, audio, video, images, behavioural patterns, trends and more. It is labelled ‘Big Data’ because the sheer volume of data collected makes it too difficult to process using traditional database and software techniques, and it arrives too fast to be handled by existing processing capacity.
What is Big Data used for?
A primary use for the data collected is to discover patterns and trends related to human behaviour and how we interact with technology. The results generated by analysing the data can be used to make decisions that affect how we live, work, and play.
How much Data is out there?
- By 2020, there will be around 40 trillion gigabytes of data (40 zettabytes). (Source: EMC)
- 90% of all data has been created in the last two years. (Source: IBM)
- Today it would take a person approximately 181 million years to download all the data from the internet. (Source: org)
- Internet users generate about 2.5 quintillion bytes of data each day. (Source: Data Never Sleeps 5.0)
- In 2018, internet users spent 2.8 million years online. (Source: Global Web Index)
- Social media accounts for 33% of the total time spent online. (Source: Global Web Index)
- In 2019, there are 2.3 billion active Facebook users, and they generate a lot of data. (Source: Data Never Sleeps)
- Twitter users send nearly half a million tweets every minute. (Source: Domo)
- 2% of organizations are investing in big data and AI. (Source: New Vantage)
- Using big data, Netflix saves $1 billion per year on customer retention. (Source: Inside Big Data)
- In 2019, the big data and analytics market was worth $49 billion. (Source: Wikibon)
- In 2019, the big data market is expected to grow by 20%. (Source: Statista)
- Job listings for data science and analytics will reach around 2.7 million by 2020. (Source: Forbes)
- By 2020, every person will generate 1.7 megabytes of data every second. (Source: Domo)
- Automated analytics will be vital to big data by 2020. (Source: Flat World Solutions)
Where is Big Data analysis used?
It is fair to assume that every industry and sector has a use for the results and insights provided by big data analysis. Just to name a few of them:
- Financial markets
- Gambling and betting
- Urban planning
- Retail banking
- Mining and resources
- Consumer products (Fast Moving Consumer Goods – FMCG)
- Healthcare and pharmaceutical
- Web analytics
What tools are used to analyse Big Data?
There are several software tools used to analyse big data which include NoSQL databases, Hadoop, and Spark. With the help of big data analytics tools, we can gather different types of data from the most versatile sources – digital media, web services, business apps, machine log data, etc.
The following comparison between Hadoop and Spark appeared on the Logz.io blog (https://logz.io/blog/hadoop-vs-spark/):
“Upon first glance, it seems that using Spark would be the default choice for any big data application. However, that’s not the case. MapReduce has made inroads into the big data market for businesses that need huge datasets brought under control by commodity systems. Spark’s speed, agility, and relative ease of use are perfect complements to MapReduce’s low cost of operation.
The truth is that Spark and MapReduce have a symbiotic relationship with each other. Hadoop provides features that Spark does not possess, such as a distributed file system and Spark provides real-time, in-memory processing for those data sets that require it. The perfect big data scenario is exactly as the designers intended—for Hadoop and Spark to work together on the same team.”
Hadoop is a general-purpose form of distributed processing that has several components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop is built in Java and is accessible through many programming languages for writing MapReduce code, including Python through a Thrift client. (https://logz.io/blog/hadoop-vs-spark/)
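To make the MapReduce algorithm mentioned above concrete, here is a minimal sketch of its three phases (map, shuffle, reduce) for a word count, written in plain Python. This is an illustration of the data flow only; a real Hadoop job distributes these phases across a cluster rather than running them in one process.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all emitted counts by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Tiny illustrative input; Hadoop would read this from HDFS instead.
docs = ["big data moves fast", "big data is big"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each mapper and reducer only sees its own slice of the data, the same pattern scales out to commodity machines, which is the low-cost operation the quoted comparison attributes to MapReduce.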
Spark is structured around Spark Core, the engine that drives the scheduling, optimizations, and RDD abstraction, as well as connecting Spark to the correct filesystem (HDFS, S3, an RDBMS, or Elasticsearch). Several libraries operate on top of Spark Core, including Spark SQL, which allows you to run SQL-like commands on distributed data sets; MLlib for machine learning; GraphX for graph problems; and Spark Streaming, which allows for the input of continually streaming log data.
Spark has several APIs. The original interface was written in Scala, and based on heavy usage by data scientists, Python and R endpoints were also added. Java is another option for writing Spark jobs. (https://logz.io/blog/hadoop-vs-spark/)
What programming skills are required to work with Big Data?
Coding is essential for undertaking numerical and statistical analysis of massive data sets, and the languages currently in use include Python, R, Java, and C++.
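As a small taste of the numerical and statistical work involved, the sketch below summarises a sample of user session lengths with Python's standard `statistics` module. The data set and variable names are invented for illustration; at real big-data scale this kind of summary would be computed with tools such as Spark or a distributed database rather than in-memory lists.

```python
import statistics

# Hypothetical sample: minutes per user session (illustrative data only).
session_minutes = [12, 7, 45, 3, 22, 9, 31, 5]

mean = statistics.mean(session_minutes)      # average session length
median = statistics.median(session_minutes)  # robust to outliers like 45
stdev = statistics.stdev(session_minutes)    # spread of the sample

print(round(mean, 2))  # 16.75
print(median)          # 10.5
```

The same descriptive statistics underpin the behavioural patterns and trends described earlier; the languages listed above are used because they all offer mature libraries for this kind of analysis at much larger scale.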
The collection of data will continue to expand on a massive scale, impacting our daily lives in many ways. We live in a world where almost everything is on view, and the more data that is collected, the more others know and understand our behavioural patterns.