Sunday, November 27, 2016

Big data for BI people

“You are going to join a Big Data team” … that’s how the whole story started. It was obvious that, after learning all the data warehousing concepts, working in several tools and picking up the newly added Business Discovery tool, the next hop would be Big Data. So I started learning, and the first thing my friends told me was: Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, and everyone thinks everyone else is doing it, so everyone claims they are doing it... hmm, interesting. Then, with the usual reflex, I typed, “What is Big Data?” Google returned plenty of similar sentences from all over the web: large data sets, trillions of records, cannot be handled by traditional systems, Volume, Velocity, Variety, Veracity, blah blah blah blah.

            I believe many of us who are shifting gears from Business Intelligence to Big Data analytics have similar stories. The best way to learn, TO ME, is learning through comparison. This article provides a short comparative analysis between Big Data and Business Intelligence (for kick-start purposes).

            Both are similar in nature, as the goal is the same – a decision support system for decision makers. Business Intelligence also did the same thing in the beginning: process large data sets, analyse them against different parameters and present them with proper data visualization. The problem started when data grew significantly and new data influencers (social networking) were introduced on our blue planet. Business Intelligence based on the 3NF data warehouse was no longer able to process that amount of data in a small window of time and produce results for analytics. Ultimately a data warehouse sits on an RDBMS, which is based on seek and read. Doug Cutting, the creator of Hadoop, came to save the world by mixing parallel processing, data indexing and distributed networking with a filesystem approach. Setting aside the RDBMS concept of data processing, HDFS (Hadoop Distributed File System) was born, treating storage as a filesystem.

            Unlike an RDBMS, Hadoop keeps the data redundantly in at least 3 places (the replication factor) on what are called Data Nodes, and keeps the metadata in a Name Node. This setup relaxes the concept of ACID (Atomicity, Consistency, Isolation and Durability) towards BASE (Basically Available, Soft state, Eventually consistent), in line with the CAP theorem (Consistency, Availability, Partition tolerance). The big headache of data backup is largely gone, as replication is integrated with data loading.
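To picture what the Name Node’s metadata looks like, here is a tiny, purely illustrative Python sketch (not real Hadoop code; the node names and placement logic are made up) of assigning the blocks of one file to Data Nodes with a replication factor of 3:

```python
import itertools

REPLICATION_FACTOR = 3                      # HDFS default
DATA_NODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def place_blocks(file_name, num_blocks):
    """Toy Name Node: record which Data Nodes hold each block of a file."""
    node_cycle = itertools.cycle(DATA_NODES)
    metadata = {}
    for block_id in range(num_blocks):
        # pick the next 3 nodes round-robin; real HDFS placement is rack-aware
        replicas = [next(node_cycle) for _ in range(REPLICATION_FACTOR)]
        metadata[f"{file_name}/block-{block_id}"] = replicas
    return metadata

# e.g. a file split into 4 blocks
for block, nodes in place_blocks("sales.csv", 4).items():
    print(block, "->", nodes)
```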

Let’s clarify the new BASE jargon further:

Basically Available: If a computing node fails, another node in the cluster will still be available to answer queries.

Soft state: The state of the data can change over time, even without any additional data changes being made. This is a consequence of eventual consistency.

Eventual consistency: The trouble with trying to maintain changing data in a cluster-based data storage system is that data is replicated across multiple locations. A change that is made in one place may take a while to propagate to another place. So, if two people send a query at the same time and hit two different replicated versions of the data, they may get two different answers. Eventually, the data will be replicated across all copies, and the data, assuming no other changes are made in the meantime, will then be consistent. This is called “eventual consistency.”
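To make eventual consistency concrete, here is a small, purely illustrative Python sketch (no real Hadoop or HBase API; the replica names and keys are invented) of two replicas answering the same query differently until the change propagates:

```python
class Replica:
    """One copy of the data on one node of the cluster."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def read(self, key):
        return self.data.get(key)

replica_a = Replica("node-a")
replica_b = Replica("node-b")

# A write lands on replica A first...
replica_a.data["customer:42"] = "GOLD"

# ...so two clients querying at the same moment can get different answers.
print(replica_a.read("customer:42"))   # GOLD
print(replica_b.read("customer:42"))   # None  (stale copy)

# Eventually the change propagates and both replicas agree.
replica_b.data.update(replica_a.data)
print(replica_b.read("customer:42"))   # GOLD
```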

Data warehousing was much more specialized in nature, with lots of rules. There is the famous four-part dictum: Subject-oriented, Integrated, Time-variant, and Non-volatile. HDFS does not have this rulebook. You can store data on any subject and in any format – structured (database tables), semi-structured (XML, JSON), unstructured (Twitter feeds) – it can be volatile or non-volatile, and you can store data with or without a timestamp. No boundaries means more freedom, but it can also become full of garbage if not carefully designed.

So what brings that freedom? It’s MapReduce. MapReduce is the underlying programming model on which Hadoop is completely dependent (later, YARN was introduced to manage resources). The Wikipedia page on MapReduce explains neatly how it is done:

"Map" step: Each worker node applies the "map()" function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed.
"Shuffle" step: Worker nodes redistribute data based on the output keys (produced by the "map()" function), such that all data belonging to one key is located on the same worker node.
"Reduce" step: Worker nodes now process each group of output data, per key, in parallel.  

Now that we understand how the data is stored, let’s move on to the architecture of Hadoop, commonly known as the Hadoop ecosystem, which gives us the tools we are going to use.

Here in Big Data there are separate tools for the E, the T and the L. Sqoop does the extraction and loading from heterogeneous sources into Hadoop. Once the data is in Hadoop, Pig can do all the transformations an ETL tool can do; after transformation it stores the data back in Hadoop. As SQL is always in demand, Apache gave us Hive to run HQL (Hive Query Language), which follows a syntax similar to MySQL. You can do normal DML operations in Hive once the data is loaded into it. Hive is not a traditional RDBMS; it is a huge MapReduce program that gives you the feel of an RDBMS. You can also load HDFS data into a traditional RDBMS like MySQL or PostgreSQL through external tables. You can run Java, Python or similar languages to do several tasks. Now you have Sqoop jobs, Pig scripts, Java code and shell files, and you want to create a workflow so you can run them again and again. Oozie does that for you.
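As an illustration of how close Hive feels to a normal database, here is a minimal sketch of querying a Hive table from Python. It assumes the PyHive library, a HiveServer2 endpoint on localhost and a made-up sales table; none of these specifics come from the article itself:

```python
from pyhive import hive    # assumes the PyHive package is installed

# Connect to HiveServer2 (host, port and user are illustrative)
conn = hive.Connection(host="localhost", port=10000, username="bi_user")
cursor = conn.cursor()

# Plain HQL - the syntax feels like MySQL, but it runs as MapReduce jobs underneath
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales
    GROUP BY region
""")

for region, total_sales in cursor.fetchall():
    print(region, total_sales)
```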

Any other application can be connected through the different connectors available in the Apache distribution. Any BI reporting tool can connect to HDFS. As these tools need structured data for their visualizations, they can connect to Hive or to the traditional databases loaded from the external tables of HDFS.

As it is an open-source platform, every other day a new tool is introduced or gets connected to the Hadoop platform. The picture below captures just a snapshot of the tools in the Big Data diaspora.

Now we know the basics and the tools. Then what do Cloudera, Hortonworks and MapR do, if everything is Apache? Apache Hadoop is the core, and Cloudera, Hortonworks, MapR and Pivotal are distributions. They create their own tools or setups to package the base architecture provided by Apache. As an example, Cloudera’s Hue tool can create Oozie workflows graphically. Pivotal HAWQ (now incubating in Apache) is an RDBMS that comes with the Hortonworks Data Platform.

Finally, the comparison is over and you are ready to learn Big Data. My intention was to give you a kick-start by comparing the basic similarity between an apple and an orange – both are fruits. There is not much else to compare, as Hadoop is actually a middleware infrastructure for parallelism. Yes, it can reduce some of the ETL overhead and bring lightning speed to data retrieval, but it is not designed to replace the Data Warehouse.