“You are going to join a Big Data team” … that’s how the whole story started. It’s obvious that after learning all the data warehousing concepts, working with several tools and a newly added Business Discovery tool, the next hop would be Big Data. So I started learning, and the first thing my friends told me was, “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, and everyone thinks everyone else is doing it, so everyone claims they are doing it...” Hmm, interesting. Then, with my usual reflex, I typed, “What is Big Data?” Google returned plenty of similar sentences from across the web: large data sets, trillions of records, cannot be handled by traditional systems, Volume, Velocity, Variety, Veracity, blah blah blah.
I believe many of us who are shifting gears from Business Intelligence to Big Data Analytics have similar stories. The best way to learn, for me, is through comparison, so this article provides a short comparative analysis between Big Data and Business Intelligence (for kick-start purposes).
Unlike an RDBMS, Hadoop keeps the data redundantly in at least three places (the replication factor) called Data Nodes, and keeps the metadata in a Name Node. This setup moves us from ACID (Atomicity, Consistency, Isolation and Durability) to BASE (Basically Available, Soft state, Eventually consistent), the trade-off described by the CAP theorem (Consistency, Availability, Partition tolerance). The big headache of data backup is gone now, as it is integrated with data loading.
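To picture how that works, here is a toy Python sketch (not real HDFS code; the file path, block and node names are made up for illustration) of the kind of metadata a Name Node keeps while the Data Nodes hold the actual copies:

    # Toy illustration of HDFS-style replication; not real HDFS code.
    REPLICATION_FACTOR = 3  # HDFS default

    # The Name Node stores only metadata: which Data Nodes hold each block.
    name_node_metadata = {
        "/landing/sales/part-0000": [
            {"block": "blk_001", "replicas": ["datanode1", "datanode2", "datanode4"]},
            {"block": "blk_002", "replicas": ["datanode2", "datanode3", "datanode1"]},
        ]
    }

    def locate_blocks(path):
        """Return, per block, the Data Nodes that can serve a copy."""
        return [(b["block"], b["replicas"]) for b in name_node_metadata[path]]

    # If datanode1 dies, blk_001 can still be read from datanode2 or datanode4,
    # which is why a separate backup step is no longer a headache.
    print(locate_blocks("/landing/sales/part-0000"))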
Let’s clear up the new BASE jargon further:
Basically Available: If a computing unit fails, another node in the cluster will still be available to serve queries.

Soft state: The state of the data can change over time, even without any additional data changes being made. This is because of eventual consistency.

Eventual consistency: The trouble with maintaining changing data in a cluster-based storage system is that the data is replicated across multiple locations. A change made in one place may take a while to propagate to another place. So, if two people send a query at the same time and hit two different replicated versions of the data, they may get two different answers. Eventually the data will be replicated across all copies and, assuming no other changes are made in the meantime, will then be consistent. This is called “eventual consistency.”
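A tiny Python sketch (a made-up two-replica toy, not any particular database) may make the idea concrete:

    # Two replicas of the same record on two different nodes.
    replica_a = {"stock": 100}
    replica_b = {"stock": 100}

    # A write lands on replica A first...
    replica_a["stock"] = 90

    # ...so two readers querying at the same moment can get different answers.
    print("reader 1 (hits A):", replica_a["stock"])  # 90
    print("reader 2 (hits B):", replica_b["stock"])  # 100 (stale - soft state)

    # Once the change propagates, the copies converge: eventual consistency.
    replica_b.update(replica_a)
    print("after sync:", replica_a["stock"], replica_b["stock"])  # 90 90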
Data warehousing was much more specialized in nature, with lots of rules. There are the famous four dictums: Subject-Oriented, Integrated, Time-Variant and Non-Volatile. HDFS has no such rulebook. You can store data on any subject and in any format – structured (database tables), semi-structured (XML, JSON), unstructured (Twitter feeds); it can be volatile or non-volatile, and you can store data with or without a timestamp. No boundaries mean more freedom, but also a store that can fill up with garbage if it is not carefully designed.
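That freedom is often called “schema-on-read”: you decide how to interpret a record only when you read it. A rough Python sketch, with made-up sample records standing in for files sitting side by side in one HDFS folder:

    import csv
    import json

    # Made-up raw records of three different shapes living in the same store.
    raw_lines = [
        "101,2016-03-01,450.00",                    # structured (table export)
        '{"id": 102, "amount": 99.5}',              # semi-structured (JSON)
        "just landed in #hadoop land, loving it!",  # unstructured (tweet text)
    ]

    for line in raw_lines:
        # Schema-on-read: pick a parser per record, at read time.
        if line.startswith("{"):
            print("json ->", json.loads(line))
        elif line.count(",") == 2 and line.split(",")[0].isdigit():
            print("csv  ->", next(csv.reader([line])))
        else:
            print("text ->", line)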
So what brings that freedom? It’s MapReduce. MapReduce is the underlying programming model on which Hadoop is completely dependent (later YARN was introduced). The Wikipedia page on MapReduce explains how it is done:
"Map" step: Each worker node
applies the "map()" function to the local data, and writes the
output to a temporary storage. A master node ensures that only one copy of
redundant input data is processed.
"Shuffle" step: Worker nodes
redistribute data based on the output keys (produced by the "map()"
function), such that all data belonging to one key is located on the same
worker node.
"Reduce" step: Worker nodes
now process each group of output data, per key, in parallel.
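Hadoop’s native MapReduce API is Java, but the model itself fits in a few lines of Python. A minimal word-count sketch of the three steps above (on a real cluster the map and reduce calls run on different worker nodes, not in one process):

    from collections import defaultdict

    documents = ["big data is big", "data is everywhere"]

    # "Map" step: emit (key, value) pairs from each input record.
    def map_fn(line):
        for word in line.split():
            yield (word, 1)

    # "Shuffle" step: group every value emitted for the same key together.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)

    # "Reduce" step: process each key's group (in parallel on a real cluster).
    def reduce_fn(key, values):
        return (key, sum(values))

    print([reduce_fn(k, v) for k, v in groups.items()])
    # [('big', 2), ('data', 2), ('is', 2), ('everywhere', 1)]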
Now that we understand how the data is stored, let’s move to the architecture of Hadoop, commonly known as the Hadoop Ecosystem, which gives us the tools we are going to use.
Here in Big Data there are separate tools for E, T and L. Sqoop does the extraction and load from heterogeneous sources into Hadoop. Once the data is in Hadoop, Pig can do all the transformations an ETL tool can do, and after transformation it stores the data back in Hadoop. As SQL is always in demand, Apache gave us Hive to run HQL (Hive Query Language), which follows a syntax similar to MySQL. You can do normal DML operations in Hive once the data is loaded into it. Hive is not a traditional RDBMS; it is a huge MapReduce program that gives you the feel of an RDBMS. You can also load HDFS data into a traditional RDBMS like MySQL or PostgreSQL through external tables. You can run Java, Python or similar languages to do several tasks. Now you have Sqoop jobs, Pig scripts, Java code and shell files, and you want to create a workflow so you can run them again and again. Oozie does that for you.
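To give a feel for HQL from a programming language, here is a hedged sketch using the third-party PyHive package; the host, port, database and the sales table are assumptions for illustration, not anything from this article:

    # Sketch only: assumes a HiveServer2 at localhost:10000 and a "sales" table
    # already loaded (say, via Sqoop). Requires the PyHive package.
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()

    # HQL reads like MySQL-style SQL, but Hive runs it as MapReduce jobs.
    cursor.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
    for region, total in cursor.fetchall():
        print(region, total)

    cursor.close()
    conn.close()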
Any other application can be connected through the different connectors available in the Apache distribution. Any BI reporting tool can connect to HDFS; since such tools need structured data for their visualizations, they connect to Hive or to the other traditional databases loaded from the external tables of HDFS.
As it is an open-source platform, every other day a new tool is introduced or gets connected to the Hadoop platform. A snapshot of the tools in the Big Data diaspora is captured in the picture.
Now we know the basics and the tools. Then what do Cloudera, Hortonworks and MapR do, if everything is Apache? Apache Hadoop is the core, and Cloudera, Hortonworks, MapR and Pivotal are distributions. They create their own tools and setups to package the base architecture provided by Apache. As an example, Cloudera’s Hue tool can create Oozie workflows graphically, and Pivotal HAWQ (now incubating in Apache) is an RDBMS that comes with the Hortonworks Data Platform.
Finally the comparison is over and you are ready to learn Big Data. My intention was to give you a kick start by pointing out the basic similarity between an apple and an orange – both are fruits. There is no other comparison between the two, as Hadoop is actually a middleware infrastructure for parallelism. Yes, it can reduce some of the ETL overhead and bring lightning speed to data retrieval, but it is not designed to replace the Data Warehouse.