Hadoop – Introduction

Last Updated : 29 Jul, 2021

The definition of a powerful person has changed: today, a powerful person is one who has access to data, because data is growing at a tremendous rate. By a commonly quoted estimate, about 90% of all the data in the world was produced in just the last 2 to 4 years. Consider that a newborn child now faces the flash of a camera before she faces her mother. All these pictures and videos are data. So are emails, data from smartphone applications, statistical data, and so on. All this data has enormous power to influence events and trends: it is used not only by companies to influence their consumers but also by politicians to influence elections. Data at this scale is referred to as Big Data. In a world where data is produced at such a rate, it needs to be stored, maintained, and analyzed. This is where Hadoop comes in.

Hadoop is an open-source framework, a set of tools distributed under the Apache License. It is used to store, manage, and process data for big data applications running on clustered systems. Big Data was earlier characterized by the “3Vs”, but it is now commonly described by “5Vs”, which are also termed the characteristics of Big Data.
 

  1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by social networking sites, sensors, scanners, airlines, and other organizations.
  2. Velocity: A huge amount of data is generated every second. One widely quoted estimate projected that by 2020 about 1.7 MB of data would be created every second for every person. Velocity refers to this speed at which data is generated and must be processed.
  3. Variety: The data being produced comes in three types:
    • Structured Data: Relational data that is stored in rows and columns.
    • Unstructured Data: Text, pictures, videos, etc., which cannot be stored in rows and columns.
    • Semi-Structured Data: Data that has some organizing structure but no fixed schema. Log files are an example.
  4. Veracity: Veracity refers to the trustworthiness of data. Inconsistent or incomplete data leads to doubtful or uncertain information. Inconsistency often arises from the amount of data itself: data in bulk can create confusion, whereas too little data can convey only partial or incomplete information.
  5. Value: After taking the first four V’s into account, there is one more V, which stands for Value. A bulk of data that has no value is of no use to a company unless it is turned into something useful. Data in itself is of little importance; it must be converted into valuable information. Hence, Value can be considered the most important of the 5 V’s.

 

Evolution of Hadoop: Hadoop was designed by Doug Cutting and Michael Cafarella in 2005, and its design is inspired by Google. Hadoop stores huge amounts of data in the Hadoop Distributed File System (HDFS) and processes that data with MapReduce. The designs of HDFS and MapReduce are inspired by the Google File System (GFS) and Google's MapReduce, respectively. In the early 2000s Google overtook the existing search engines and became the most popular and profitable search engine. Its ability to handle data at that scale was attributed to GFS and MapReduce, which were not publicly documented at the time. In 2003 Google published a paper on GFS, but that alone was not enough to understand how Google processed its data. In 2004 Google published a second paper, on MapReduce. Doug Cutting and Michael Cafarella studied those papers and designed what is now called Hadoop in 2005. Doug’s son had a toy elephant named Hadoop, so Doug and Michael gave their new creation the name “Hadoop”, and hence the toy elephant logo. This is how Hadoop evolved. For more details about the evolution of Hadoop, you can refer to Hadoop | History or Evolution

Traditional Approach: Suppose we want to process some data. In the traditional approach, data was stored on local machines and processed there. As data grew, local machines were no longer capable of storing such huge data sets, so data started to be stored on remote servers. To process that data in the traditional approach, it first has to be fetched from the servers and then processed. Suppose the data set is 500 GB. In practice, fetching that much data is slow and expensive. This approach is also called the Enterprise Approach.
In the Hadoop approach, instead of fetching the data to local machines, we send the query to the data. The query is far smaller than the data it processes. Moreover, in the cluster, the query is divided into several parts, and these parts process different portions of the data simultaneously on the machines where that data is stored. This is called parallel execution and is made possible by MapReduce. So not only is there no need to fetch the data, but the processing also takes less time. Only the result of the query is sent back to the user. Thus Hadoop makes storing, processing, and analyzing data much easier than the traditional approach.
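
To put the 500 GB example in numbers, here is a back-of-the-envelope sketch in Python. The 1 Gbit/s link speed and the 10 KB query size are assumptions chosen only for illustration, and protocol overhead is ignored.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Time to move size_bytes over a link, ignoring protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


GB = 10**9      # decimal gigabyte, in bytes
LINK = 10**9    # assumed network link: 1 Gbit/s

# Traditional approach: ship the whole data set to the program.
data_time = transfer_seconds(500 * GB, LINK)

# Hadoop approach: ship a small program (assumed ~10 KB) to the data.
query_time = transfer_seconds(10_000, LINK)

print(f"Moving 500 GB of data: {data_time:.0f} s (about {data_time / 60:.0f} min)")
print(f"Moving the query:      {query_time:.5f} s")
```

Under these assumptions the data transfer alone takes over an hour before any processing starts, while shipping the query takes a fraction of a millisecond.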

Components of Hadoop: Hadoop has three components: 

  1. HDFS: The Hadoop Distributed File System is a dedicated file system that stores big data on a cluster of commodity (inexpensive) hardware and is designed for a streaming access pattern. It stores data on multiple nodes in the cluster, which provides reliability and fault tolerance.
  2. MapReduce: Data stored in HDFS also needs to be processed. Suppose a job is submitted to process a data set in HDFS. Hadoop splits the input into parts and runs a map task on each part in parallel, preferably on a node that already stores that part. Each map task emits intermediate key-value pairs. This is the map phase. The intermediate pairs are then grouped by key, and reduce tasks combine the values for each key into the final result, which is returned to the user. This is the reduce phase. Thus, while HDFS is used to store the data, MapReduce is used to process it.
  3. YARN: YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop, often compared to an operating system for the cluster: it manages the cluster's resources and also serves as the framework for job scheduling. The available schedulers include the FIFO (First Come First Serve) Scheduler, the Fair Scheduler, and the Capacity Scheduler. In Apache Hadoop, the Capacity Scheduler is the default.
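
The map and reduce steps can be illustrated with the classic word count example. The following is a minimal single-process sketch in plain Python, not Hadoop's actual Java API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and a real cluster would run the map tasks on different nodes instead of looping.

```python
from collections import defaultdict


def map_phase(split: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(splits):
    # On a cluster each split would be mapped on a different node; here we loop.
    pairs = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Two input splits standing in for two blocks of a file.
splits = ["big data needs big storage", "hadoop stores big data"]
counts = word_count(splits)
print(counts)
```

Each split is mapped independently of the others, which is what allows the map phase to run in parallel. Only the grouped intermediate pairs need to be brought together for the reduce phase.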

How do the components of Hadoop make it a solution for Big Data?

  1. Hadoop Distributed File System: On a local PC, the file system block size on the hard disk is typically 4 KB. HDFS uses a much larger block size of its own because it is meant to store huge files: the default was 64 MB in Hadoop 1 and is 128 MB from Hadoop 2 onward, and it can be configured. HDFS works with a NameNode and DataNodes. The NameNode is the master service and keeps the metadata, that is, which blocks make up each file and which machines hold them, while the DataNodes store the actual data. Because the blocks are large, there are far fewer blocks to track, so less storage is needed for metadata. Also, by default HDFS stores three copies (replicas) of every block on different DataNodes, so the failure of a single machine does not cause data loss.
  2. MapReduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part processes its share of the data concurrently. This parallel execution helps a query finish faster and makes Hadoop a suitable choice for dealing with Big Data.
  3. YARN: Yet Another Resource Negotiator plays a role for Hadoop similar to that of an operating system. Just as an operating system manages a machine's resources, YARN manages the resources of the cluster so that Hadoop can serve big data workloads efficiently.
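
The effect of block size on metadata can be checked with simple arithmetic. The sketch below assumes a 1 GB file and compares the block sizes mentioned above. The helper names `block_count` and `raw_storage` are hypothetical, and the replication factor of 3 matches the three copies described in the text.

```python
KB, MB, GB = 1024, 1024**2, 1024**3


def block_count(file_size: int, block_size: int) -> int:
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = 3) -> int:
    """Bytes consumed across the cluster when every block is replicated."""
    return file_size * replication


file_size = 1 * GB  # assumed example file
for label, size in [("4 KB", 4 * KB), ("64 MB", 64 * MB), ("128 MB", 128 * MB)]:
    blocks = block_count(file_size, size)
    print(f"{label:>6} blocks: {blocks:>7} block entries to track")

print(f"Raw storage with 3 replicas: {raw_storage(file_size) / GB:.0f} GB")
```

With 4 KB blocks the metadata service would have to track hundreds of thousands of entries for a single 1 GB file, whereas with 64 MB or 128 MB blocks it tracks only a handful. Replication is the price paid for fault tolerance: the same file occupies three times its size in raw storage.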

Hadoop Versions: So far there have been three major versions of Hadoop, as follows.

  • Hadoop 1: This is the first and most basic version of Hadoop. It includes Hadoop Common, the Hadoop Distributed File System (HDFS), and MapReduce.

  • Hadoop 2: The main addition in Hadoop 2 is YARN (Yet Another Resource Negotiator), which takes over resource management and job scheduling from MapReduce. It does so through two daemons: the ResourceManager, which allocates cluster resources, and the NodeManager, which runs on each node and launches and monitors tasks there. Hadoop 2 also introduced NameNode high availability with a standby NameNode.

  • Hadoop 3: This is the most recent major version of Hadoop. Along with the merits of the first two versions, Hadoop 3 further reduces the risk of a single point of failure by supporting more than one standby NameNode. Other advantages, such as erasure coding, support for GPU hardware, and Docker containers, make it superior to the earlier versions of Hadoop.

Advantages of Hadoop:

  • Economically Feasible: It is cheaper to store and process data than in the traditional approach, since the machines used to store the data are only commodity hardware.
  • Easy to Use: The projects and tools provided by Apache Hadoop make it easy to analyze complex data sets.
  • Open Source: Since Hadoop is distributed as open-source software under the Apache License, there is nothing to pay for it: just download it and use it.
  • Fault Tolerance: Since Hadoop stores three copies of each block of data by default, the data is safe even if one copy is lost to a hardware failure. Moreover, because newer versions of Hadoop can run standby NameNodes, the NameNode is no longer a single point of failure.
  • Scalability: Hadoop is highly scalable. To scale the cluster up or down, one only needs to change the number of commodity machines in it.
  • Distributed Processing: HDFS and MapReduce together ensure distributed storage and processing of the data.
  • Locality of Data: This is one of the most attractive features of Hadoop. To process a query over a data set, instead of bringing the data to the local computer, Hadoop sends the query to the nodes that store the data and fetches only the final result from there. This is called data locality.
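
Data locality and fault tolerance can be illustrated together with a toy scheduler. This is a sketch under stated assumptions, not how YARN is implemented: the cluster layout in `block_replicas` and the function `assign_tasks` are hypothetical. Each task is placed on a live node that already holds a replica of its block, so no block has to be copied over the network.

```python
# Hypothetical cluster state: which nodes hold a replica of each block.
block_replicas = {
    "block-1": ["node-a", "node-b", "node-c"],
    "block-2": ["node-b", "node-c", "node-d"],
    "block-3": ["node-a", "node-c", "node-d"],
}


def assign_tasks(replicas, live_nodes):
    """Assign each block's task to the least-loaded live node holding a replica."""
    load = {node: 0 for node in live_nodes}
    assignment = {}
    for block, holders in replicas.items():
        candidates = [node for node in holders if node in load]
        if not candidates:
            raise RuntimeError(f"all replicas of {block} are lost")
        # Break ties by node name so the result is deterministic.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment


all_nodes = {"node-a", "node-b", "node-c", "node-d"}
healthy = assign_tasks(block_replicas, all_nodes)
print(healthy)

# Simulate a hardware failure: node-a goes down, yet every block is still readable.
degraded = assign_tasks(block_replicas, all_nodes - {"node-a"})
print(degraded)
```

Because every block has three replicas, losing one node still leaves at least two nodes able to process each block locally.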

