Distributed Computing, Hadoop and Myths of Network(II)

E-mail Print PDF

Authored By:Harish R

As discussed in my previous article where I had related the need for distributed computing in case of a company as big as google with general stores example and mentioned as to how and why Hadoop came into existence, here I explain  how intriguing is the concept and how it is implemented. Let us start by understanding what MapReduce is first.

The MapReduce Concept

Any problem can be solved in MapReduce in just 2 steps. The first one is called Map and the second one is called Reduce. For better understanding, I shall continue with the general stores example  used in the previous article. For now, you are the owner of a large chain of general stores with people working at various levels.

Let us now define the various teams in your store with a team leader in each. Let there be a logistics team, accounts team, sales team and marketing team. You wish to do an extravagant sales festival for new year. For this, you need to publicise, get goods, manage the crowd and sell goods properly and finally account everything. As the owner, you make a list of things to do like give ads in major newspapers, fetch goods from the warehouse, plan discounts properly and so on. The best thing to do naturally is to call the team leaders and delegate them the tasks who will in turn delegate it to their team members. The team members get the work done and report it to the team leaders who in turn submit the report to you and finally the work is done. This is how MapReduce works. The process by which the work gets delegated from you to the leaders and to the workers is called Map and the reverse from the workers to you is called Reduce. Here it was done on two levels, you to the team leaders and then to the workers and back in the same way. Thus the Map and Reduce functions can be done in multiple levels too. If you go  deeper you will realise that  teams work parallely, hence the task is accomplished in a distributed environment by parallel computing using MapReduce. Let us correlate this directly with a distributed computing environment. When a job is submitted to the master node, it processes it and creates an intermediate value with a key and other computers listening to this master computer picks up the job if it is of the key it works for. In our case, the marketing team leader takes up only the jobs marked as marketing by the owner. Then these computers either do the jobs themselves or do the map operation and pass on the work. Reduce is simply the reverse operation. Such a simple concept MapReduce is ain't it? Now I shall move on to the Hadoop implementation of MapReduce.

The Hadoop Way Hadoop is an opensource Java implementation of MapReduce by Apache. It  is designed to run on a distributed environment processing 'web-scale' data which means several peta bytes of data. Yes, you read it right, its not giga or tera but peta. For starters, assume every character you see here is a byte. One peta byte is made up of 1024*1024*1024*1024 such bytes. Instead of using super computers to do it, Hadoop can be used on a distributed environment to accomplish the same task. Hadoop provides an environment that is robust against hardware errors, but provides no guarantee that no new computer has maliciously entered the network. In other words, in your shop it takes care of all the possible errors an employee may make but provides no guarantee that the employee is a genuine worker for your company and not an outsider who came in maliciously. There are different implementations of Hadoop in several companies which have better security but the basic one is of this nature. Learning how to implement and use Hadoop is totally out of scope of this article. Here I shall just brief you about how data is stored in Hadoop and how the environment can be scaled.

How data is stored in Hadoop

Hadoop is just a software framework that can run on almost any OS. It is not guaranteed that every OS it runs on uses the same file system. Windows may use FAT32 or NTFS while Linux may use ext3 or ext4 or anything else is possible. But this should not be an issue when using these computers in a distributed environment. So they decided to create a common file system called HDFS (Hadoop Distirbuted File System) that runs on all the platforms uniformly. When you install Hadoop, you do not need to partition seperately for this, instead, it is created automatically to the size you need. Assume you have 10 computers ready to be used under Hadoop. So start by installing Hadoop in all these. Now comes the data loading part. You can sit in any computer and keep uploading the files you want. First they get stored in your computer and if your computer is full, then further data is automatically sent to other computers and you can continue loading the files. By default, the data gets replicated and stored so that a computer containing the data when corrupted, data loss is prevented. For the user, it appears like the files are stored on a single location and he can access the files from any computer. The amount of replication is user defined.

Processing of Data Let us assume that we need to calculate the total marks of all student records in the college stored on a Hadoop cluster of 10 computers. When we load the data, they get distributed and each computer has a chunk of the entire data set. For example, records of students with names starting from A to C are on computer 1, D to F are on computer 2. It makes perfect sense to let computer 1 calculate the totals on the student records which it has, that is from A to C rather than something else. So, instead of moving the data from several locations to the process, here we move the process to the data. It is pretty much like letting labourers from a specific locality handle the tasks of that locality like how a courier company does. Its a lot simpler and more efficient that way.

Scaling Imagine that in your store, there is a list of employees. Everyday  morning, this list is checked and only the names of the employees in that list are allowed to work in the store. When a new employee is recruited, his name is added to this list. When someone is fired, his name is removed from the list. So, when the list is checked next morning, the new employee is added or old employee is removed. A similar method is used in Hadoop. A list of computers that are in the network is maintained, everytime a refresh command is issued, this list is checked and necessary changes are made. When a new computer is added, data needs to be loaded to it and similarly when a computer is removed, the data that was stored on it needs to get shifted to some other place. All this are handled automatically by Hadoop. By this method, we can scale the architecture to as many computers as many we want. There are several other issues that need to be handled which I find are more technical. But let’s just say that Hadoop takes care of all that. Thus Hadoop is used to process large amounts of data easily in a distributed environment. I hope that, this article was informative enough to provide the user an introduction to Hadoop. In my next article, I shall explain another popular use of distributed systems called cloud computing which is gradually taking the world with a storm. Do let me know your feedback. Until my next post, take care!

Comments (8)add comment

murthy said:

July 14, 2010
Votes: +0

Tweets that mention Reader's Quotient - Distributed Computing, Hadoop and Myths of Network(II) -- Topsy.com said:

0
...
[...] This post was mentioned on Twitter by Harish. Harish said: "Reader's Quotient - Distributed Computing, Hadoop and Myths of Network(II)" ( http://bit.ly/a4OSKN ) My article! [...]
 
July 14, 2010 | url
Votes: +0

Shishir Kumar Jha said:

0
...
its a mind blowing article with a totally convincing explanation of the latest things getting added in the networking sector that all are user friendly. Hats Off!!!!!
 
July 15, 2010
Votes: +0

Harish said:

July 15, 2010 | url
Votes: +0

forex robot said:

0
...
nice post. thanks.
 
July 16, 2010 | url
Votes: +0

Saurabh Ganguly said:

0
...
That's excellent, good work smilies/smiley.gif
 
July 21, 2010
Votes: +0

Norman Samy said:

0
...
Hi there can I quote some of the content from this entry if I reference you with a link back to your site?
 
August 31, 2010 | url
Votes: +0

Donna Boslet said:

0
...
Damn, cool website. I actually came across this on Bing, and I am stoked I did. I will definately be revisiting here more often. Wish I could add to the conversation and bring a bit more to the table, but am just absorbing as much info as I can at the moment.

Thank You

VAT
 
October 07, 2010 | url
Votes: +0

Write comment

busy