Distributed Computing and Myths Of Network

Image source: http://www.naccq.ac.nz/bacit/0203/2004Caukill_OffPeakGrid_files/2004CaukillFigure1.jpg (acknowledged)

Authored by Harish R

I log in to my computer and connect to a remote server. I type in a few keywords, and immediately that server sets over half a million computers to work in tandem on the problem I have given them, then hands me the output. This is not a fantasy story, nor am I a millionaire who owns such equipment. I am just a normal engineering student from a normal college in India. So how do I get access to such a resource? The truth is that not only I but everyone reading this has access to the very same resource. Wondering how?

Try this. Connect to google.com and search for anything; that is what happens in the background. Over half a million computers in Google's datacenters across the globe work in tandem to serve all its users. Why do we need so many computers? How do they work in tandem?

I hope this article will answer those questions. Welcome to the world of distributed computing.

The need for distributed computing: Let us consider a real-life example. Suppose you run a general store in your neighbourhood. Initially, when the store opens, you do everything yourself: attending to customers, managing the accounts, keeping the inventory. As more customers come in every day, you realise that having one more person to help would be beneficial in many ways. So you hire an assistant, and the increased number of customers is handled well. As the days pass, you recruit more employees and the shop prospers. One day you realise that opening a branch in another locality would increase your sales even more, so you do that. After a few years, you have a number of branches, with employees at all levels working for you and serving your customers.

This is precisely the need for distributed computing. When an organisation like Google starts out, one computer is enough (such as the one Google used initially at Stanford). As the user base grows, that one computer can no longer cope, so several computers have to be used to serve the people who visit the site. Today, Google runs over half a million computers to serve its users. This massive number is a large part of why the Google site loads reliably and almost never goes down. Though so many computers are working together, we are presented with the image of a site running on one single machine. This concept of making several computers work together while presenting the image of a single computer to the user is called distributed computing.

There are several advantages to this approach. More customers can be handled, that is, more services can be provided. Productivity increases because requests are processed in parallel. With more employees attending to customers (in our case, several computers working together), the load gets distributed among them. Most important of all is fault tolerance: if one employee falls sick (in our case, a computer crashes), the other employees compensate and the shop runs on, albeit with a slight degradation in performance.

Cost matters too. Not many organisations that need high-performance computing can fit a supercomputer in their budget. The best they can do is buy a bunch of ordinary computers, connect them together and establish a distributed environment that delivers comparable, or even greater, computing power at a fraction of the cost. By now, the need for distributed computing should be clear. But if distributed computing is so cool, why do supercomputers still exist? Because of the problems that distributed computing has to address. Read on to find out more.
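
The load distribution and fault tolerance described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Cluster` class, the node names and the round-robin policy are all made up for this example, and the "computers" are just entries in a dictionary rather than real machines.

```python
from itertools import cycle


class Cluster:
    """A toy pool of workers that hides individual failures from the caller."""

    def __init__(self, names):
        # Every worker starts out healthy.
        self.alive = {name: True for name in names}
        # Hand out requests to workers in a repeating round-robin order.
        self._order = cycle(names)

    def mark_down(self, name):
        # Simulate a crashed computer (the employee who falls sick).
        self.alive[name] = False

    def dispatch(self, request):
        # Try each worker at most once; skip the ones that are down.
        for _ in range(len(self.alive)):
            name = next(self._order)
            if self.alive[name]:
                return f"{name} handled {request}"
        raise RuntimeError("no workers available")


cluster = Cluster(["node-a", "node-b", "node-c"])
cluster.mark_down("node-b")
results = [cluster.dispatch(f"request-{i}") for i in range(4)]
for line in results:
    print(line)
```

The caller only ever talks to `cluster`, which is the "single computer image" in miniature. With `node-b` down, the remaining two nodes absorb its share of the requests, so each of them does more work: the slight degradation in performance mentioned above.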

The flipside of distributed computing: Let us start with the most prominent issue, security. When several attendants work in the shop, they might overwrite each other's lists and cause havoc. Besides, several computers working for the same organisation mean more entry points for an attacker wishing to harm it. So data must be protected both from the computers on the inside and from attackers on the outside.

The second issue is connectivity between computers. Some mechanism is needed to monitor which computers are online and working and which processes they run, along with a plan to shift and redistribute the load if something goes wrong in the network.

Then there is storage. In our shop, if we increase the number of attendants without increasing the number of cash counters, more customers will be handled initially, but a bottleneck will form at the cash counter. So we need more counters. Similarly, in a distributed computing environment, if data is stored in only one place, all the computers access that one place and a bottleneck is created. So we need more than one place to store our data. A simple solution is to replicate the data and store the copies in several places. When the data in one place is updated, the change must be rippled to the other places as well. Some mechanism is needed to keep these replicas consistent, control the number of replicas, and so on.

Finally, there is geography. Suppose our distributed environment is big and spans several geographic regions, with servers in, say, New York and Chennai. If a request originates in Chennai, it makes a lot of sense to use the servers in Chennai rather than the ones in New York, no matter how good our network is. This needs to be handled as well.

These are the problems faced when implementing a distributed computing architecture, and they are why supercomputers still exist.
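
To make the replication and locality ideas concrete, here is a minimal Python sketch. The `ReplicatedStore` class, the region names and the key are assumptions invented for illustration. Updates are rippled to every replica immediately and in one step, which real systems usually cannot afford to do; keeping replicas in sync across a slow or unreliable network is exactly the hard part described above.

```python
class ReplicatedStore:
    """A toy key-value store that keeps one full copy of the data per region."""

    def __init__(self, regions):
        # One independent dictionary (replica) for each region.
        self.replicas = {region: {} for region in regions}

    def write(self, key, value):
        # Ripple the update to every replica so they all hold the same data.
        for data in self.replicas.values():
            data[key] = value

    def read(self, key, client_region):
        # Prefer the replica in the client's own region;
        # fall back to the first replica if that region has none.
        if client_region in self.replicas:
            region = client_region
        else:
            region = next(iter(self.replicas))
        return region, self.replicas[region][key]


store = ReplicatedStore(["chennai", "new-york"])
store.write("price:rice", 42)
served_by, value = store.read("price:rice", client_region="chennai")
print(served_by, value)
```

A request from Chennai is answered by the Chennai replica, and because every write reaches both replicas, a request from New York would see the same value.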

The genesis of Hadoop: By now you will have realised that most concepts in computing are based on real life, and so these ideas have been worked on for many years. Surprisingly, for a long time no software framework existed that made the work easy. Engineers had to get together, connect the computers and write programs that ran on that particular setup only. Adding computers to the setup or removing them, and getting the programs to keep working, was a complex task. Finally, Google stepped in and created a software framework called Google MapReduce, written in C++. Once this framework is installed, computers can be added or removed without affecting the code. So a program that runs on a distributed computing environment of 10 computers also works on one with 1,000,000 computers with little or no modification. Such freedom triggered the use of distributed computing on a large scale.

However, Google patented its framework and kept it in-house, which stopped other companies from using it. At home, when mom goes on strike and refuses to cook, one of the following has to happen: convince her, stop eating, eat out, or learn to cook. Google could not be convinced, and options two and three are just quick fixes, so these companies had to resort to the last option: create a software framework similar to Google MapReduce and open-source it so that others could benefit from it. The result was Hadoop, an open-source software platform similar to Google MapReduce, written in Java and developed under the Apache Software Foundation (a non-profit organisation, not a company). Thus, Hadoop came into existence.
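
The claim that the same program runs on 10 or 1,000,000 computers can be illustrated with a small word-count sketch in Python. This is not Google's or Hadoop's API: the function names are made up for this example, and threads on one machine stand in for separate computers. The point is only that the map and reduce functions never mention how many workers exist.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_phase(chunk):
    # Count the words in one chunk of input, independently of all other chunks.
    return Counter(chunk.split())


def reduce_phase(left, right):
    # Merge two partial counts into one.
    return left + right


def word_count(chunks, workers):
    # Only this driver knows the number of workers;
    # map_phase and reduce_phase stay the same whatever it is.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_phase, chunks))
    return reduce(reduce_phase, partial_counts, Counter())


chunks = ["to be or not to be", "to do is to be"]
small = word_count(chunks, workers=1)
large = word_count(chunks, workers=4)
print(small == large)
```

Changing `workers` changes how the work is spread out but not the result, which is the freedom the framework gives the programmer.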

Concluding Remarks: My aim in this article was to give an introductory idea of distributed computing. In my next article, I shall extend this and explain Google MapReduce and Hadoop. Do comment with your thoughts. Until my next post, take care!

Comments

riju said:

Thank you for all this informative article. Never thought so deeply. Gud job Keep it up. God bless

June 06, 2010

Robin said:

Nice title selection and u gave a great explanation. Cheers.

- Robin

June 06, 2010

Anand said:

Nice explanation on DS using google !! sticking here to know more on this harish ...

June 06, 2010

Abhilash Owk said:

It is a remarkable post. Very informative. Nice work Harish. Keep it up

June 13, 2010
