Clouds of Hadoop

How Map Reduce Changed the World

We have been living in the age of Hadoop. I know this for certain because Yahoo! has recently announced its commitment to making Hadoop a commercial platform. Yahoo! has a nearly perfect record of picking winning technologies that have already crossed the finish line.

My thesis for this post is that changes in the web are initiated by changes in the technology that propels the medium. Just as Marshall McLuhan asserted that “the medium is the message”, I claim that the Google File System (GFS), later brought into the open source domain as Hadoop, was the medium for Web 2.0 and that what we call “social media” is the message.

To be more bold, the Cloud is the medium and the Web, the message. In McLuhan’s universe it’s not the content that is passed through the medium that has impact but how content passes through the medium by and for the “masses” that is the true impact. In short, it is technology adopted by the masses that is the medium. This provides an apt understanding of Web 2.0 and how we are currently as a society and culture being moved by the message.

Likewise, if we are to understand what will become Web 3.0, we need to start by understanding what is new in cloud computing. This task will not be easy because 1) much of what is claimed as new is actually very old but has been given new names, and 2) some of what is new will have little effect while other proposals will have a major effect – what pundits call a disruptive effect – on how we interact with and use the Web. That is to say, though technology is at the core of every medium, not every technology will create a medium.

Irreverent History of the Cloud continued

To continue our history of the Cloud from Send in the Clouds: As we left the last millennium, we were preoccupied with whether or not all of the clocks in the Cloud would properly shift in 2000 and not bring world disaster and the end of life as we know it. Also the dot-com bubble bust was motivating a lot of reflection as to where to go next.

There was still the burning question of how to find out what is actually in the Cloud. There are thousands if not millions of web surfers aggregating their “best” links which are referenced and aggregated by others and so on. So the web thing is working quite well!

AltaVista and others have pioneered algorithms to automatically scan web pages and index content. Though diligently applying these algorithms will result in most of the important content being indexed in ten years, what happens when content is changed or becomes obsolete in the meantime? The task takes longer as resources are re-allocated to continually monitor and weed out broken links. Like Zeno’s frog that jumps half the distance to the finish line with each hop, the finish line will never be crossed.

The problem was that technology was not scaling to the need. As a couple of students at Stanford worked out, the need is to scan the content faster than new content can be generated. So the problem was not indexing content but getting through all the content to be able to index it. A platform capable of scanning the web would be capable of more than just search. It could rule the world! (At least bring down the world of Microsoft).

So like Pinky and the Brain, they started out their mission to take over the world by cleverly disguising their plan as a search engine. In reality it was the engine underneath given a rather non-threatening uncharacteristic non-clever technology name – Google File System or GFS. How controversial can a file system be? Right?

File System?

Content storage is one of the problems that needs to be addressed. It is difficult to imagine a file system that holds the entire content of the Web. Actually this is not necessary since once the HTML page has been processed, what remains in indices and links should be less than the content itself, which can be conveniently referenced by the URL. Yea, the Cloud as distributed storage!

However if your objective is to process all the content on the Web and then have capacity to process changes faster than the content changes, you are going to have a lot of cached content in the processing data stream plus you may want to archive content for future reference to check if content has changed. So the problem is one of processing more than storage per se. More exactly, to be able to process files freely without blocking on access to the storage.

The technical term is serializing file access. Imagine that storage in a supermarket is the cash register – prices are recorded and cash goes in and out. Each transaction must be performed and completed to the end without interruption by another customer. So as a result, customers are queued in a line to wait their turn. The obvious solution is to increase the number of registers to handle the peaks and reduce queue lines to zero. Of course during non-peak periods, one has to deal with the surplus of cashiers, but in this case, during the initial phase of processing the entire web, the peak is continual. As more “cash registers with cashiers” are added the more content that can be processed.

So what Larry Page and Sergey Brin had to figure out was what is stored and what is queued in collecting and indexing web content, and then work out a way to store content at unprecedented levels in order to index the entire web. The result is GFS, and to accomplish this they had to take an evolutionary leap from how things were done before.

Brief Parable on Scale

Often scale – meeting the processing demand of the problem – is confused with efficiency – making computations quicker with fewer instructions. One of Google’s preferred languages is Python, which is not an efficient language. It is actually an interpreted scripting language. So what explains Google’s total disregard of the conventional wisdom that for an application to scale it must be written in C/C++?

Imagine two programmers tasked with implementing the same algorithm. One develops the most efficient implementation and the other the most scalable. At the end, each tests their implementation on a single computer. Comparing results, one finds that the efficient implementation generates 10 times more throughput than the scalable implementation. It will take 10 computers for the scalable implementation to match the throughput of the efficient program. Here we are assuming that "scalable" means we can add computers to the task without incurring additional overhead, which is called linear scalability.

The point is that though I need 10 computers to match the performance of the efficient program, with 20 computers I can double that capacity, with 100 computers I can increase it by an order of magnitude, and theoretically I can grow without limit. If I eventually make the scalable implementation more efficient, so that the most efficient solution is only 5 times more efficient, I have in one change doubled my capacity and productivity.
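The arithmetic of the parable can be sketched in a few lines of Python. This is only an illustration under the ideal assumption of linear scalability; the throughput figures and the function names `cluster_throughput` and `machines_to_match` are made up for this example.

```python
import math

# Illustrative single-machine throughputs (units of work per second).
EFFICIENT = 10.0  # the hand-tuned implementation
SCALABLE = 1.0    # ten times slower on one machine, but can be spread out


def cluster_throughput(per_machine: float, machines: int) -> float:
    """Total throughput assuming ideal linear scalability (no overhead)."""
    return per_machine * machines


def machines_to_match(target: float, per_machine: float) -> int:
    """Smallest number of machines whose combined throughput reaches target."""
    return math.ceil(target / per_machine)


# Ten machines are needed just to break even with the efficient program.
break_even = machines_to_match(EFFICIENT, SCALABLE)

# Twenty machines double that capacity, one hundred give ten times as much.
doubled = cluster_throughput(SCALABLE, 20)
tenfold = cluster_throughput(SCALABLE, 100)

# Making the scalable code twice as fast doubles the whole cluster at once.
improved = cluster_throughput(2 * SCALABLE, 10)
```

The last line is the interesting one: any efficiency gained later is multiplied by every machine already in the cluster.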

This is called Sullivan’s Law, which is the dual side of his contemporary, Amdahl’s Law of Computing. Whereas Amdahl was concerned with efficient use of computing resources, Sullivan was concerned with the scalable allocation of resources to meet demand. As long as computing resources were expensive and the demands less than the need, Amdahl ruled. Now that we can have millions of processors, Sullivan’s Law simply stated says everything that Amdahl was concerned about doesn’t matter.
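For reference, Amdahl's Law in its usual textbook form can be written as a one-line function. This is a sketch of the standard formula, not something taken from GFS; the name `amdahl_speedup` and the sample fractions are assumptions for illustration.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Amdahl's Law: speedup = 1 / (serial + parallel / processors)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# With 10% of the work stuck in a serial queue, 100 processors
# cannot deliver even a tenfold speedup.
limited = amdahl_speedup(0.9, 100)

# With no serial part at all, speedup tracks the processor count,
# which is the linear scalability assumed in the parable above.
linear = amdahl_speedup(1.0, 100)
```

Seen this way, the scalable approach is a matter of driving the serial fraction toward zero so that adding machines keeps paying off.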

So the GFS is not concerned with efficient storage but linear scalable storage.

Scalability Before GFS

Storage as it turns out is a determining factor for scalability in cloud computing. In the ancient days of Unix computing, one wrote programs that could be invoked by a single command line that loaded an input and generated an output to the file system. One then could direct the output from one command as input to another command using the pipe symbol (|) forming a pipeline process.

The pipeline architecture is an efficient implementation when files are processed in bulk, read from storage at the start of the pipeline and written back to storage at the end. Data in and out of storage is perfectly coordinated and no process waits for access. If I now run more than one pipeline on separate threads accessing the same storage, the requests are likely to collide and threads will queue to wait for access to the file system.

Because of queuing, the time it takes to complete all of the tasks becomes longer than if the separate pipelines could be executed in parallel without shared storage and queuing. Serialization breaks the processing advantage of concurrency. This is the processing governor that was placed on cloud computing 1.0.
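The cost of that queue can be seen in a toy model. The sketch below assumes every pipeline arrives at the same moment, performs one storage access and then computes; the function names and the timings are hypothetical, and real systems are far messier.

```python
def makespan_shared(n_pipelines: int, io_time: float, compute_time: float) -> float:
    """Finish time when all pipelines queue for one shared store.

    Assumes n_pipelines >= 1 and that storage serves one request at a time.
    """
    storage_free_at = 0.0
    finish_times = []
    for _ in range(n_pipelines):
        start = storage_free_at               # wait until storage is free
        storage_free_at = start + io_time     # hold storage for this access
        finish_times.append(storage_free_at + compute_time)
    return max(finish_times)


def makespan_independent(n_pipelines: int, io_time: float, compute_time: float) -> float:
    """Finish time when every pipeline has its own storage (n_pipelines >= 1)."""
    return io_time + compute_time


# Eight pipelines, one time unit of storage access, two of computing.
shared = makespan_shared(8, 1.0, 2.0)
independent = makespan_independent(8, 1.0, 2.0)
```

In the shared case the finish time grows with every pipeline added, while in the independent case it stays flat no matter how many pipelines run.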

In partial response to this limitation, Service Oriented Architecture was proposed. For SOA, the input is included in the request and the service provides a response that does not require accessing file storage. This works fine for the subset of services that don’t require additional data but what if one wants to do real things such as report search results or manage customer accounts or retrieve maps and directions as part of the service?

To handle real world services, the concept of orthogonal storage such as Hibernate or GMDS allows necessary content to be cached off storage and service processes to complete independent of storage queues. This simply transfers the problem to the supporting infrastructure. In the end one still has stores of files that need to be processed into databases or data warehouses.

The Evolutionary Leap in Scalability

Much of what occurs on the web is independent and unrelated – visitors work independently, web sites are independent, and in many ways web pages are independent and related to each other only by their links. There is an inherent concurrency that can be leveraged when processing data items that are unrelated to each other.

Imagine sorting or mapping data that is related by visitor or page or keyword into different independent bins and then processing all the bins asynchronously and in parallel on different servers. Now collect the much condensed or reduced results in another place to map again and reduce in another independent dimension. So what you have is the GFS discovering the Gantt Chart used in project management, only it’s now called Map Reduce.
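The map, bin and reduce steps described above can be sketched as a tiny keyword count in Python. This is a single-machine toy, not Google's or Hadoop's implementation: the function names and sample pages are invented, the map step runs serially for brevity, and a thread pool stands in for the separate servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(records):
    """Emit a (keyword, 1) pair for every word on every page."""
    for _page, text in records:
        for word in text.lower().split():
            yield word, 1


def shuffle(pairs):
    """Sort the emitted pairs into independent bins, one per keyword."""
    bins = defaultdict(list)
    for key, value in pairs:
        bins[key].append(value)
    return bins


def reduce_bin(item):
    """Condense one bin to a single total; needs nothing from other bins."""
    key, values = item
    return key, sum(values)


def map_reduce(records, workers=4):
    bins = shuffle(map_phase(records))
    # Each bin is independent, so the bins can be reduced in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(reduce_bin, bins.items()))


# Hypothetical sample data: (page, text) pairs.
pages = [
    ("a.html", "cloud hadoop cloud"),
    ("b.html", "hadoop elephant"),
]
counts = map_reduce(pages)
```

The output of one round, here `counts`, can itself be fed into another map step keyed on a different dimension, which is the "map again and reduce" chaining mentioned above.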

Later Doug Cutting would develop an open source version of Map Reduce and follow the traditional approach of naming the program so that one has no idea what the program is about. He named it after his son’s stuffed elephant – Hadoop. Currently Hadoop or variations developed from Hadoop have become essential to the backend data processing of most large online enterprises. It has also been a boon to analytics when processing the beacon data from millions of browsers into analytic data warehouses.

So that’s it! That is the new source of evolutionary energy that has propelled, directly or indirectly, the growth of Web 2.0. It started with Google being able to become a search engine that could process the entire world-wide web, applying whatever algorithms it wanted, often using brute-force AI algorithms without having to consider optimization. This was a capability unimagined in AI up to that time.

Why it has taken so long for technology to utilize a metaphor that is obvious to anyone in business and manufacturing is another story. One that I will look at in the next post.

What Success Looks Like

Clearly the platform initiated by Google is more capable than just for search. Google has applied it to maps – if you are going to take over the world you need to map it. Right? But it includes Twitter and Facebook with an ability to manage networks of thousands or even millions of followers. It also includes the expectation that a video taken from a cell phone can go “viral” or a revolution can become viral yet evade a repressive government. These are expectations that clearly were not realistic 10 years ago as Web 2.0 began.

So now we have the expectation of instant moments of fame and of making an impact over the entire world, whether it’s starting or supporting a keyword to trend on Twitter or establishing 10,000 “friends” on Facebook or promoting your own channel on YouTube. It is this expectation to reach and instantly affect others that is the message of the Cloud as the medium. The fact that the technology has been around and used for decades is not important. What is important and new is that the right connection of need and solution was selected to become a medium and not just another academic experiment.

Makes one wonder what else can be pulled from the hat! I will look at this in the next post as well as how to identify and assess disruptive technologies. For homework consider Marshall McLuhan’s Four Laws of Media and how they apply to this discussion. Please provide your thoughts in the comments.
