Google: Pinky and the Brain
World Domination Plan 137:
Phase 3: Distract before Conquering
While everyone is building out their Hadoop farms[i] and adopting MapReduce to establish elastic computing[ii], Google is busy building (and patenting) some strange new technologies with names from a Dick Tracy line-up: Gears, Big Table and Chubby locks. It is also busily building the biggest mash-up heretofore envisioned, called Google Wave, which lets individuals across the web message and collaborate in real time … at a keystroke level[iii]!
With MapReduce, Google was able to process the entire content of the web[iv]. However, MapReduce is a batch technology that gains its concurrency advantage by mapping resources to perform dedicated parallel tasks on batches of data[v]. In parallel computing parlance one might call this Same Task Multiple Data (STMD), where not only is the same task performed on multiple computers, but within each computer the task processes multiple instances (rows) of data in batch. The dual architecture, Multiple Tasks Same Data (MTSD), is also conceivable with MapReduce, but it is a rarer parallel computing strategy[vi].
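For the curious, the batch STMD style can be sketched in a few lines of Python. This is a toy word count, not Google's implementation, and the function names are mine:

```python
from collections import defaultdict

def map_phase(documents):
    """Each mapper emits (word, 1) pairs for its batch of documents."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reducers sum the counts for each key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the web the web", "crawl the web"]
print(reduce_phase(map_phase(docs)))  # {'the': 3, 'web': 3, 'crawl': 1}
```

The same task runs over every batch of rows; in a real deployment the mappers and reducers are spread across machines, but the whole batch must arrive before the answer does.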
So what does it take to move beyond this batch, non-real-time limitation and be able to process any task, any time, anywhere, at massive demand scale? The answer is Massive Distributed Computing (MDC). This requires the transition from Communicating Sequential Processes (CSP)[vii] to Communicating Concurrent Processes (CCP)[viii]. Don't let the big words bother you; they will all be explained in a moment. But first, what is concurrency?
Sequence in a Concurrent World
Remember that in a previous post[ix] we likened demands on storage to those on a cash register in a supermarket. By separating storage into the equivalent of aisles, we gain a concurrent processing advantage that reduces the queues of customers, the analog of processes accessing storage. The assumption is that any cashier with a register can service any customer. Once the customer is at the register, performing checkout, the cashier should complete the checkout, without interruption, before moving on to the next customer.
The technical term is that checkout is an atomic transaction[x]. If the checkout were interrupted so that the cashier had to check out a more important customer (called pre-emption), then besides angering the pre-empted customer, the cashier must roll back to the state prior to checkout, as though the customer had never started the process (rollback on pre-emption). So when the peeved customer leaves the store in a tizzy, neither he nor the store has lost any money in the transaction. A similar thing would happen if, at the end, the customer discovered she had no money to complete the transaction (rollback on exception).
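For the programmers in the audience, the checkout analogy can be sketched as a toy Python transaction that either commits in full or rolls back to the prior state. The class and names are mine, not any real database API:

```python
class Register:
    """Toy atomic checkout: commit fully, or roll back to the prior state."""
    def __init__(self, balance=0):
        self.balance = balance

    def checkout(self, items, funds):
        total = sum(items)
        snapshot = self.balance        # remember the state before the transaction
        self.balance += total          # tentatively apply the transaction
        if funds < total:              # rollback on exception: customer can't pay
            self.balance = snapshot
            raise ValueError("insufficient funds; transaction rolled back")
        return total

register = Register()
register.checkout([3, 4], funds=10)    # commits; balance is now 7
try:
    register.checkout([100], funds=5)  # rolls back; balance stays 7
except ValueError:
    pass
```

Either the store ends up with the money and the customer with the groceries, or everything is exactly as it was: no intermediate states survive.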
Though pre-emption and rollback are relatively rare in a supermarket, they are frequent occurrences in data processing, where a large number of processes share resources such as storage, or state information such as current storage capacity. To handle these cases, each process must receive a lock, similar to taking a number at the deli to get service[xi]. The customer is free to do something else but usually just waits to be called.
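The take-a-number scheme itself has a classic form, sometimes called a ticket lock. This is a toy Python sketch of the idea, mine and not any production protocol:

```python
import itertools
import threading

class TicketLock:
    """Deli-counter locking: take a number, then wait until it is called."""
    def __init__(self):
        self._next_ticket = itertools.count()   # the roll of numbered tickets
        self._now_serving = 0
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            my_ticket = next(self._next_ticket)  # take a number
            while self._now_serving != my_ticket:
                self._cond.wait()                # wait to be called
            return my_ticket

    def release(self):
        with self._cond:
            self._now_serving += 1               # call the next number
            self._cond.notify_all()

lock = TicketLock()
ticket = lock.acquire()   # first caller is served immediately (ticket 0)
lock.release()            # now serving moves on to ticket 1
```

Tickets are handed out in order, so access to the shared resource is strictly sequenced, which is exactly the property (and the cost) discussed below.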
Waiting on the Internet Highway
So our first clue that Google is moving to massive distributed computing is the Chubby lock service[xii]. Chubby is a mechanism that allows a process anywhere in the world to obtain a lock (take a number) on a resource anywhere else in the network[xiii].
In a typical database management system the shared data elements can be very fine grained (down to individual data elements). For a distributed system, the overhead of obtaining locks on individual elements is not practical, so the blocks of data are more coarse grained, or chunkier: hence locks for sequencing access to chubby chunks of data.
Not the obese chunks represented in a batch file processed by MapReduce, but pleasantly zaftig: perhaps an individual row of data from a bulk file. Furthermore, we don't have to wait for an accumulation of chubby chunks to make up one obese batch in order to get a concurrency advantage and increase throughput. We can process the data chunks as they arrive – asynchronously!
The discussion so far is typical of the concerns covered under Communicating Sequential Processes (CSP). The communication comes from the mechanism that notifies a process when it can safely proceed in its access of a shared resource; sequential (or serialized) access to the resource is the consequence of the communication. This also defines the communication between processes: the initiating process rendezvouses with the resource process and the two communicate back and forth (a protocol) until the initiating process is done with the resource and relinquishes the lock.
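A toy CSP-style rendezvous can be sketched with Python queues standing in for channels. The resource process below owns the store and serves one request at a time, which is exactly the serialized access described above; the protocol and names are mine, purely illustrative:

```python
import queue
import threading

requests = queue.Queue()   # channel into the resource process

def resource_process(store):
    """Owns the shared resource; serves one request at a time (serialized access)."""
    while True:
        op, key, value, reply = requests.get()
        if op == "stop":
            break
        if op == "put":
            store[key] = value
        reply.put(store.get(key))   # second half of the rendezvous protocol

def client_put(key, value):
    reply = queue.Queue()           # private channel for the reply
    requests.put(("put", key, value, reply))
    return reply.get()              # block (rendezvous) until the resource answers

store = {}
server = threading.Thread(target=resource_process, args=(store,))
server.start()
result = client_put("chunk-1", "zaftig")
requests.put(("stop", None, None, None))
server.join()
```

Note that the client blocks until the resource process replies: communication enforces the sequencing.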
Though almost all books on concurrent programming begin with CSP, the result is not parallel execution but serial execution of the tasks that share a common resource. As we found out, sequential processing reduces the concurrency advantage relative to performing processes in parallel without locks. To reduce sequencing, increase concurrent leverage, and likewise increase throughput, one needs to reduce the need for locks as well as the occurrences of pre-emption and rollback. This is covered by the discipline of Communicating Concurrent Processes (CCP), which is based upon a different model of concurrency[xiv].
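One way to reduce the need for locks is the actor style (see Agha and Hewitt in the references): each piece of state is owned by exactly one process, and callers send asynchronous messages instead of taking locks. A toy Python sketch of the idea, mine and not Google's:

```python
import queue
import threading

class CounterActor:
    """A toy actor: it owns its counter outright, so no lock is needed.
    Senders post asynchronous messages and never block on the resource."""
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg == "stop":
                break
            self.count += msg       # state is only ever touched by the actor's thread

    def send(self, msg):
        self.mailbox.put(msg)       # fire and forget: no rendezvous, no lock

actor = CounterActor()
for _ in range(100):
    actor.send(1)
actor.send("stop")
actor._thread.join()
```

Because no sender ever waits on a lock, there is nothing to pre-empt and nothing to roll back; the sequencing collapses into the actor's mailbox.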
Rest Stops on the Internet Highway
Relational database systems such as Oracle or MySQL are archetypes of CSP systems; even with grid implementations they can quickly extinguish any concurrency advantage. They implement the Relational Model for data independence[xv]. Their objective is to break chunks of data into their basic elements such that a value is stored in one place only and referenced everywhere else through unique identifiers called reference keys or indices. In this way a change in a data value propagates consistently to all references of that value.
From a concurrent processing view this is analogous to running a taxi cab company where, every time a taxi comes in, it is broken down into its component parts and stored in racks (tables). Then, when a cab is sent out, it must be reassembled from the parts available in the racks. In most cases this is not an attractive business model[xvi]. It is tempting to apply this model to software or data, where, for example, newer parts can be swapped in or the same parts used for different purposes (constructing more buses than taxis) to meet demand. But even in these cases it is not appropriate to insist that the objects (the chunks of data constructed from data elements) be broken down into parts for storage every time, especially if an object is just resting in cache, ready to be called upon soon to perform more work. Another strategy has to be considered.
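In code, the taxi-parts model is a toy picture of normalization. This Python sketch (names mine) stores every value in one rack and rebuilds the cab on every use:

```python
# A normalized store keeps each part's value in exactly one place (the "racks")
parts = {"engine-7": "V8", "wheel-3": "alloy"}

# A cab row holds only reference keys, never the values themselves
cab_42 = {"engine": "engine-7", "wheels": "wheel-3"}

def reassemble(cab):
    """Sending the cab out: rebuild the whole object from parts on every use."""
    return {slot: parts[key] for slot, key in cab.items()}

# Change the value in its one place and every reference sees it consistently
parts["engine-7"] = "V8 turbo"
print(reassemble(cab_42))  # {'engine': 'V8 turbo', 'wheels': 'alloy'}
```

The consistency payoff is real, but so is the cost: every read pays for reassembly, even when the assembled object was sitting in cache a moment ago.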
So the second clue that Google has developed an MDC system is Big Table[xvii]. Big Table deals with chubby chunks of structured data that can be anything (but typically represent an object's state), held in a flat table distributed much as files are in GFS[xviii]. It is a concurrency enabler in CCP[xix]. At least, that is the potential. Realizing that potential requires an entirely different method of defining and developing apps. The company that gets this right rules the world!
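To see why sharding (described in the endnotes) buys parallelism, here is a toy Python sketch that routes row keys to shards by hash. Big Table itself partitions rows by key ranges into tablets rather than by hash; the scheme and names here are mine, just to show the routing idea:

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]   # stand-ins for distributed tablets

def shard_for(row_key):
    """Route a row key to one shard, so transactions against different
    shards can run in parallel rather than queuing on a single lock."""
    digest = hashlib.sha1(row_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(row_key, value):
    shards[shard_for(row_key)][row_key] = value

def get(row_key):
    return shards[shard_for(row_key)][row_key]

put("com.example/page1", "contents of page1")
put("com.example/page2", "contents of page2")
```

Two transactions that hash to different shards never contend, which is why an efficient access strategy tries to avoid hitting the same shard twice in one query.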
Agha, G., & Hewitt, C. (1985). Concurrent programming using actors: Exploiting large-scale parallelism. From SpringerLink: http://www.springerlink.com/content/y1wl7q3342006720/
Announcing Amazon Elastic MapReduce. (2009, 2-April). From Amazon News: http://aws.amazon.com/about-aws/whats-new/2009/04/02/announcing-amazon-elastic-mapreduce/
Burrows, M. (2006, 7-November). The Chubby Lock Service for Loosely-Coupled Distributed Systems. From Google Research Publications: http://labs.google.com/papers/chubby.html
Collmeyer, A. J. (1972). Implications of data independence on the architecture of database management systems. From ACM: http://doi.acm.org/10.1145/800295.811495
Dean, J., & Ghemawat, S. (2004, 15-December). MapReduce: Simplified Data Processing on Large Clusters. From Google Research Publications: http://labs.google.com/papers/mapreduce.html
Chang, F., Dean, J., Ghemawat, S., et al. (2006, November). Bigtable: A Distributed Storage System for Structured Data. From Google Research Publications: http://labs.google.com/papers/bigtable.html
Hewitt, C., & Baker, H. (1977, 10-May). Laws for Communicating Parallel Processes. From MIT: http://mit.dspace.org/bitstream/handle/1721.1/41962/AI_WP_134A.pdf?sequence=1
Hoare, C. (1978, August). Communicating sequential processes. From ACM: http://portal.acm.org/citation.cfm?doid=359576.359585
Kraft, T. (2010, 02-July). Clouds of Hadoop: How Map Reduce Changed the World. From Mind Before You Mine: https://mindbeforeyoumine.com/2010/07/02/clouds-of-hadoop/
Kraft, T. (2010, 25-June). Send In the Clouds: Old Metaphor Gets New Life. From Mind Before You Mine: https://mindbeforeyoumine.com/2010/06/25/send-in-the-clouds/
Sullivan, H., & Bashkow, T. R. (1977). A large scale, homogeneous, fully distributed parallel machine, I. From ACM: http://doi.acm.org/10.1145/800255.810659
Wikipedia. (n.d.). Paxos Algorithm. From Wikipedia: http://en.wikipedia.org/wiki/Paxos_algorithm
Yahoo! Advances Hadoop From Science to the World's Largest Internet Deployment to Mainstream. (2010, 29-June). From Yahoo! News: http://labs.yahoo.com/news/426
[iii] Yes, I know that Google has recently shelved the project. Like Pinky and the Brain, not all plans are successful, and in this case the reason for failure is as illustrative and nuanced as a PatB episode. This will be discussed more later.
[vi] Besides, this leads down the slippery slope to Multiple Tasks Multiple Data (MTMD) or, even worse, Different Tasks Different Data (DTDD), which are beyond the capabilities of MapReduce and lead to Massive Distributed Computing (MDC).
[x] The transaction completes without interruption or rolls back to the initial state. No intermediate states are allowed; hence the transaction is elemental, or atomic.
[xi] There are several schemes for coordinating access to a shared object (mutexes, monitors, and the like), but they all require sequencing the access to the shared resource and holding a key or token while accessing it, which must be passed back at the end of the transaction.
[xiii] From the paper, Chubby seems to be confined primarily to a cluster of approximately 10,000 servers, but these clusters can communicate within and across centers, so theoretically Chubby as a name service can communicate outside of the clusters.
[xvi] Dr. Sullivan (1991), private communication with the author.
[xviii] Google's term for breaking up a monolithic database into multiple pieces is sharding; each piece distributed in the file system is called a shard.
[xix] Sharding allows database transactions to be executed in parallel on different shards at the same time. So part of an efficient parallel database access strategy is to attempt to reduce multiple hits on the same shard during a query.