MapReduce Stats and Cloud Computing
Google has just published an updated version of its MapReduce paper, which is available from the ACM. It gives new information and statistics about the usage of MapReduce, the software framework that supports "parallel computations over large (...) data sets on unreliable clusters of computers". MapReduce was introduced in 2003 and is used for indexing the web and computing PageRank, as well as for processing geographic information in Google Maps, clustering news articles, machine translation, Google Trends, etc.
In September 2007, MapReduce jobs read 403,152 TB (terabytes) of input data. The average number of machines allocated to a MapReduce job was 394, and the average completion time was about six and a half minutes (395 seconds).
Below are some more statistics on MapReduce usage, which show the enormous amount of data processed by Google.
| | Aug. '04 | Mar. '06 | Sep. '07 |
| Number of jobs (1000s) | 29 | 171 | 2,217 |
| Avg. completion time (secs) | 634 | 874 | 395 |
| Machine years used | 217 | 2,002 | 11,081 |
| map input data (TB) | 3,288 | 52,254 | 403,152 |
| map output data (TB) | 758 | 6,743 | 34,774 |
| reduce output data (TB) | 193 | 2,970 | 14,018 |
| Avg. machines per job | 157 | 268 | 394 |
| Unique implementations | | | |
| map | 395 | 1,958 | 4,083 |
| reduce | 269 | 1,208 | 2,418 |
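To put the raw totals in perspective, here is a small sketch that derives two per-job figures from the table: the average map input per job, and the share of the map input that survives as map output. The numbers are copied from the table above; the helper names are my own, and I am assuming decimal terabytes (1 TB = 1000 GB), which the paper does not spell out.

```python
# Figures copied from the table; columns are Aug. '04, Mar. '06, Sep. '07.
periods = ["Aug. '04", "Mar. '06", "Sep. '07"]
jobs_thousands = [29, 171, 2217]
map_input_tb = [3288, 52254, 403152]
map_output_tb = [758, 6743, 34774]

GB_PER_TB = 1000  # assumption: decimal units


def input_gb_per_job(input_tb, jobs_k):
    """Average map input per job in GB (jobs are given in thousands)."""
    return input_tb * GB_PER_TB / (jobs_k * 1000)


def output_ratio(output_tb, input_tb):
    """Fraction of the map input that is emitted as map output."""
    return output_tb / input_tb


derived = {}
for period, jobs_k, in_tb, out_tb in zip(
    periods, jobs_thousands, map_input_tb, map_output_tb
):
    gb = input_gb_per_job(in_tb, jobs_k)
    ratio = output_ratio(out_tb, in_tb)
    derived[period] = (gb, ratio)
    print(f"{period}: {gb:.0f} GB map input per job, "
          f"map output is {ratio:.1%} of map input")
```

By this rough calculation, an average job read on the order of 100 to 300 GB in each of the three periods, while the share of input emitted as map output fell from roughly 23% to roughly 9%. These are averages over very different kinds of jobs, so they say little about any individual job.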
I would like to add that other implementations of MapReduce are available. One is Hadoop, an open source implementation of MapReduce, and the second is Dryad from Microsoft, which is still an ongoing research project at Microsoft Research.
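For readers who have not seen the programming model itself, here is a toy, single-process sketch of the map and reduce steps, using the classic word count example. It is not how Google's framework or Hadoop is implemented: there is no distribution, partitioning or fault tolerance, and the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for this illustration.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map step: emit a (word, 1) pair for every word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Run map over all inputs, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


# Hypothetical input documents, just for illustration.
docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the map calls run in parallel on many machines and the grouping step becomes a distributed shuffle, but the user still only writes the two small functions.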
Google's computing infrastructure is one of its most important competitive advantages. It consists of large computing clusters built from PCs with two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link (assuming the configuration details that were revealed in 2004). This configuration enables Google to add new computing power to its grid quickly. This type of computing is called cloud computing, and it seems to be the next big switch after the PC revolution. So it's not surprising that other companies such as Amazon, Salesforce and Microsoft have similar ventures under way. Microsoft recently announced plans to build a big datacenter in Siberia, and Amazon is actively promoting its S3 service.
Nicholas Carr has written a short article in the Financial Times about cloud computing and the shift to utility computing. He draws a comparison to power generation and the energy industry.
"Until the end of the 19th century, businesses had to run their own power-generating facilities, producing all the energy required to run their machinery.
[...] Like data-processing today, power generation was assumed to be an intrinsic part of doing business. But with the invention of the alternating-current electric grid at the turn of the century, that assumption was overturned. Suddenly, manufacturers did not have to be in the power-generation business."
I'm not completely sure this comparison holds, as information in the form of intellectual property and customer data seems to be more sensitive and more crucial to business success than electrical power. But I'm sure that there will be a plethora of applications and businesses that will utilize this kind of computing.











