
Google Sorts 1TB of Data in Just 68 Seconds, 1PB in Six Hours

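Processing the world's information efficiently has long been an obsession at Google, and MapReduce, the C++ programming tool Google developed in-house, plays a key role in that effort. It runs parallel computations over very large datasets (1TB and up) as many simultaneous processes, which makes it a good fit for Google's enormous daily computing workload.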

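Google has proudly announced that it can sort 1TB of data in just 68 seconds. The data consisted of uncompressed text files stored in the Google File System across 1,000 computers. The previous 1TB sorting record, set on 910 computers, was 209 seconds, roughly one third of the current speed.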

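Of course, data volumes in the age of information explosion go far beyond the terabyte scale; petabytes (PB), a thousand times larger, are increasingly common. In January 2008, all MapReduce jobs at Google together processed an average of 20PB of data per day, equivalent to 240 times the web data archived by the U.S. Library of Congress as of May 2008.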

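So how long does Google MapReduce take to sort 1PB of data on 4,000 computers? The answer is six hours and two minutes. Google says it is not aware of any other sorting experiment at this scale.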

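Google also revealed that the 1PB of sorted data was written to 48,000 hard drives (without filling them, of course). Given the duration of the test, the number of drives involved, and the expected lifetime of hard drives, at least one drive failed during every run. For that reason, the Google File System was asked to write three copies of each file to three different drives.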

Link: http://www.cioage.com/art/200901/77364.htm

At Google we are fanatical about organizing the world's information. As a result, we spend a lot of time finding better ways to sort information using MapReduce, a key component of our software infrastructure that allows us to run multiple processes simultaneously. MapReduce is a perfect solution for many of the computations we run daily, due in large part to its simplicity, applicability to a wide range of real-world computing tasks, and natural translation to highly scalable distributed implementations that harness the power of thousands of computers.

In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark. Standardized experiments help us understand and compare the benefits of various technologies and also add a competitive spirit. You can think of it as an Olympic event for computations. By pushing the boundaries of these types of programs, we learn about the limitations of current technologies as well as the lessons useful in designing next generation computing platforms. This, in turn, should help everyone have faster access to higher-quality information.

We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers.
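As a rough sanity check on these figures, the sketch below converts the two quoted 1TB results into aggregate and per-machine throughput. It assumes decimal units (1TB = 10^12 bytes, which is what 10 billion 100-byte records add up to); the function name `throughput` is made up for illustration. The two runs used different software and hardware, so the per-machine comparison is only indicative.

```python
# Back-of-the-envelope throughput for the two 1TB sort results quoted above.
# ASSUMPTION: decimal units, i.e. 1 TB = 10**12 bytes.

RECORD_BYTES = 100
RECORDS = 10 * 10**9                    # 10 billion records
TOTAL_BYTES = RECORD_BYTES * RECORDS    # 10**12 bytes


def throughput(total_bytes, seconds, machines):
    """Return (aggregate GB/s, per-machine MB/s) for one sort run."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e9, bytes_per_second / machines / 1e6


new_agg, new_per = throughput(TOTAL_BYTES, seconds=68, machines=1000)
old_agg, old_per = throughput(TOTAL_BYTES, seconds=209, machines=910)

print(f"68 s run:  {new_agg:.1f} GB/s aggregate, {new_per:.1f} MB/s per machine")
print(f"209 s run: {old_agg:.1f} GB/s aggregate, {old_per:.1f} MB/s per machine")
print(f"wall-clock speedup: {209 / 68:.2f}x")
```

Under these assumptions the 68-second run moves roughly 14.7 GB/s in aggregate, about three times the rate of the 209-second run.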

Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try. One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.

It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers. We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly.
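The post does not describe how the sort itself is structured. One common way to express a distributed sort in the MapReduce style is range partitioning: sample the keys to pick split points, route each record to the partition that covers its key range (the "map" side), sort each partition independently (the "reduce" side), and read the partitions back in order. The single-process toy below sketches that idea only. All function names are hypothetical, the whole record is treated as its key, and nothing here is Google's implementation.

```python
import bisect
import random


def choose_splitters(records, num_partitions, sample_size, rng):
    """Pick num_partitions - 1 split keys from a random sample of the input."""
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def partition(records, splitters):
    """'Map' side: route each record to the bucket covering its key range."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for rec in records:
        buckets[bisect.bisect_right(splitters, rec)].append(rec)
    return buckets


def range_partitioned_sort(records, num_partitions=4, sample_size=100, seed=0):
    """Sort by range-partitioning, then sorting each partition on its own."""
    if not records:
        return []
    rng = random.Random(seed)
    splitters = choose_splitters(records, num_partitions, sample_size, rng)
    buckets = partition(records, splitters)
    # 'Reduce' side: in a real cluster each bucket would be sorted on a
    # different machine; concatenating them in order gives a global sort.
    return [rec for bucket in buckets for rec in sorted(bucket)]


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy input: 1,000 random 10-byte records (the real runs used 100 bytes).
    data = [bytes(rng.getrandbits(8) for _ in range(10)) for _ in range(1000)]
    print(range_partitioned_sort(data) == sorted(data))
```

Because every record in bucket `i` is smaller than every record in bucket `i + 1`, no merge step is needed after the per-partition sorts. The quality of the sampled split points decides how evenly the work is spread.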

An interesting question came up while running experiments at such a scale: Where do you put 1PB of sorted data? We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks). To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.
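To see why a disk failure per run is unsurprising, the expected number of failures can be estimated from the disk count and test duration given above. The annualized failure rate below (3%) is purely an illustrative assumption and does not come from the post; the model also assumes independent failures at a constant rate, which real disks only approximate.

```python
import math

DISKS = 48_000             # from the post
TEST_HOURS = 6 + 2 / 60    # from the post: six hours and two minutes
ASSUMED_AFR = 0.03         # ASSUMPTION: 3% annualized failure rate per disk
HOURS_PER_YEAR = 24 * 365


def expected_failures(disks, hours, afr):
    """Expected number of disk failures during a test of the given length."""
    return disks * afr * hours / HOURS_PER_YEAR


def prob_at_least_one(disks, hours, afr):
    """Poisson approximation of P(at least one failure during the test)."""
    return 1 - math.exp(-expected_failures(disks, hours, afr))


print(f"expected failures per run: "
      f"{expected_failures(DISKS, TEST_HOURS, ASSUMED_AFR):.2f}")
print(f"P(at least one failure):   "
      f"{prob_at_least_one(DISKS, TEST_HOURS, ASSUMED_AFR):.2f}")
```

With the assumed rate, the expected count comes out at roughly one failure per run, which is the scale at which writing three copies of each file to different disks stops being optional.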

Significantly improved handling of the so-called "stragglers" (parts of computation that run slower than expected) was a key software technique that helped sort 1PB. And of course, there are many other factors that contributed to the result. We'll be discussing all of this and more in an upcoming publication. And you can also check out the video from our recent Technology RoundTable Series.
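The post does not say how stragglers were handled. One well-known approach in MapReduce-style systems is to launch backup copies of tasks that are still running late in the job and accept whichever copy finishes first. The toy model below illustrates only that general idea: the task durations are invented, the function names are hypothetical, and it assumes spare machines are always available for backups.

```python
def job_time(durations):
    """Without backups, the job ends when its slowest task ends."""
    return max(durations)


def job_time_with_backups(durations, backup_start, backup_duration):
    """Job time when every task still running at `backup_start` gets a
    backup copy that takes `backup_duration`; a task is done as soon as
    either its original or its backup finishes."""
    finish_times = []
    for d in durations:
        if d > backup_start:
            d = min(d, backup_start + backup_duration)
        finish_times.append(d)
    return max(finish_times)


# Hypothetical task durations in minutes; the last task is a straggler.
durations = [10, 11, 9, 10, 12, 45]

print(job_time(durations))                                   # 45
print(job_time_with_backups(durations, backup_start=12,
                            backup_duration=10))             # 22
```

In this model a single slow task no longer dictates the finish time, at the cost of a small amount of duplicated work.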

Link: http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html



By eygle on 2011-03-17 10:09 | Comments (18) | IT News | 2760 |

