Question

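Which is faster for processing a 1TB file: a single machine or 5 networked machines? ("Processing" here means finding the single UTF-16 character that occurs most often in the 1TB file.) The data transfer rate is 1Gbit/sec, the entire 1TB file resides on one computer, and each computer has a quad-core CPU.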

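Below is my attempt, which uses an array of longs (array size 2^16) to keep track of the character counts. This should fit in a single machine's memory, since 2^16 x 2^3 (size of a long) = 2^19 bytes = 0.5MB. Any help (links, comments, suggestions) would be much appreciated. I used the latency numbers cited by Jeff Dean and tried to use the best approximations I know of. The final answers are: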

Single machine: 5.8 hrs (due to the slowness of reading from disk)
5 networked machines: 7.64 hrs (due to reading from disk and the network)
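For reference, the counting approach described above can be sketched roughly as follows. This is only an illustration under some assumptions that are not stated in the question: the file is UTF-16LE, a "character" means one 16-bit code unit (so a surrogate pair counts as two units), and the function name `most_frequent_utf16_unit` is made up for this sketch.

```python
import io
import sys
from array import array


def most_frequent_utf16_unit(stream, chunk_size=1 << 20):
    """Return (code_unit, count) for the most frequent 16-bit unit in a binary stream.

    Assumes UTF-16LE input and counts code units, not full code points.
    """
    counts = [0] * (1 << 16)  # one counter per possible 16-bit value
    leftover = b""            # holds a dangling odd byte between chunks
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)  # only whole 2-byte units
        units = array("H")                    # unsigned 16-bit values
        units.frombytes(data[:usable])
        if sys.byteorder == "big":
            units.byteswap()                  # file is assumed little-endian
        leftover = data[usable:]
        for unit in units:
            counts[unit] += 1
    best = max(range(1 << 16), key=counts.__getitem__)
    return best, counts[best]


# Small in-memory example; a real run would pass open(path, "rb") instead.
sample = io.BytesIO("abracadabra".encode("utf-16-le"))
print(most_frequent_utf16_unit(sample, chunk_size=3))
```

The odd `chunk_size` in the example is there only to exercise the case where a chunk boundary splits a 2-byte unit.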

1) Single Machine
 a) Time to Read File from Disk --> 5.8 hrs
   -If it takes 20ms to read 1MB seq from disk, 
    then to read 1TB from disk takes: 
    20ms/1MB x 1024MB/GB x 1024GB/TB = 20,972 secs 
    = 350 mins = 5.8 hrs 

 b) Time needed to fill array w/complete count data 
    --> 0 sec since it is computed while doing step 1a
    -At 0.5 MB, the count array fits into L2 cache. 
     Since L2 cache takes only 7 ns to access, 
     the CPU can read & write to the count array 
     while waiting for the disk read. 
     Time: 0 sec since it is computed while doing step 1a

 c) Iterate thru entire array to find max count --> 0.00625ms
   -Since it takes 0.0125ms to read & write 1MB from 
    L2 cache and array size is 0.5MB, then the time 
    to iterate through the array is: 
    0.0125ms/MB x 0.5MB = 0.00625ms  

 d) Total Time 
    Total=a+b+c=~5.8 hrs (due to slowness of reading from disk)
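
The single-machine arithmetic above can be re-checked with a few lines of Python. This is just a sketch that plugs in the same figures (20 ms per sequentially read MB, 0.0125 ms per MB for L2 cache); the constant names are made up for illustration.

```python
MS_PER_MB_DISK = 20.0      # sequential disk read, figure used in step 1a
MS_PER_MB_L2 = 0.0125      # L2 read & write, figure used in step 1c
MB_PER_TB = 1024 * 1024
ARRAY_MB = 0.5             # 2^16 counters x 8 bytes

disk_seconds = MB_PER_TB * MS_PER_MB_DISK / 1000.0  # step 1a
scan_ms = ARRAY_MB * MS_PER_MB_L2                   # step 1c
total_hours = (disk_seconds + scan_ms / 1000.0) / 3600.0

print(f"disk read: {disk_seconds:.2f} s")
print(f"array scan: {scan_ms} ms")
print(f"total: {total_hours:.2f} hrs")
```

The total is dominated entirely by the disk read; the array scan does not change it at any visible precision.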

2) 5 Networked Machines   
   a) Time to transfer 1TB over 1Gbit/s --> 6.48 hrs
      1TB x 1024GB/TB x 8bits/B x 1s/Gbit 
      = 8,192s = 137m = 2.3hr
      But since the original machine keeps a fifth of the data, it
      only needs to send (4/5)ths of data, so the time required is: 
      2.3 hr x 4/5 = 1.84 hrs
      *But to send the data, the data needs to be read, which
       is (4/5)(answer 1a) = (4/5)(5.8 hrs) = 4.64 hrs
       So total time = 1.84hrs + 4.64 hrs = 6.48 hrs

   b) Time to fill array w/count data from original machine --> 1.16 hrs
      -The original machine (that had the 1TB file) still needs to
       read the remainder of the data in order to fill the array with
       count data. So this requires (1/5)(answer 1a)=1.16 hrs.  
       The CPU time to read & write to the array is negligible, as 
       shown in 1b.      

   c) Time to fill other machine's array w/counts --> not counted   
      -As the file is being transferred, the count array can be 
       computed. This time is not counted. 

   d) Time required to receive 4 arrays --> (2^-6)s
      -Each count array is 0.5MB
       0.5MB x 4 arrays x 8bits/B x 1s/Gbit 
       = 2^20B/2 x 2^2 x 2^3 bits/B x 1s/2^30bits 
       = 2^25/2^31s = (2^-6)s 

   e) Time to merge arrays  
      --> 0 sec (since they can be merged while receiving)

   f) Total time 
      Total=a+b+c+d+e =~ a+b =~ 6.48 hrs + 1.16 hrs = 7.64 hrs
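
The networked estimate can be re-checked the same way. This sketch follows the steps above literally (disk read and network send added together, not overlapped) and keeps the same conventions, including treating 1 Gbit as 2^30 bits; the variable names are made up for illustration. Without rounding the intermediate values it comes out at roughly 7.6 hrs, in line with the 7.64 hrs above.

```python
BITS_PER_GBIT = 2**30   # the estimate treats 1 Gbit as 2^30 bits
BYTES_PER_TB = 2**40
MB_PER_TB = 2**20
MS_PER_MB_DISK = 20.0

disk_full_s = MB_PER_TB * MS_PER_MB_DISK / 1000.0  # read all 1TB from disk
net_full_s = BYTES_PER_TB * 8 / BITS_PER_GBIT      # send all 1TB over the link

# Step a: read and send 4/5 of the file to the other four machines.
step_a_s = 0.8 * disk_full_s + 0.8 * net_full_s
# Step b: the source machine reads and counts the remaining 1/5.
step_b_s = 0.2 * disk_full_s
# Receiving four count arrays of 2^16 x 8 bytes each.
arrays_s = 4 * (2**16 * 8) * 8 / BITS_PER_GBIT

total_hours = (step_a_s + step_b_s + arrays_s) / 3600.0
print(f"send 1TB: {net_full_s:.0f} s, arrays: {arrays_s} s")
print(f"total: {total_hours:.2f} hrs")
```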

Answer 1

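This is not an answer, just a longer comment. You miscalculated the size of the frequency array. A 1 TiB file contains about 550 Gsyms, and since nothing is said about their expected frequencies, you need a count array of at least 64-bit integers (that is, 8 bytes/element). The total size of this frequency array is 2^16 * 8 = 2^19 bytes, or just 512 KiB, not 4 GiB as you miscalculated. Sending this data over a 1 Gbps link takes only ≈4.3 ms (protocol headers add roughly 3% if you use TCP/IP over Ethernet with an MTU of 1500 bytes; less with jumbo frames, but those are not widely supported). An array of this size also fits comfortably in the CPU cache.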
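
The figures in the paragraph above can be reproduced with a short calculation. This is only a sketch: it assumes 1 Gbps means 10^9 bits/s and takes the ~3% protocol overhead at face value; the variable names are made up for illustration.

```python
FILE_BYTES = 2**40           # 1 TiB
BYTES_PER_SYMBOL = 2         # one UTF-16 code unit
LINK_BITS_PER_S = 1e9        # assuming 1 Gbps = 10^9 bits/s
PROTOCOL_OVERHEAD = 0.03     # rough TCP/IP over Ethernet overhead, MTU 1500

symbols = FILE_BYTES // BYTES_PER_SYMBOL    # about 550 G symbols
array_bytes = 2**16 * 8                     # 64-bit counters
send_ms = array_bytes * 8 / LINK_BITS_PER_S * 1000.0
send_ms_with_overhead = send_ms * (1 + PROTOCOL_OVERHEAD)

# A single symbol could in principle fill the whole file, which is more
# than a 32-bit counter can hold, hence the 64-bit counters.
needs_64_bit = symbols > 2**32 - 1

print(f"{symbols / 1e9:.0f} Gsyms, array = {array_bytes // 1024} KiB")
print(f"send time = {send_ms_with_overhead:.1f} ms, 64-bit needed: {needs_64_bit}")
```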

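You have grossly overestimated the time it takes to process the data and extract the frequencies, and you have also overlooked the fact that it can overlap with the disk reads. In fact, updating a frequency array that resides in the CPU cache is so fast that the computation time is negligible, since most of it overlaps with the slow disk reads. But you have underestimated the time it takes to read the data. Even with a multicore CPU you still have only one path to the hard drive, so you still need the full 5.8 hrs to read the data in the single-machine case.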
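
One way to picture the overlap described above is a reader thread that keeps fetching chunks while the main thread counts. This is a minimal sketch, not a tuned implementation; it assumes UTF-16LE input, counts 16-bit code units, and the function name `count_with_overlapped_reads` is made up for illustration.

```python
import io
import queue
import sys
import threading
from array import array
from collections import Counter


def count_with_overlapped_reads(stream, chunk_size=1 << 20, depth=4):
    """Count 16-bit units while a background thread keeps reading ahead."""
    chunks = queue.Queue(maxsize=depth)  # bounded, so memory use stays small

    def reader():
        while True:
            chunk = stream.read(chunk_size)
            chunks.put(chunk)            # an empty chunk signals end of file
            if not chunk:
                return

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    counts = Counter()
    leftover = b""
    while True:
        chunk = chunks.get()
        if not chunk:
            break
        data = leftover + chunk
        usable = len(data) - (len(data) % 2)
        units = array("H")
        units.frombytes(data[:usable])
        if sys.byteorder == "big":
            units.byteswap()             # input is assumed little-endian
        leftover = data[usable:]
        counts.update(units)
    thread.join()
    return counts


counts = count_with_overlapped_reads(
    io.BytesIO("hello world".encode("utf-16-le")), chunk_size=5
)
print(counts.most_common(1))
```

Blocking file reads release the interpreter lock in CPython, so the reader thread can wait on the disk while the counting loop runs; the elapsed time is then bounded below by whichever of the two is slower, which here is the disk.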

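In fact, this is a simple kind of data processing that benefits neither from parallel networked processing nor from having more than one CPU core. This is why supercomputers and other fast networked processing systems use distributed parallel file storage, which can deliver many GB/s of aggregate read/write speed.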

Answer 2

If your source machine is one of the 5, you only need to send 0.8TB.

It may not even make sense to send the data to the other machines. Consider this:

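For the source machine to send the data, it must first hit the disk to read the data into main memory before sending it over the network. If the data is already in main memory and is not being processed there, you are wasting that opportunity.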

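So, assuming that loading into the CPU cache is much cheaper than moving data from disk to memory or over the network (which is true unless you are dealing with exotic hardware), you are better off just doing the work on the source machine. The only case where splitting up the task makes sense is if the "file" was somehow created or populated in a distributed way to begin with.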

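So you should count only the disk read time for the 1TB file, plus a small amount of overhead for L1/L2 cache and CPU operations. The cache access pattern is optimal because the read is sequential, so you only miss the cache once per piece of data.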

The main point here is that the disk is the primary bottleneck, and it overshadows everything else.
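
To put rough numbers on this, here is a small sketch comparing the two setups under an optimistic assumption that is not in the question: disk reads, network sends and counting overlap perfectly, so elapsed time is set by the slowest stage. It reuses the question's figures (20 ms per MB of sequential disk read, 1 Gbit treated as 2^30 bits); the function names are made up for illustration.

```python
DISK_FULL_S = 2**20 * 20.0 / 1000.0  # read 1TB at 20 ms per MB
NET_FULL_S = 2**40 * 8 / 2**30       # send 1TB over a 1 Gbit/s link


def single_machine_hours():
    """Counting overlaps the read, so only the disk time remains."""
    return DISK_FULL_S / 3600.0


def networked_hours_with_overlap(machines=5):
    """Best case: sending overlaps reading, but the one disk is still read in full."""
    remote_fraction = (machines - 1) / machines
    net_s = remote_fraction * NET_FULL_S
    return max(DISK_FULL_S, net_s) / 3600.0


print(f"single machine: {single_machine_hours():.2f} hrs")
print(f"5 machines, perfect overlap: {networked_hours_with_overlap():.2f} hrs")
```

Even in this best case the networked setup only ties with the single machine, because the whole file still has to come off the one disk.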
