Question

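The MapReduce papers describe input files being partitioned into M input splits. I know that HDFS in Hadoop automatically partitions files into 64 MB blocks (by default) and then replicates these blocks to several other nodes in the cluster to provide fault tolerance. I'd like to know whether this partitioning of files in HDFS is the input splitting described in the MapReduce papers. Is fault tolerance the only reason for this splitting, or are there more important reasons?

And what if I run MapReduce over a cluster of nodes without a distributed file system (data only on local disks with a common file system)? Do I need to split the input files on local disk before the map phase?

Thanks for your answers.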

Answer 1

I'd like to add some missing concepts (the other answers confused me).

HDFS

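Files are stored as blocks (for fault / node tolerance). The block size is typically 64 MB to 128 MB; 64 MB is the default in the question. So a file is split into blocks, and the blocks are stored on different nodes in the cluster. Each block is replicated according to the replication factor (default = 3).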
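To make the block picture concrete, here is a minimal Python sketch of cutting a file into fixed-size blocks and assigning each block to three nodes. The function names and the round-robin placement are my own assumptions for illustration; real HDFS placement is rack-aware and more involved.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical policy for illustration only, not the HDFS algorithm.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

# A 200 MB file with 64 MB blocks: three full blocks plus one 8 MB block.
blocks = split_into_blocks(200 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
```

Losing any single node still leaves two copies of every block, which is the fault-tolerance point above.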

Map-Reduce

Files already stored in HDFS are logically divided into INPUT-SPLITS. The split size can be set by the user:

Property name           Type   Default value
mapred.min.split.size   int    1
mapred.max.split.size   long   Long.MAX_VALUE

The split size is then calculated by the formula:

max(minimumSize, min(maximumSize, blockSize))

Note: splits are logical.
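As a worked example of the formula, here is a small Python sketch. The helper names are hypothetical, and this is a simplification of what Hadoop actually does when it builds splits; it only shows how the three values interact and that a split is just an (offset, length) description, with no bytes copied.

```python
MB = 1024 * 1024
LONG_MAX = 2**63 - 1  # stands in for Java's Long.MAX_VALUE

def compute_split_size(block_size, min_size=1, max_size=LONG_MAX):
    """max(minimumSize, min(maximumSize, blockSize)) from the formula above."""
    return max(min_size, min(max_size, block_size))

def logical_splits(file_size, split_size):
    """Describe splits as (offset, length) pairs; the file itself is untouched."""
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]

# With the defaults, the split size equals the block size.
default_size = compute_split_size(64 * MB)
# Raising the minimum above the block size gives larger splits.
larger_size = compute_split_size(64 * MB, min_size=128 * MB)
# Lowering the maximum below the block size gives smaller splits.
smaller_size = compute_split_size(64 * MB, max_size=32 * MB)

splits = logical_splits(200 * MB, default_size)
```

Each split is normally handled by one map task, so changing the split size changes how many mappers run over the same stored blocks.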

Hopefully that sets the background. Now, to answer your questions:

I'd like to know whether this partitioning of files in HDFS is the input splitting described in the MapReduce papers.

No, absolutely not. HDFS blocks and Map-Reduce splits are not the same thing.

Is fault tolerance the only reason for this splitting, or are there more important reasons?

No, distributed computation is the reason.

And what if I run MapReduce over a cluster of nodes without a distributed file system (data only on local disks with a common file system)? Do I need to split the input files on local disk before the map phase?

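In your case, I guess, yes: you would have to split the input files for the Map phase, and you would also have to split the intermediate output (from the Mappers) for the Reduce phase. Other problems you would face: data consistency, fault tolerance, data loss (in Hadoop = 1%).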
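If you do have to split a local file yourself, the main thing is not to cut a record in half. Here is a rough Python sketch, assuming newline-delimited records; `line_aligned_splits` is a name I made up, not part of any framework.

```python
def line_aligned_splits(data: bytes, num_splits: int):
    """Cut data into at most num_splits (start, end) byte ranges.

    Every range ends right after a newline (or at end of data), so each
    mapper sees only whole lines.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    size = len(data)
    target = max(1, -(-size // num_splits))  # ceiling division
    ranges = []
    start = 0
    while start < size:
        end = min(start + target, size)
        # Extend the range to the end of the line that contains byte end - 1.
        newline = data.find(b"\n", end - 1)
        end = size if newline == -1 else newline + 1
        ranges.append((start, end))
        start = end
    return ranges

data = b"alpha\nbeta\ngamma\ndelta\n"
ranges = line_aligned_splits(data, 2)
chunks = [data[s:e] for s, e in ranges]
```

Each range can then be handed to a separate map worker. On a real cluster you would still need to get each range to the node that processes it, which is the part a distributed file system normally takes care of.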

Map-Reduce is made for distributed computing, so using Map-Reduce in a non-distributed environment is not useful.

Thanks

Answer 2

I'd like to know whether this partitioning of files in HDFS is the input splitting described in the MapReduce papers.

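No. Input splitting in MapReduce is there to exploit the computing power of multiple processors in the reduce phase. The mapper receives a large amount of data and splits it into logical partitions (most of the time as specified by the programmer's custom mapper implementation). The data is then transferred to individual nodes, where independent processes called reducers process it, and finally the results are collated.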
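The flow described above (map, partition by key, reduce, collate) can be sketched in a few lines of Python. This is a single-process toy word count under my own assumptions, not Hadoop code: `map_words`, `partition` and `run_job` are hypothetical names, and the CRC32-based partitioner is just one stable way to pick a reducer.

```python
import zlib
from collections import defaultdict

def map_words(split_text):
    """Map step: emit (word, 1) for every word in one input split."""
    for word in split_text.split():
        yield word.lower(), 1

def partition(key, num_reducers):
    """Pick the reducer for a key; crc32 is stable across runs, unlike hash()."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_job(splits, num_reducers=2):
    # Map and shuffle: each reducer bucket groups the values by key.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for split_text in splits:
        for key, value in map_words(split_text):
            buckets[partition(key, num_reducers)][key].append(value)
    # Reduce and collate: sum the values per key, then merge all buckets.
    result = {}
    for bucket in buckets:
        for key, values in bucket.items():
            result[key] = sum(values)
    return result

counts = run_job(["the quick brown fox", "the lazy dog", "The fox"])
```

Because all values for a given key land in the same bucket, each reducer can work independently of the others.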

Is fault tolerance the only reason for this splitting, or are there more important reasons?

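No, that is not the only reason for doing it. You can compare it with the block size at the file system level, which is there so that data is transferred in blocks, compressed on a per-block basis, and given I/O buffers of a matching size.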

What is the main reason for input splitting in MapReduce?
