Question

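I am developing a MapReduce program that needs to insert entities into a database. Because of some performance issues, the inserts should be done in the combiner. My program has no reducer, so there are only a mapper and a combiner. Since the Hadoop engine may not execute the combiner (the combiner is optional), how can I force it to run the combiner?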

Answer 1

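The MapReduce framework provides no supported way to force execution of the combiner. The combiner may be called 0, 1, or multiple times. The framework is free to make that decision.

The current implementation decides whether to run the combiner based on spills to disk during map task execution. The Apache Hadoop documentation for mapred-default.xml documents several configuration properties that can influence spill activity.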

<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a
  thread will begin to spill the contents to disk in the background. Note that
  collection will not block if this threshold is exceeded while a spill is
  already in progress, so spills may be larger than this threshold when it is
  set to less than .5</description>
</property>

<property>
  <name>mapreduce.task.io.sort.factor</name>
  <value>10</value>
  <description>The number of streams to merge at once while sorting
  files.  This determines the number of open file handles.</description>
</property>

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting 
  files, in megabytes.  By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>

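In addition, there is an undocumented configuration property, mapreduce.map.combine.minspills, which defines the minimum number of spills required before the combiner is run. If it is not specified, the default value is 3.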
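
For reference, here is a minimal sketch of how that property might be written in the same form as the entries above, assuming it is set in mapred-site.xml or an equivalent job configuration. The value shown is the default mentioned above. The description text is illustrative wording only, since the property does not appear in mapred-default.xml.

<property>
  <name>mapreduce.map.combine.minspills</name>
  <value>3</value>
  <description>Illustrative wording, not taken from mapred-default.xml: the
  minimum number of spills required before the combiner is run.</description>
</property>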

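It may be possible to tune these configuration properties so that enough spills occur to exceed mapreduce.map.combine.minspills, and thereby guarantee at least one invocation of the combiner. However, I can't recommend that, because it would be very brittle. The logic would be highly sensitive to external factors such as the size of the input data. It would also rely on specific implementation details of the current MapReduce codebase. The internal algorithms are subject to change, and those changes could break your assumptions. There is effectively no public API for forcing the combiner to run.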

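Also keep in mind that, unlike a reducer, a combiner may not have a complete picture of all the values associated with a particular key. If multiple map tasks process records with the same key, the reducer is the only place where all of those values are guaranteed to be grouped together. Even within a single map task, the combiner may execute multiple times, each time with a different subset of the key values drawn from the input split it processes.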

For a more standard solution to the problem of exporting data from Hadoop to a relational database, consider DBOutputFormat or Sqoop.
