Question

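I want to insert N rows into an HBase table from each mapper, in batches. I currently know of two ways to do this:

1) Create a list of Put objects and call the put(List<Put> puts) method of an HTable instance, making sure the autoFlush parameter is disabled.
2) Use the TableOutputFormat class and call context.write(rowKey, put).

Which one is better?

With the first approach, context.write() is not needed, because hTable.put(putsList) writes the data directly to the table. My mapper class extends Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>, so which classes should I use for KEYOUT and VALUEOUT?

With the second approach, I have to call context.write(rowKey, put) N times. Is there a way to use context.write() with a list of Put operations?

Is there any other way to do this with MapReduce?

Thanks in advance.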

Answer 1

I prefer the second option, where batching happens naturally (with MapReduce there is no need to build a list). For the details, see point 2 below.

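1) Your first option, List<Put>, is generally used in a standalone HBase Java client. Internally it is governed by hbase.client.write.buffer, which is set in one of your configuration XML files like this: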

<property>
    <name>hbase.client.write.buffer</name>
    <value>20971520</value> <!-- 20971520 bytes = 20 MB -->
</property>

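The default value is 2 MB (2097152 bytes). Once the buffer fills up, all buffered puts are flushed and actually written to your table. This is the same mechanism BufferedMutator uses, as described in point 2.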
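To make the size-triggered flush concrete, here is a minimal plain-Java sketch. It is a toy model and not HBase's implementation: the class WriteBufferSketch and its methods are hypothetical names, rows are plain byte arrays, and "sending" a batch only increments counters. It assumes a flush happens as soon as the accumulated size exceeds the configured capacity.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a size-triggered client write buffer; hypothetical, not HBase code. */
public class WriteBufferSketch {
    private final long capacityBytes;              // plays the role of hbase.client.write.buffer
    private final List<byte[]> pending = new ArrayList<>();
    private long pendingBytes = 0;
    private int flushCount = 0;
    private int rowsSent = 0;

    public WriteBufferSketch(long capacityBytes) {
        this.capacityBytes = capacityBytes;
    }

    /** Buffers one serialized row and flushes once the accumulated size exceeds capacity. */
    public void add(byte[] row) {
        pending.add(row);
        pendingBytes += row.length;
        if (pendingBytes > capacityBytes) {
            flush();
        }
    }

    /** "Sends" everything buffered so far (here it is only counted) and resets the buffer. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        rowsSent += pending.size();
        flushCount++;
        pending.clear();
        pendingBytes = 0;
    }

    public int flushCount() { return flushCount; }
    public int rowsSent() { return rowsSent; }
    public int pendingRows() { return pending.size(); }

    public static void main(String[] args) {
        WriteBufferSketch buffer = new WriteBufferSketch(2L * 1024 * 1024); // 2 MB
        for (int i = 0; i < 5000; i++) {
            buffer.add(new byte[1024]);            // pretend each put serializes to about 1 KB
        }
        System.out.println("size-triggered flushes: " + buffer.flushCount());
        System.out.println("rows still buffered:    " + buffer.pendingRows());
        buffer.flush();                            // final flush, as a client does on close
        System.out.println("rows sent in total:     " + buffer.rowsSent());
    }
}
```

The point of the model is that rows left in the buffer at the end are only written by an explicit final flush, which is why a client that disables autoFlush has to flush or close the table itself.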

2) As for the second option, take a look at the TableOutputFormat documentation:

org.apache.hadoop.hbase.mapreduce
Class TableOutputFormat<KEY>

java.lang.Object
org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
org.apache.hadoop.hbase.mapreduce.TableOutputFormat<KEY>
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

@InterfaceAudience.Public
@InterfaceStability.Stable
public class TableOutputFormat<KEY>
extends org.apache.hadoop.mapreduce.OutputFormat<KEY,Mutation>
implements org.apache.hadoop.conf.Configurable
Convert Map/Reduce output and write it to an HBase table. The KEY is ignored while the output value must be either a Put or a Delete instance.

Another way to see this is in the code, shown below.

/**
 * Writes a key/value pair into the table.
 *
 * @param key  The key.
 * @param value  The value.
 * @throws IOException When writing fails.
 * @see RecordWriter#write(Object, Object)
 */
@Override
public void write(KEY key, Mutation value)
throws IOException {
  if (!(value instanceof Put) && !(value instanceof Delete)) {
    throw new IOException("Pass a Delete or a Put");
  }
  mutator.mutate(value);
}

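Conclusion: a context.write(rowkey, putlist) call is not possible with this API. write() accepts a single Mutation, so you call context.write() once per Put.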

However, the documentation for BufferedMutator (the mutator behind mutator.mutate in the code above) says:

Map/reduce jobs benefit from batching, but have no natural flush point. {@code BufferedMutator} receives the puts from the M/R job and will batch puts based on some heuristic, such as the accumulated size of the puts, and submit batches of puts asynchronously so that the M/R logic can continue without interruption.

So your batching happens naturally (through BufferedMutator), as described above.
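The write() contract quoted above can be modeled in plain Java to show what "once per Put" looks like in practice. This is only a sketch with stand-in types: SingleMutationWriterSketch, Mutation, PutStub, DeleteStub and OtherStub are hypothetical and are not HBase classes. The key is ignored, anything other than a put or a delete is rejected, and a caller holding a list loops over it.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for the quoted write() contract; none of these types are HBase classes. */
public class SingleMutationWriterSketch {
    public interface Mutation {}

    public static final class PutStub implements Mutation {
        final String row;
        public PutStub(String row) { this.row = row; }
    }

    public static final class DeleteStub implements Mutation {
        final String row;
        public DeleteStub(String row) { this.row = row; }
    }

    /** Some other mutation type, used only to show the rejection path. */
    public static final class OtherStub implements Mutation {}

    private final List<Mutation> handedToMutator = new ArrayList<>();

    /** Mirrors the quoted check: the key is ignored, the value must be a put or a delete. */
    public void write(Object ignoredKey, Mutation value) throws IOException {
        if (!(value instanceof PutStub) && !(value instanceof DeleteStub)) {
            throw new IOException("Pass a Delete or a Put");
        }
        handedToMutator.add(value);    // the real writer calls mutator.mutate(value) here
    }

    /** What a mapper holding N puts does: one write() call per element. */
    public void writeAll(List<? extends Mutation> values) throws IOException {
        for (Mutation value : values) {
            write(null, value);
        }
    }

    public int handedOver() { return handedToMutator.size(); }

    public static void main(String[] args) throws IOException {
        SingleMutationWriterSketch writer = new SingleMutationWriterSketch();
        List<Mutation> puts = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            puts.add(new PutStub("row-" + i));
        }
        writer.writeAll(puts);
        System.out.println("mutations handed over: " + writer.handedOver());
    }
}
```

Because each call hands a single mutation to the buffering layer, the N calls cost little by themselves; the grouping into batches happens further down, which is the behavior the BufferedMutator description refers to.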
