Question

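I am trying to bulk load data into HBase, and I am writing the code in Scala with Spark. Every time, though, the data is loaded into a single region only. I need it to be loaded into multiple regions. I used the code below.

HBase configuration: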

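import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration

def getConf: Configuration = {
  val hbaseSitePath = "/etc/hbase/conf/hbase-site.xml"
  val conf = HBaseConfiguration.create()
  conf.addResource(new Path(hbaseSitePath))
  // Allow up to 100 HFiles per region per column family during the bulk load
  conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 100)
  conf
}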

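With the configuration above I can load 80 GB of data into a single region.

But when I try to load the same amount of data into multiple regions with the configuration shown below, I get this exception:

java.io.IOException: Trying to load more than 32 hfiles to one family of one region

Updated configuration: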

def getConf: Configuration = {
  val hbaseSitePath = "/etc/hbase/conf/hbase-site.xml"
  val conf = HBaseConfiguration.create()
  conf.addResource(new Path(hbaseSitePath))
  conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 32)

  // 107374182 bytes is roughly 102 MB
  conf.setLong("hbase.hregion.max.filesize", 107374182)
  conf.set("hbase.regionserver.region.split.policy", "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy")
  conf
}

To save the records, I am using this code:

val kv = new KeyValue(Bytes.toBytes(key), columnFamily.getBytes(),
  columnName.getBytes(), columnValue.getBytes())

(new ImmutableBytesWritable(Bytes.toBytes(key)), kv)

// Here rdd is the input
rdd.saveAsNewAPIHadoopFile(pathToHFile, classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], conf)

val loadFiles = new LoadIncrementalHFiles(conf)
loadFiles.doBulkLoad(new Path(pathToHFile), hTable)
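
For context, one common way to prepare data for a bulk load is to assign every row key to the region whose key range contains it, so that each HFile falls within a single region. The sketch below shows only that lookup, in plain Scala with no HBase dependency. `RegionLookup`, `compareKeys` and `regionIndex` are hypothetical names, and it assumes the table is already pre-split and that its sorted region start keys have been fetched separately (for example via `RegionLocator.getStartKeys`, dropping the empty first key).

```scala
object RegionLookup {
  // Unsigned lexicographic comparison of two row keys,
  // which is the ordering HBase applies to row keys.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val diff = (a(i) & 0xff) - (b(i) & 0xff)
      if (diff != 0) return diff
      i += 1
    }
    a.length - b.length
  }

  // splitKeys: sorted start keys of every region except the first one
  // (the first region's start key is empty).
  // Returns the index of the region whose range contains rowKey,
  // i.e. the number of split keys that are <= rowKey.
  def regionIndex(splitKeys: IndexedSeq[Array[Byte]], rowKey: Array[Byte]): Int = {
    var lo = 0
    var hi = splitKeys.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (compareKeys(splitKeys(mid), rowKey) <= 0) lo = mid + 1 else hi = mid
    }
    lo
  }
}
```

In a Spark job, a lookup like this could sit inside a custom `Partitioner` used with `repartitionAndSortWithinPartitions`, so that each partition writes sorted rows for one region. If the table has only one region, every key maps to index 0 and all data still ends up in that region.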

Any help would be appreciated.

Answer 1

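You are running into this because 32 is the default limit per region. You should define a KeyPrefixRegionSplitPolicy to split your files, and then increase hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily as shown below: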

conf.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 1024)
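
To illustrate what a key-prefix split policy does, here is a minimal sketch in plain Scala. As far as I understand it, KeyPrefixRegionSplitPolicy truncates a candidate split point to a configured prefix length, so rows that share that prefix are never separated by a split. `KeyPrefixSketch` and `prefixSplitPoint` are hypothetical names used only for illustration; this is not the HBase implementation itself.

```scala
object KeyPrefixSketch {
  // Truncate a candidate split point to the configured prefix length,
  // so that all rows sharing that prefix stay in the same region.
  // A non-positive prefix length leaves the split point unchanged.
  def prefixSplitPoint(candidate: Array[Byte], prefixLength: Int): Array[Byte] =
    if (prefixLength > 0) candidate.take(math.min(prefixLength, candidate.length))
    else candidate
}
```

For example, with a prefix length of 7, a candidate split point of `user123_event9` becomes `user123`, so all rows for `user123` stay together.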

Also try increasing the file size:

  conf.setLong("hbase.hregion.max.filesize", 107374182)
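
The effect of the file size is easier to see with some rough arithmetic. The sketch below assumes that the HFile writer starts a new file each time the current one reaches hbase.hregion.max.filesize, that the 80 GB from the question all falls into one region and one column family, and that 80 GB means 80 GiB of HFile data. `HFileEstimate` and `estimatedHFiles` are hypothetical names.

```scala
object HFileEstimate {
  // Estimated number of HFiles a single writer produces if it starts a new
  // file each time the current one reaches maxFileSizeBytes (ceiling division).
  def estimatedHFiles(dataBytes: Long, maxFileSizeBytes: Long): Long =
    (dataBytes + maxFileSizeBytes - 1) / maxFileSizeBytes

  val eightyGiB: Long = 80L * 1024 * 1024 * 1024

  // 107374182 bytes is roughly 102 MB, the value used in the question.
  val withSmallLimit: Long = estimatedHFiles(eightyGiB, 107374182L)

  // 10 GiB, which is the HBase default for hbase.hregion.max.filesize.
  val withDefaultLimit: Long = estimatedHFiles(eightyGiB, 10L * 1024 * 1024 * 1024)
}
```

Under these assumptions, the 107374182 byte limit yields roughly 800 HFiles for one region and family, far above a limit of 32, while a 10 GiB limit yields about 8. That would explain why the first configuration worked and the second one did not, and why a larger file size (or more regions) reduces the per-region count.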
