I'm new to HBase and have run into a problem I can't find an answer to on Google.
I'm trying to bulk-load data from Hive into HBase using the salted-table approach described in https://www.opencore.com/blog/2016/10/efficient-bulk-load-of-hbase-using-spark/. The only wrinkle is that I need to insert multiple columns. The Hive table has the following columns: code, description, total_emp, salary.
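For context, the salt and SaltPrefixPartitioner helpers in my code come from that blog post. As I understand it, salt just prefixes the row key with hash(key) % modulus so writes spread across the pre-split regions; roughly like this (my paraphrase, not the blog's exact code):

object Salts {
  // My paraphrase of the blog's salting helper (assumption, not the exact code):
  // prefix the row key with hash(key) % modulus so writes spread across regions.
  def salt(key: String, modulus: Int): String = {
    val saltAsInt = Math.abs(key.hashCode) % modulus
    saltAsInt + ":" + key // e.g. code "11-1011" with modulus 2 becomes "0:11-1011" or "1:11-1011"
  }
}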
I'm trying to insert exactly the same columns into HBase. The HBase table is defined as follows:
'test2', {TABLE_ATTRIBUTES => {METADATA => {'SPLIT_POLICY' => 'org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy'}}}, {NAME => 'epsg_3857', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', KEEP_DELETED_CELLS => 'FALSE', CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'FAST_DIFF', TTL => 'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', BLOCKCACHE => 'true', BLOCKSIZE => '1000000', METADATA => {'NEW_VERSION_BEHAVIOR' => 'false'}}
However, as soon as I include the salary column in the HFile, I keep getting this error:
java.io.IOException: Added a key not lexically larger than previous. Current cell = 0:0:11-1011/epsg_3857:Salary/1557231349613/Put/vlen=6/seqid=0, lastCell = 0:0:11-1011/epsg_3857:Total/1557231349613/Put/vlen=6/seqid=0
If I drop the salary column, or move it into a new column family, the HFile is created fine. But that shouldn't be necessary, since everything I've read says a single column family can hold many columns.
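Looking at the error more closely: both cells share the same row key (0:0:11-1011) and the same family, and the write fails when the Salary cell is appended after the Total cell. If I understand HBase's ordering correctly, cells within a row must be written in lexicographic qualifier order, and "Salary" sorts before "Total" byte-wise, which this quick check (using HBase's own comparison) confirms:

import org.apache.hadoop.hbase.util.Bytes

object QualifierOrderCheck extends App {
  // Bytes.compareTo does the same lexicographic byte comparison HBase uses.
  // A negative result means "Salary" sorts before "Total", so writing the
  // Total cell first and the Salary cell second would violate that order.
  println(Bytes.compareTo(Bytes.toBytes("Salary"), Bytes.toBytes("Total")))
}

(I also notice the row key in the error has the salt prefix twice, presumably because I call salt again on the already-salted key inside the flatMap below.)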
I tried increasing the block size from the default to 1 MB; same problem.
Here is my test code:
import Salter.Salts._
import org.apache.spark.sql.SparkSession
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
object SaltedKeyExample2 extends App {
System.setProperty("HADOOP_USER_NAME", "cloudera")
val hive_session = SparkSession
.builder()
.appName("Salted Key Example 2")
.master("local[*]")
.config("spark.submit.deployMode", "client")
.config("spark.yarn.jars", "hdfs://192.168.30.12:8020/user/cloudera/jars/*.jar")
//.config("hive.metastore.uris", "thrift://192.168.30.12:9083")
.enableHiveSupport()
.getOrCreate()
import hive_session.sql
val df_07 = sql("SELECT * from sample_07")
val df_08 = sql("SELECT * from sample_08")
df_07.filter(df_07("salary") > 150000).show()
val df_09 = df_07.join(df_08, df_07("code") === df_08("code")).select(df_07.col("code"), df_07.col("description"))
//val sourceRDD = df_09.rdd
val sourceRDD = df_07.filter(df_07("salary") > 150000).rdd
df_09.show()
val spp = new SaltPrefixPartitioner(modulus = 2)
val saltedRDD = sourceRDD.flatMap(r => {Seq((salt(r.getString(0), 2), (r.getString(0), r.get(1), r.get(2), r.get(3))))})
saltedRDD.foreach(x => println(x))
val partitionedRDD = saltedRDD.repartitionAndSortWithinPartitions(spp)
partitionedRDD.foreach(x => println(x))
val cells = saltedRDD.sortByKey(true).flatMap(r => {
val salted_keys = salt(r._1, 2)
val codes = r._2._1.toString()
val descriptions = r._2._2.toString()
val total = r._2._3.toString()
val salary = r._2._4.toString()
val colFamily = "epsg_3857"
val colFamily2 = "epsg_3857_2"
val colNameCodes = "Code"
val colNameDesc = "Description"
val colNameTotal = "Total"
val colNameSalary = "Salary"
Seq((new ImmutableBytesWritable(Bytes.toBytes(salted_keys)), new KeyValue(Bytes.toBytes(salted_keys), colFamily.getBytes(), colNameCodes.getBytes(), codes.getBytes())),
(new ImmutableBytesWritable(Bytes.toBytes(salted_keys)), new KeyValue(Bytes.toBytes(salted_keys), colFamily.getBytes(), colNameDesc.getBytes(), descriptions.getBytes())),
(new ImmutableBytesWritable(Bytes.toBytes(salted_keys)), new KeyValue(Bytes.toBytes(salted_keys), colFamily.getBytes(), colNameTotal.getBytes(), total.getBytes())),
(new ImmutableBytesWritable(Bytes.toBytes(salted_keys)), new KeyValue(Bytes.toBytes(salted_keys), colFamily.getBytes(), colNameSalary.getBytes(), salary.getBytes())))
})
cells.foreach(x => println(x))
// setup the HBase configuration
val baseConf = HBaseConfiguration.create(hive_session.sparkContext.hadoopConfiguration)
// NOTE: job creates a copy of the conf
val job = Job.getInstance(baseConf, "test2")
val connection = ConnectionFactory.createConnection(baseConf)
val table = connection.getTable(TableName.valueOf("test2"))
val regionLoc = connection.getRegionLocator(table.getName)
cells.foreach(x => println(x))
// Major gotcha(!) - see comments that follow
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLoc)
val conf = job.getConfiguration // important(!)
cells.foreach(x => println(x))
// write HFiles onto HDFS
cells.saveAsNewAPIHadoopFile(
"/tmp/test/hfiles",
classOf[ImmutableBytesWritable],
classOf[KeyValue],
classOf[HFileOutputFormat2],
conf)
println("hello")
}
I expect to be able to insert more than three columns into a single HBase column family, but right now I can't. Any help resolving this would be appreciated. Thanks.
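For what it's worth, my current guess is that the KeyValues for one row key have to be emitted in qualifier order, so Salary would need to come before Total. If that's right, a drop-in replacement for the Seq(...) inside the flatMap might look like this (untested sketch, reusing the vals already defined there):

    // Untested guess: sort by qualifier so the cells for one row come out in
    // lexicographic order (Code, Description, Salary, Total), which seems to
    // be what HFileOutputFormat2 requires within a single row key.
    Seq(colNameCodes -> codes, colNameDesc -> descriptions,
        colNameTotal -> total, colNameSalary -> salary)
      .sortBy(_._1)
      .map { case (qualifier, value) =>
        (new ImmutableBytesWritable(Bytes.toBytes(salted_keys)),
          new KeyValue(Bytes.toBytes(salted_keys), colFamily.getBytes(),
            qualifier.getBytes(), value.getBytes()))
      }

Is that the right way to handle multiple columns in one column family, or am I missing something else?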