Question

I am currently using Python to bulk load CSV data into an HBase table, and I am having trouble writing the appropriate HFiles with saveAsNewAPIHadoopFile.

My code currently looks as follows:

def csv_to_key_value(row):
    cols = row.split(",")
    result = ((cols[0], [cols[0], "f1", "c1", cols[1]]),
              (cols[0], [cols[0], "f2", "c2", cols[2]]),
              (cols[0], [cols[0], "f3", "c3", cols[3]]))
    return result

def bulk_load(rdd):
    conf = {}  # Omitted to simplify

    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    load_rdd = rdd.flatMap(lambda line: line.split("\n"))\
                  .flatMap(csv_to_key_value)
    if not load_rdd.isEmpty():
        load_rdd.saveAsNewAPIHadoopFile("file:///tmp/hfiles" + startTime,
                                        "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                        conf=conf,
                                        keyConverter=keyConv,
                                        valueConverter=valueConv)
    else:
        print("Nothing to process")

When I run this code, I get the following error:

java.io.IOException: Added a key not lexically larger than previous. Current cell = 10/f1:c1/1453891407213/Minimum/vlen=1/seqid=0, lastCell = /f1:c1/1453891407212/Minimum/vlen=1/seqid=0 at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.checkKey(AbstractHFileWriter.java:204)

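Since the error indicates that the key is the problem, I collected the elements of my RDD; they are as follows (formatted for readability):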

[(u'1', [u'1', 'f1', 'c1', u'A']),
 (u'1', [u'1', 'f2', 'c2', u'1A']),
 (u'1', [u'1', 'f3', 'c3', u'10']),
 (u'2', [u'2', 'f1', 'c1', u'B']),
 (u'2', [u'2', 'f2', 'c2', u'2B']),
 (u'2', [u'2', 'f3', 'c3', u'9']),

...

 (u'9', [u'9', 'f1', 'c1', u'I']),
 (u'9', [u'9', 'f2', 'c2', u'3C']),
 (u'9', [u'9', 'f3', 'c3', u'2']),
 (u'10', [u'10', 'f1', 'c1', u'J']),
 (u'10', [u'10', 'f2', 'c2', u'1A']),
 (u'10', [u'10', 'f3', 'c3', u'1'])]

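This matches my CSV exactly, in the correct order. As far as I understand, a key in HBase is defined by {row, family, timestamp}. The combination of row and family is unique and monotonically increasing for all entries in my data, and I have no control over the timestamp (which is the only problem I can imagine).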

Can anyone advise me on how to avoid or prevent this kind of problem?

Answer 1

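This was just a silly mistake on my part, and I feel a bit foolish. The keys must be in lexicographic order, so the order should be 1, 10, 2, 3 ... 8, 9. The easiest way to guarantee the correct order before loading is: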

rdd.sortByKey(True)
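
To make the ordering problem concrete, here is a minimal sketch in plain Python (no Spark or HBase required). It assumes the row keys are written as UTF-8 strings and compared byte by byte, which is what the "not lexically larger" error suggests; the helper name `first_out_of_order` is made up for illustration.

```python
# Row keys in the order they appear in the question's RDD (numeric CSV order).
row_keys = [str(n) for n in range(1, 11)]

def first_out_of_order(keys):
    """Return the first (previous, current) pair where current sorts
    before previous when compared as bytes, or None if keys are sorted."""
    encoded = [k.encode("utf-8") for k in keys]
    for prev, cur in zip(encoded, encoded[1:]):
        if cur < prev:
            return prev.decode("utf-8"), cur.decode("utf-8")
    return None

# The pair at which a sorted-key check would reject the input.
violation = first_out_of_order(row_keys)

# The same keys in byte-wise (lexicographic) order.
lexical_order = sorted(row_keys, key=lambda k: k.encode("utf-8"))

print(violation)
print(lexical_order)
```

The numeric order breaks at the step from "9" to "10", the same row key that appears as the current cell in the error message, while the sorted list passes the same check.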

I hope this saves at least one person a headache.
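
As a further illustration, the sketch below applies the same idea to the (key, value) records from the question, again in plain Python. It orders records by row key, then column family, then qualifier, on the assumption that cells within one row must also arrive in sorted order; `cell_sort_key` and the three sample lines are illustrative and not part of the original code.

```python
def csv_to_key_value(row):
    # Same record shape as in the question:
    # (row_key, [row_key, family, qualifier, value]) for each of three columns.
    cols = row.split(",")
    return tuple((cols[0], [cols[0], "f%d" % i, "c%d" % i, cols[i]])
                 for i in (1, 2, 3))

def cell_sort_key(record):
    # Hypothetical helper: compare row key, family and qualifier as bytes.
    _, (row_key, family, qualifier, _value) = record
    return (row_key.encode("utf-8"),
            family.encode("utf-8"),
            qualifier.encode("utf-8"))

# Sample rows taken from the RDD contents shown in the question.
lines = ["9,I,3C,2", "10,J,1A,1", "1,A,1A,10"]

records = [kv for line in lines for kv in csv_to_key_value(line)]
ordered = sorted(records, key=cell_sort_key)

for key, value in ordered:
    print(key, value)
```

In PySpark the same key function could be passed to `RDD.sortBy` in place of `sortByKey`, though for the data in the question sorting by row key alone was enough.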

Spark Streaming - HBase Bulk Load
