PySpark HBase bulk load: org.apache.hadoop.hbase.client.Put cannot be cast to org.apache.hadoop.hbase.Cell

Date: 2018-11-20 12:02:35

Tags: apache-spark pyspark hbase

I am trying to bulk load data into HBase from PySpark via HFiles, following this example: https://stackoverflow.com/a/35077987/10585126

My code:

conf = {"hbase.zookeeper.qourum": host,\
        "zookeeper.znode.parent": "/hbase", \
        "hbase.mapred.outputtable": table,\
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

def csv_to_key_value(row):
    # Turn one CSV line "rowkey,v1,v2,..." into a list of
    # (rowkey, [row, columnFamily, qualifier, value]) tuples
    puids = row.split(",")
    result = []
    for (num, puid) in list(enumerate(puids))[1:]:  # skip the row key itself
        if puid:
            val_tup = (puids[0], [puids[0], "sg", 'seg'+str(num)+'value', str(puid)])
            result.append(val_tup)
            ids_tup = (puids[0], [puids[0], "sg", 'seg'+str(num)+'id', str(num)])
            result.append(ids_tup)
    return result


data = sc.textFile(path_to_hdfs)
load_rdd = data.flatMap(lambda line: line.split("\n")).flatMap(csv_to_key_value).sortByKey(True)
load_rdd.saveAsNewAPIHadoopFile(path + str(sc.startTime),
                                "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                conf=conf,
                                keyConverter=keyConv,
                                valueConverter=valueConv)

But I cannot get past java.lang.ClassCastException: org.apache.hadoop.hbase.client.Put cannot be cast to org.apache.hadoop.hbase.Cell.

Has anyone run into this? I am using PySpark 1.6.0 (CDH 5.9.0) with hbase-examples-1.2.0-cdh5.9.0.jar and spark-examples-1.6.0-cdh5.9.0-hadoop2.6.0-cdh5.9.0.jar.

P.S. Loading with Puts works fine!
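For reference, the Put-based write that does work for me is roughly the following (same key/value converters and the same load_rdd, but TableOutputFormat with saveAsNewAPIHadoopDataset instead of HFileOutputFormat2; host and table are the same variables as above):

conf_put = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}

load_rdd.saveAsNewAPIHadoopDataset(conf=conf_put,
                                   keyConverter=keyConv,
                                   valueConverter=valueConv)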

0 Answers
