I'm trying to bulk load data from PySpark into HBase using HFiles, following for example: https://stackoverflow.com/a/35077987/10585126
My code:
conf = {"hbase.zookeeper.qourum": host,\
"zookeeper.znode.parent": "/hbase", \
"hbase.mapred.outputtable": table,\
"mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",\
"mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",\
"mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
def csv_to_key_value(row):
    # Turn one CSV line "rowkey,seg1,seg2,..." into (rowkey, [rowkey, family, qualifier, value]) tuples
    puids = row.split(",")
    result = []
    for num, puid in list(enumerate(puids))[1:]:
        if puid:
            val_tup = (puids[0], [puids[0], "sg", 'seg' + str(num) + 'value', str(puid)])
            result.append(val_tup)
            ids_tup = (puids[0], [puids[0], "sg", 'seg' + str(num) + 'id', str(num)])
            result.append(ids_tup)
    return result
data = sc.textFile(path_to_hdfs)
load_rdd = data.flatMap(lambda line: line.split("\n")).flatMap(csv_to_key_value).sortByKey(True)
load_rdd.saveAsNewAPIHadoopFile(path + str(sc.startTime),
                                "org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2",
                                conf=conf,
                                keyConverter=keyConv,
                                valueConverter=valueConv)
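To make the data shape concrete, here is what csv_to_key_value emits for one sample line (the sample row is made up for illustration; "sg" is the column family used above):

sample = "row1,abc,,xyz"
for kv in csv_to_key_value(sample):
    print(kv)
# ('row1', ['row1', 'sg', 'seg1value', 'abc'])
# ('row1', ['row1', 'sg', 'seg1id', '1'])
# ('row1', ['row1', 'sg', 'seg3value', 'xyz'])
# ('row1', ['row1', 'sg', 'seg3id', '3'])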
But I can't get past java.lang.ClassCastException: org.apache.hadoop.hbase.client.Put cannot be cast to org.apache.hadoop.hbase.Cell.
Has anyone run into this? I'm using PySpark 1.6.0 (CDH 5.9.0) with hbase-examples-1.2.0-cdh5.9.0.jar and spark-examples-1.6.0-cdh5.9.0-hadoop2.6.0-cdh5.9.0.jar.
P.S. Loading with Puts works fine!
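For reference, that working Put-based write looks roughly like this (a minimal sketch, assuming the usual TableOutputFormat setup with the same converters; host, table, keyConv, valueConv and load_rdd are the same as above):

put_conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
# Writes Puts directly to the table instead of producing HFiles
load_rdd.saveAsNewAPIHadoopDataset(conf=put_conf,
                                   keyConverter=keyConv,
                                   valueConverter=valueConv)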