I am using Spark (code in Python) to read data from an HBase table. The table holds about 200K records, but Spark only reads 138 of them. I have double-checked my code and still cannot find the problem. Here is the code snippet:
confRead = {
    "hbase.zookeeper.quorum": "196.0.0.2:2181",
    "hbase.mapreduce.inputtable": "testTable",
    "hbase.mapreduce.scan.columns": "cf1:raw64"
}
# converters shipped with the Spark Python examples for the HBase key/value types
keyConvRead = "org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter"
valueConvRead = "org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter"
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConvRead,
    valueConverter=valueConvRead,
    conf=confRead)
hbaseRDD = hbase_rdd.map(download_Analize)
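For reference, the 138-vs-200K mismatch was checked roughly along these lines (a minimal sketch; the exact verification step is not part of the snippet above):

# Rough count check (not in the snippet above): compare what Spark sees
# against what the HBase shell reports for the same table.
print(hbase_rdd.count())    # prints 138 here instead of the expected ~200K
# In the HBase shell, for comparison:
#   count 'testTable'       # reports roughly 200K rows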
I should mention that the raw64 column is really large, around 1M-20M bytes per cell, and it is base64-encoded. There are also a few other small columns in the same column family cf1; in fact, this is the only column family I have. Thanks in advance!
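One check I can run (a rough sketch, where cf1:small stands in for one of the hypothetical small columns in the same family) is to point the scan at a small column only and see whether the count changes; as far as I know, hbase.mapreduce.scan.columns accepts a space-separated list of family:qualifier pairs, so a single small column can be named here:

confReadSmall = {
    "hbase.zookeeper.quorum": "196.0.0.2:2181",
    "hbase.mapreduce.inputtable": "testTable",
    # cf1:small is a placeholder for one of the small columns in cf1,
    # used only for this diagnostic scan
    "hbase.mapreduce.scan.columns": "cf1:small"
}
small_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter=keyConvRead,
    valueConverter=valueConvRead,
    conf=confReadSmall)
print(small_rdd.count())    # does this get closer to 200K than 138?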