I'm trying to set up real-time Kafka data ingestion into HBase via PySpark, following this tutorial. I have a question about the code shown below. When I run it, I get this output:
APPNAME:Kafka_MapR-Streams_to_HBase
APPID:local-1553526448
VERSION:2.4.0
=====Pull from Stream=====
saveAsNewAPIHadoopDataset
I'm streaming data in this format:
3926426402421,OCT 23 10:23:39 {nat}[FWNAT]: STH 129.15.90.22:1404 [34.62.15.31:086] -> 170.14.183.168:63 UDP
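For what it's worth, splitting such a record on the comma yields exactly two fields (the id and the log body), which is what the len(x) == 2 filter in the code below relies on:

    rec = '3926426402421,OCT 23 10:23:39 {nat}[FWNAT]: STH 129.15.90.22:1404 [34.62.15.31:086] -> 170.14.183.168:63 UDP'
    parts = rec.split(",")
    # parts[0] == '3926426402421', parts[1] is the rest of the line
    print(len(parts))  # 2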
I don't get any errors, but no data is added to the HBase table (created with create "logs","log"), and the line print('saveAsNewAPIHadoopDataset2') never executes. I set batchDuration to 1 second. For example, when I print print(len(rdd.collect())), everything seems fine; I get values like 400000. Any ideas?
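For context, the stream itself is created roughly like this (a minimal sketch; the app name matches my output above, but the broker address and topic name are placeholders, not my exact setup):

    from pyspark import SparkContext
    from pyspark.sql import Row
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="Kafka_MapR-Streams_to_HBase")
    ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

    # kds is the (key, value) stream consumed below; broker/topic are placeholders
    kds = KafkaUtils.createDirectStream(
        ssc, ["mytopic"], {"metadata.broker.list": "broker:9092"})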
def SaveToHBase(rdd):
    print("=====Pull from Stream=====")
    if not rdd.isEmpty():
        host = 'myhost'
        table = 'logs'
        keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
        valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"
        conf = {"hbase.zookeeper.quorum": host,
                "hbase.mapred.outputtable": table,
                "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
                "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
                "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
        print('saveAsNewAPIHadoopDataset1')
        rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
        print('saveAsNewAPIHadoopDataset2')
parsed = kds.filter(lambda x: x is not None and len(x) > 0)
parsed = parsed.map(lambda x: x[1])
parsed = parsed.map(lambda rec: rec.split(","))
parsed = parsed.filter(lambda x: x is not None and len(x) == 2)
parsed = parsed.map(lambda data: Row(log_id=getValue(str, data[0]),
                                     log=getValue(str, data[1])))
parsed = kds.filter(lambda x: x is not None and len(x) > 0)
parsed.foreachRDD(SaveToHBase)
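As far as I can tell from Spark's hbase_outputformat.py example, the StringListToPutConverter pair expects each RDD element to be a (rowkey, [row, columnFamily, qualifier, value]) pair of strings, not a Row object. A minimal sketch of that shape against my table (the column family "log" matches my create statement; the qualifier "line" is just a placeholder I made up):

    def to_put_format(row):
        # (rowkey, [rowkey, column_family, qualifier, value]), all strings
        return (row.log_id, [row.log_id, "log", "line", row.log])

    # e.g. inside SaveToHBase, before the save:
    # rdd.map(to_put_format).saveAsNewAPIHadoopDataset(
    #     conf=conf, keyConverter=keyConv, valueConverter=valueConv)

Is a mismatch like this the reason nothing is written, or is something else going on?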