Why does saveAsNewAPIHadoopDataset not work with PySpark, Spark Streaming and HBase, and not return any error?

Asked: 2019-04-16 08:43:22

Tags: python apache-spark pyspark hbase spark-streaming

I am trying to set up real-time ingestion of Kafka data into HBase via PySpark, following this tutorial. I have a problem with the code shown below. When I run it, I get output like this:

APPNAME:Kafka_MapR-Streams_to_HBase
APPID:local-1553526448
VERSION:2.4.0
=====Pull from Stream=====
saveAsNewAPIHadoopDataset

The data I am streaming looks like this:

3926426402421,OCT 23 10:23:39 {nat}[FWNAT]: STH 129.15.90.22:1404 [34.62.15.31:086] -> 170.14.183.168:63 UDP

I don't get any errors or other messages, but no data is added to the HBase table (created with create "logs","log"), and the line print('saveAsNewAPIHadoopDataset2') is never executed. I set batchDuration to 1 second, and when I print print(len(rdd.collect())), for example, everything seems fine: I get values like 400000. Any ideas?
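For reference, this is a minimal standalone sketch of what the rec.split(",") step in the code below does with one of these records (assuming the whole line arrives as the Kafka message value):

# Quick check of the split step on the sample record shown above
rec = '3926426402421,OCT 23 10:23:39 {nat}[FWNAT]: STH 129.15.90.22:1404 [34.62.15.31:086] -> 170.14.183.168:63 UDP'
fields = rec.split(",")
print(len(fields))  # 2, so the len(x) == 2 filter passes
print(fields[0])    # '3926426402421'              -> log_id
print(fields[1])    # 'OCT 23 10:23:39 ... UDP'    -> log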

def SaveToHBase(rdd):

    print("=====Pull from Stream=====")

    if not rdd.isEmpty():

        host = 'myhost'  
        table = 'logs'  
        keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"  
        valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"  
        conf = {"hbase.zookeeper.quorum": host,
            "hbase.mapred.outputtable": table,
            "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",  
            "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",  
            "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"} 

        print('saveAsNewAPIHadoopDataset1')
        rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)
        print('saveAsNewAPIHadoopDataset2')


parsed = kds.filter(lambda x: x is not None and len(x) > 0)
parsed = parsed.map(lambda x: x[1])
parsed = parsed.map(lambda rec: rec.split(","))
parsed = parsed.filter(lambda x: x is not None and len(x) == 2)
parsed = parsed.map(lambda data: Row(log_id=getValue(str, data[0]),
                                     log=getValue(str, data[1])))
parsed = kds.filter(lambda x: x is not None and len(x) > 0)



parsed.foreachRDD(SaveToHBase)
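For comparison, the converter classes named in conf come from Spark's own examples, and the hbase_outputformat.py example drives saveAsNewAPIHadoopDataset roughly as sketched below. The host, table and qualifier values here are placeholders, not my real setup; the key point is that StringListToPutConverter expects each RDD element shaped as (row_key, [row_key, column_family, qualifier, value]):

# Minimal sketch modeled on Spark's hbase_outputformat.py example;
# host/table/qualifier are placeholder values.
from pyspark import SparkContext

sc = SparkContext(appName="HBaseOutputFormatSketch")
host = 'myhost'
table = 'logs'
conf = {"hbase.zookeeper.quorum": host,
        "hbase.mapred.outputtable": table,
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable"}
keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

# Each element: (row_key, [row_key, column_family, qualifier, value])
rdd = sc.parallelize([
    ("3926426402421", ["3926426402421", "log", "log", "sample log message"])
])
rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)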

0 Answers:

There are no answers yet.