Cannot read Spark streaming data

Asked: 2016-01-29 06:34:28

Tags: python spark-streaming

I am trying to read streaming data with Spark's Python API and change the format of the streamed records, but it seems I cannot even read the stream...

Here are my steps:

  1. I open a terminal, cd into the input data folder, and type the following command:

    ls part-* | xargs -I % sh -c '{ cat %; sleep 5;}' | nc -lk 9999
    
  2. Then I open another terminal and type setenv SPARK_HOME /user/abc/Downloads/spark-1.5.2-bin-hadoop2.6/ so that I can run Spark locally. Then I type the command ${SPARK_HOME}/bin/spark-submit --master local /user/abc/test.py localhost 9999 to run my code.

  3. Below is the code. I am just testing whether I can read the streaming data and then change the data format, but it always shows this error:

    16/01/28 22:41:37 INFO ReceiverSupervisorImpl: Starting receiver
    16/01/28 22:41:37 INFO ReceiverSupervisorImpl: Called receiver onStart
    16/01/28 22:41:37 INFO ReceiverSupervisorImpl: Receiver started again
    16/01/28 22:41:37 INFO SocketReceiver: Connecting to localhost:9999
    16/01/28 22:41:37 INFO SocketReceiver: Connected to localhost:9999
    16/01/28 22:41:37 INFO SocketReceiver: Closed socket to localhost:9999
    16/01/28 22:41:37 WARN ReceiverSupervisorImpl: Restarting receiver with delay 2000 ms: Socket data stream had no more data

    If I re-run ls part-* | xargs -I % sh -c '{ cat %; sleep 5;}' | nc -lk 9999, it still shows the same error. Do you know how to fix this?

    import sys
    import re
    
    from pyspark import SparkContext
    from pyspark.sql.context import SQLContext
    from pyspark.sql import Row
    from pyspark.streaming import StreamingContext
    
    
    # Local SparkContext, a StreamingContext with a 5-second batch interval,
    # and a SQLContext (not used below yet, kept for later DataFrame work).
    sc = SparkContext(appName="test")
    ssc = StreamingContext(sc, 5)
    sqlContext = SQLContext(sc)


    def get_tuple(r):
        # Extract the comma-separated values between the first pair of
        # brackets, keep the first two fields as strings, and convert the
        # remaining fields to float.
        m = re.search(r'\[(.*?)\]', r)
        s = m.group(1)
        fs = s.split(',')
        for i in range(len(fs)):
            if i > 1:
                fs[i] = float(fs[i])
        return fs


    def main():
        # Read lines from the socket given on the command line (host, port),
        # parse each line into a list of fields, wrap it in a Row, and print
        # the first 10 records of every batch.
        indata = ssc.socketTextStream(sys.argv[1], int(sys.argv[2]))
        inrdd = indata.map(lambda r: get_tuple(r))
        Features = Row('feature_vec')
        features_rdd = inrdd.map(lambda r: Features(r))
        features_rdd.pprint(num=10)

        ssc.start()
        ssc.awaitTermination()

    if __name__ == "__main__":
        main()
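
For reference, the parsing step can be sanity-checked without Spark at all. Below is a minimal sketch of the same get_tuple logic run on a single hypothetical line; the actual layout of the part-* files is not shown in the question, so the sample value is only an assumption:

    import re

    def get_tuple(r):
        # Same parser as in test.py: keep the first two fields as strings,
        # convert the remaining fields to float.
        m = re.search(r'\[(.*?)\]', r)
        fs = m.group(1).split(',')
        for i in range(len(fs)):
            if i > 1:
                fs[i] = float(fs[i])
        return fs

    # Hypothetical sample line; the real records may differ.
    print(get_tuple('[id1,label0,1.5,2.25,3.0]'))
    # -> ['id1', 'label0', 1.5, 2.25, 3.0]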
    

1 Answer:

Answer 0 (score: 0)

Problem solved. The spark-submit command line should add [*] to the master for Spark Streaming, as follows:

${SPARK_HOME}/bin/spark-submit --master local[*] /user/abc/test.py localhost 9999

Then the output appears.
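
A likely explanation: a socket receiver permanently occupies one thread, and with --master local there is only a single thread, so nothing is left to actually process the received batches and the receiver keeps restarting. Any master with at least two threads works (for example local[2]). As a minimal sketch, the same setting can also be made inside the script instead of on the command line:

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    # Use all local cores: one thread is taken by the socket receiver,
    # the rest are free to process the received batches.
    conf = SparkConf().setAppName("test").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 5)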