Appending a Spark DStream to a single file in Python

Asked: 2016-06-16 16:20:46

Tags: python apache-spark pyspark

I have a program that reads from Kafka and prints the output in Spark. I need to append this output to a single file. My code currently writes to a folder, where Spark produces multiple files, and I then have another utility that aggregates the results from those files.

Is there a simple way to append the data from a DStream's multiple RDDs to the same file? Or can I combine all of the DStream's RDDs into one and stream/append that to a file?

    import sys
    import json

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    conf = SparkConf() \
         .setAppName("PySpark Cassandra Test") \
         .setMaster("spark://host:7077") \
         .set("spark.rpc.netty.dispatcher.numThreads", "2")

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 20)

    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    parsed = kvs.map(lambda kv: json.loads(kv[1]))  # kv is a (key, value) pair from Kafka
    mapped = parsed.map(lambda event: (event['test'], 1))
    reduced = mapped.reduceByKey(lambda x, y: x + y)
    result = reduced.map(lambda x: {"test": x[0], "test2": x[1]})
    result.pprint()
    result.saveAsTextFiles("file:///test/hack")
    ssc.start()
    ssc.awaitTermination()
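
For context, `saveAsTextFiles` writes one output directory per batch interval, and each directory contains one part file per partition. A minimal sketch, assuming the same `result` DStream as above, that at least collapses each batch into a single part file:

    # Sketch only: force each batch into a single partition before saving,
    # so every per-batch directory holds just one part-00000 file.
    # This still yields one directory per batch interval, not one file overall.
    single_part = result.repartition(1)
    single_part.saveAsTextFiles("file:///test/hack")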

1 answer:

Answer 0 (score: 0)

I was able to do this using foreachRDD:

    def tpprint(val, num=10):
        """
        Print the first num elements of each RDD generated in this DStream
        and append them to a single local file.
        @param num: the number of elements from the start of each RDD to print.
        """
        def takeAndPrint(time, rdd):
            taken = rdd.take(num + 1)
            print("########################")
            print("Time: %s" % time)
            print("########################")
            # Open the file once per batch and append one record per line.
            with open("/home/ubuntu/spark-1.4.1/test.txt", "a") as myfile:
                for record in taken[:num]:
                    print(record)
                    myfile.write(str(record) + "\n")
            if len(taken) > num:
                print("...")
            print("")

        val.foreachRDD(takeAndPrint)

Call it like this:

    result = reduced.map(lambda x: {"feddback_id": x[0], "pageviews": x[1]})
    tpprint(result)
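
If every record of every batch (not just the first `num`) should end up in the file, a variant of the same foreachRDD approach could collect each batch on the driver and append it in one write. A minimal sketch with a hypothetical helper `append_to_file`; the path `/home/ubuntu/output.txt` is just a placeholder:

    def append_to_file(dstream, path="/home/ubuntu/output.txt"):
        """Append every record of every batch of `dstream` to one local file."""
        def save(time, rdd):
            records = rdd.collect()  # pulls the whole batch to the driver; fine for small batches
            if records:
                with open(path, "a") as out:
                    out.write("Time: %s\n" % time)
                    for record in records:
                        out.write(str(record) + "\n")
        dstream.foreachRDD(save)

    append_to_file(result)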