I have a program that reads from Kafka and prints the output in Spark. I need to append this output to a single file. My code writes to a folder, where Spark produces multiple files, and I then run another utility to aggregate the results from those files.
Is there a simple way to append the data from the multiple RDDs of a DStream to the same file? Or can I combine all of the DStream's RDDs into a single DStream and stream/append that to a file?
import sys
import json

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = SparkConf() \
    .setAppName("PySpark Cassandra Test") \
    .setMaster("spark://host:7077") \
    .set("spark.rpc.netty.dispatcher.numThreads", "2")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 20)  # 20-second batch interval

zkQuorum, topic = sys.argv[1:]
kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

parsed = kvs.map(lambda (k, v): json.loads(v))  # Python 2 tuple-unpacking lambda
mapped = parsed.map(lambda event: (event['test'], 1))
reduced = mapped.reduceByKey(lambda x, y: x + y)
result = reduced.map(lambda x: {"test": x[0], "test2": x[1]})

result.pprint()
result.saveAsTextFiles("file:///test/hack")  # writes a new directory of part files per batch

ssc.start()
ssc.awaitTermination()
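For reference, a minimal sketch of one thing I considered (not in my code above): collapsing each batch to a single partition before saving. As far as I understand, this only reduces each batch to one part file; saveAsTextFiles still creates a new output directory per batch interval, so it does not give me a single appended file.

# Sketch only: one part-00000 per batch, but still one "hack-<timestamp>"
# directory per batch interval rather than a single appended file.
result.repartition(1).saveAsTextFiles("file:///test/hack")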
Answer 0 (score: 0)
I was able to do this with foreachRDD:

def tpprint(val, num=10):
    """
    Print the first num elements of each RDD generated in this DStream
    and append them to a single local file.

    @param num: the number of elements from the start of each RDD to print.
    """
    def takeAndPrint(time, rdd):
        taken = rdd.take(num + 1)
        print("########################")
        print("Time: %s" % time)
        print("########################")
        for record in taken[:num]:
            print(record)
            with open("/home/ubuntu/spark-1.4.1/test.txt", "a") as myfile:
                myfile.write(str(record) + "\n")  # newline so records don't run together
        if len(taken) > num:
            print("...")
        print("")

    val.foreachRDD(takeAndPrint)
Call it like this:

result = reduced.map(lambda x: {"feddback_id": x[0], "pageviews": x[1]})
tpprint(result)
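A hedged variation on the same idea (my own sketch, not part of the answer above): if the goal is to append every record of each batch rather than only the first few, each RDD can be collected on the driver and written in one pass. The helper name and output path here are placeholders.

def append_to_file(dstream, path="/home/ubuntu/spark-1.4.1/test.txt"):
    """Append every record of each RDD in the DStream to one local file."""
    def write_rdd(time, rdd):
        # collect() pulls the whole batch to the driver, so this is only
        # suitable for small per-batch volumes.
        with open(path, "a") as out:
            for record in rdd.collect():
                out.write(str(record) + "\n")
    dstream.foreachRDD(write_rdd)

append_to_file(result)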