Question

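I'm working on a Hadoop platform and trying out the Spark Streaming API. I'm reading a file stream and computing word counts every x seconds (a cumulative total over the whole history). Now I want to print the top-k words to a file. This is what I'm trying to do: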

# sort the dstream for current batch
sorted_counts = counts.transform(lambda rdd: rdd.sortBy(lambda x: x[1], ascending=False))

# get the top K values of each rdd from the transformed dstream
topK = sorted_counts.transform(lambda rdd: rdd.take(k))

I can print the output to the console/log file using:

sorted_counts.pprint(k)

But the problem comes when I try to write it to a file using:

topK.saveAsTextFiles(out_path)

or even when I just try to print topK to the console:

topK.pprint()

I get the following error:

AttributeError: 'list' object has no attribute '_jrdd'

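I assume this is because rdd.take(k) returns an actual list rather than an RDD. How can I fix this? Also, I want a separate file for each newly computed word count, i.e. a new output file every x seconds (which saveAsTextFiles() should guarantee). I'm programming in Python, if that helps. Thanks!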

Answer 1

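There doesn't seem to be an API that does this directly, but you can work around it. On the sorted RDD, attach an index to each element, filter on the index, and then drop the index again (with your descending sort, the top k elements are the ones whose index is below k):

rdd.zipWithIndex().filter(lambda pair: pair[1] < k).map(lambda pair: pair[0])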

Another solution (without sorting) is:

sc.parallelize(rdd.top(...))

This way you don't need to sort the whole RDD; you just fetch the largest elements and create a new RDD from them.
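As a rough sketch of how that second workaround could be wired into the streaming job from the question: the function passed to transform() has to return an RDD, so the list that rdd.top() gives back is wrapped with parallelize(). The helper names top_k_as_rdd and attach_top_k are made up for illustration, and the sketch assumes counts is a DStream of (word, count) pairs. It uses rdd.context instead of a global sc so that the lambda does not have to capture the SparkContext.

```python
def top_k_as_rdd(rdd, k):
    """Return a new RDD holding the k (word, count) pairs with the largest counts."""
    # rdd.top() returns a plain Python list on the driver, ordered from the
    # largest count to the smallest; only the k largest pairs are collected.
    top_pairs = rdd.top(k, key=lambda pair: pair[1])
    # transform() expects an RDD back, so turn the list into an RDD again.
    # A single partition is assumed here so each batch yields one part file.
    return rdd.context.parallelize(top_pairs, 1)


def attach_top_k(counts, k, out_path):
    """Hypothetical wiring: counts is assumed to be a DStream of (word, count)."""
    top_k = counts.transform(lambda rdd: top_k_as_rdd(rdd, k))
    top_k.pprint(k)                  # print the top k of each batch to the console
    top_k.saveAsTextFiles(out_path)  # writes one output directory per batch interval
    return top_k
```

With this in place, something like attach_top_k(counts, k, out_path) before ssc.start() should replace the topK line from the question; saveAsTextFiles() names each batch's output after out_path plus the batch time, which gives the separate output every x seconds that was asked for.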

Spark Streaming: printing the top-k results of a DStream
