unit-testing - What is a good way to unit test Spark Streaming in Python?

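I am trying to test Spark Streaming functionality and want to verify that the DStreams contain what I expect. How can I access a DStream to inspect its contents? Is there a way to serialize DStreams or convert them to another data type, such as an array or a dictionary?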

Here is my initial code, assuming I have a ./tmp folder to which a file containing 3 lines of text is added every second:

import unittest

from pyspark import SparkContext
from pyspark.streaming import StreamingContext


class StreamingMethodTest(unittest.TestCase):

    def test_streaming_filesystem(self):
        file_dir = "./tmp"

        def test_fun(rdd):
            # check if "count" equals 3
            rdd.foreach(lambda record: self.assertEqual(record[1], 3))

        sc = SparkContext(appName="streaming test")
        ssc = StreamingContext(sc, 1)  # 1-second batch interval

        lines = ssc.textFileStream(file_dir)
        counts = lines.map(lambda line: ("count", 1)).reduceByKey(lambda a, b: a + b)
        counts.foreachRDD(test_fun)
        ssc.start()
        ssc.awaitTermination(10)  # terminate after 10 seconds


0 answers: