Question

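I am using Spark 1.5.2 with Python 2.7 on Ubuntu.

According to the documentation for countByValue and countByValueAndWindow (Transformations on DStreams, Window Operations):

countByValue: When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.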

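countByValueAndWindow: When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.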
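
As a point of comparison, the per-batch result that the documentation describes can be sketched in plain Python, without Spark. This is only an illustration of the documented semantics; the helper name count_by_value and the sample batches are made up for the example.

```python
from collections import Counter


def count_by_value(batch):
    """Return sorted (value, frequency) pairs for one batch of elements.

    This mimics what the documentation says countByValue yields for a
    single RDD of the stream; it does not touch Spark at all.
    """
    return sorted(Counter(batch).items())


# Three batches standing in for three RDDs of a DStream (illustrative data).
batches = [["a"], ["b", "b"], ["a", "b"]]
per_batch = [count_by_value(b) for b in batches]

for pairs in per_batch:
    print(pairs)
```

Each printed line is a list of (value, count) pairs, which is the shape the question expects from the streaming API.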

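So basically both functions should return a list of (K, Long) pairs, right?

However, when I ran some experiments, the result was a list of integers, not pairs!

What's more, the official PySpark test code on GitHub agrees: Link1 Link2

You can see that the "expected result" there is a list of integers! It looks to me like it is counting the number of distinct elements and putting those counts together.

I thought I had somehow misunderstood the documentation, until I saw the Scala test code on GitHub: Link1 Link2

The test cases are similar, but this time the result is a sequence of pairs!

To sum up, the documentation and the Scala test cases say the result is pairs, but the Python test cases and my own experiments show the result is integers.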

I am new to PySpark and Spark Streaming. Can someone explain this inconsistency? For now I am using reduceByKey and reduceByKeyAndWindow as a workaround.
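
A minimal sketch of that workaround, assuming the input is a PySpark DStream of hashable values. The two function names are hypothetical helpers, not part of PySpark; only map, filter, reduceByKey and reduceByKeyAndWindow are real DStream methods. The inverse-function form of reduceByKeyAndWindow used here needs checkpointing to be enabled.

```python
from operator import add, sub


def count_by_value_pairs(dstream):
    """Per-batch (value, count) pairs, as the documentation describes.

    Hypothetical helper: maps each element to (element, 1) and sums by key.
    """
    return dstream.map(lambda x: (x, 1)).reduceByKey(add)


def count_by_value_and_window_pairs(dstream, window_duration, slide_duration):
    """Windowed (value, count) pairs.

    Hypothetical helper: uses add as the reduce function and sub as its
    inverse, so counts are updated incrementally as the window slides.
    Checkpointing must be enabled on the context for this form.
    """
    counted = dstream.map(lambda x: (x, 1)).reduceByKeyAndWindow(
        add, sub, window_duration, slide_duration)
    # Values that have slid out of the window are left with a count of 0;
    # drop them so only values still present in the window are reported.
    return counted.filter(lambda kv: kv[1] > 0)
```

With a stream such as the one created by ssc.queueStream(...), calling count_by_value_and_window_pairs(stream, 2, 1) and printing each RDD should show (value, count) tuples rather than bare integers.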

References:

PySpark streaming documentation about countByValue
PySpark streaming documentation about countByValueAndWindow
Dpark test cases of countByValueAndWindow
An example using countByValue in PySpark (not streaming)

Update

This bug is scheduled to be fixed in PySpark 2.0.0.

Answer 1

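I agree, there is a bug in countByValueAndWindow: it should return counts by value, not just the counts without the values. Even if you run the same test case as the Scala version (link) in Python, you can see that the PySpark version of this function returns only the counts, not the values they belong to (that is, pairs):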

>>> input = [['a'], ['b', 'b'], ['a', 'b']]
>>> from pyspark.streaming import StreamingContext
>>> ssc = StreamingContext(sc, 1)
>>> input = [sc.parallelize(d, 1) for d in input]
>>> input_stream = ssc.queueStream(input)
>>> input_stream2 = input_stream.countByValueAndWindow(2, 1)
>>> def f(rdd):
...     rows = rdd.collect()
...     for r in rows:
...         print r
... 
>>> input_stream2.foreachRDD(f)
>>> 
>>> sc.setCheckpointDir('/home/xxxx/checkpointdir')
>>> ssc.start() 
>>> 1
2
2
2
0

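You should raise this as a bug in Jira (link); it should be easy to fix. I can't see how anyone could use this function in its current form, because the returned numbers are meaningless without their keys.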
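
The numbers printed in the session above (1, 2, 2, 2, 0) can be reproduced without Spark by sliding a two-batch window over the same input and counting distinct values per window. The sketch below is a plain-Python simulation, not PySpark's implementation; windowed_counts is a hypothetical helper, and window and slide are measured in batches rather than seconds.

```python
from collections import Counter


def windowed_counts(batches, window, slide=1):
    """Return one Counter per sliding window over a list of batches.

    Windows keep being emitted until the last batch has slid out,
    so the final Counter is empty.
    """
    results = []
    for end in range(0, len(batches) + window, slide):
        start = max(0, end - window + 1)
        counts = Counter()
        for batch in batches[start:end + 1]:
            counts.update(batch)
        results.append(counts)
    return results


batches = [["a"], ["b", "b"], ["a", "b"]]
windows = windowed_counts(batches, window=2)

# What the documentation describes: (value, count) pairs per window.
expected_pairs = [sorted(c.items()) for c in windows]
# What the session printed: one integer per window.
distinct_counts = [len(c) for c in windows]

print(expected_pairs)
print(distinct_counts)  # [1, 2, 2, 2, 0]
```

The second list matches the session output, which supports the reading in the question that the Python version reports the number of distinct values in each window instead of the pairs themselves.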
