I'm new to PySpark, so I suspect I'm missing something simple and obvious here, but I can't see it. I read the contents of a Parquet file into a Spark DataFrame, then group by a binary column and count the rows. I get different numbers every time. Is this a data synchronization issue between Spark nodes, is it related to lazy execution, or am I just missing some fundamental Spark principle? These results have me thoroughly confused.
df = spark.read.parquet(input_file)
df = df.limit(2000)
print(df.count())
print(df.groupBy('STATUS').count().collect())
print(df.groupBy('STATUS').count().collect())
print(df.groupBy('STATUS').count().collect())
>>> 2000
>>> [Row(STATUS=0, count=1613), Row(STATUS=1, count=387)]
>>> [Row(STATUS=0, count=1528), Row(STATUS=1, count=472)]
>>> [Row(STATUS=0, count=1646), Row(STATUS=1, count=354)]
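In case it's relevant, the lazy-execution angle is what I was thinking of ruling out first. Here is a sketch of what I had in mind (I'm assuming that cache() plus an action would pin the same 2000 rows for the later counts, but I'm not certain that's the right fix):
df = spark.read.parquet(input_file)
# Sketch: materialize the limited DataFrame once and reuse it,
# to check whether the varying counts come from limit() being re-evaluated on each action.
df_fixed = df.limit(2000).cache()
df_fixed.count()  # action to populate the cache
print(df_fixed.groupBy('STATUS').count().collect())
print(df_fixed.groupBy('STATUS').count().collect())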
Here is the df schema:
root
|-- GRP_ID: long (nullable = true)
|-- WEK_ID: long (nullable = true)
|-- WEK_BGN_DT: string (nullable = true)
|-- WEK_END_DT: string (nullable = true)
|-- FEATURES: vector (nullable = true)
|-- STATUS: long (nullable = true)
I should also note that if I convert the Spark DataFrame to pandas and count there, the results are consistent:
dfp = df.toPandas()
print(dfp['STATUS'][dfp['STATUS'] == 0].count())
print(dfp['STATUS'][dfp['STATUS'] == 1].count())
print(dfp['STATUS'][dfp['STATUS'] == 0].count())
print(dfp['STATUS'][dfp['STATUS'] == 1].count())
print(dfp['STATUS'][dfp['STATUS'] == 0].count())
print(dfp['STATUS'][dfp['STATUS'] == 1].count())
>>> 1494
>>> 506
>>> 1494
>>> 506
>>> 1494
>>> 506