Python Spark: Difference between .distinct().count() and countDistinct()

Date: 2018-05-17 11:16:36

Tags: apache-spark pyspark

I recently noticed that the two give different results. For example:

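from pyspark.sql import SparkSession
from pyspark.sql.functions import count, countDistinct

# Reuse the active session (or create one) so the snippet runs standalone.
spark = SparkSession.builder.getOrCreate()

# Note the row with id 3: its degree is None (null).
graduateProgram = spark.createDataFrame([
    (0, "Masters", "School of Information", "UC Berkeley"),
    (2, "Masters", "EECS", "UC Berkeley"),
    (3, None, "School of Information", "UC Berkeley"),
    (1, "Ph.D.", "EECS", "UC Berkeley")])\
  .toDF("id", "degree", "department", "school")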


graduateProgram.select("degree", "department").distinct().show()

returns

[image: output table]

whereas

graduateProgram.select(countDistinct("degree", "department")).show()

returns

[image: output table]

Why does this happen? Is it expected?
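
One likely explanation, sketched in plain Python rather than Spark: as far as I can tell, countDistinct follows SQL COUNT(DISTINCT ...) semantics and skips any row in which one of the listed columns is null, whereas distinct() treats null as an ordinary value. The helper names distinct_count and count_distinct below are hypothetical and only model that assumed behavior; they are not Spark APIs.

```python
# Rows mirror the (degree, department) columns from the example above.
rows = [
    ("Masters", "School of Information"),
    ("Masters", "EECS"),
    (None, "School of Information"),  # degree is null for id 3
    ("Ph.D.", "EECS"),
]


def distinct_count(rows):
    """Model of df.distinct().count(): null is a value like any other."""
    return len(set(rows))


def count_distinct(rows):
    """Model of countDistinct(*cols): rows with a null in any column are skipped."""
    return len({r for r in rows if all(v is not None for v in r)})


print(distinct_count(rows))  # 4
print(count_distinct(rows))  # 3
```

If the row with the null degree should be counted as well, one option is to count the rows of the distinct() result instead, e.g. graduateProgram.select("degree", "department").distinct().count().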

0 answers:

No answers yet