
I recently noticed a difference between .distinct().count() and countDistinct(). For example:

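from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct

# Reuse the active SparkSession, or start a local one if none exists.
spark = SparkSession.builder.getOrCreate()

# Note that the row with id 3 has a null degree.
graduateProgram = spark.createDataFrame([
    (0, "Masters", "School of Information", "UC Berkeley"),
    (2, "Masters", "EECS", "UC Berkeley"),
    (3, None, "School of Information", "UC Berkeley"),
    (1, "Ph.D.", "EECS", "UC Berkeley")])\
  .toDF("id", "degree", "department", "school")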


graduateProgram.select("degree", "department").distinct().show()

whereas

graduateProgram.select(countDistinct("degree", "department")).show()

Why does this happen? Is it expected?
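
A likely explanation, sketched here in plain Python rather than Spark so it runs anywhere: countDistinct follows SQL COUNT(DISTINCT ...) semantics and skips any row where one of the listed columns is null, while distinct() treats null as an ordinary value. The helper names distinct_count and count_distinct are made up for this illustration and only model the two behaviours; they are not Spark APIs.

```python
# The four (degree, department) pairs from graduateProgram in the question.
rows = [
    ("Masters", "School of Information"),
    ("Masters", "EECS"),
    (None, "School of Information"),
    ("Ph.D.", "EECS"),
]


def distinct_count(rows):
    """Model of .distinct().count(): None is kept as an ordinary value."""
    return len(set(rows))


def count_distinct(rows):
    """Model of countDistinct(...): rows with None in any column are skipped."""
    return len({r for r in rows if all(v is not None for v in r)})


print(distinct_count(rows))  # 4
print(count_distinct(rows))  # 3
```

If that is the cause, the (None, "School of Information") row accounts for the gap of one, and .distinct().count() is the variant to use when null combinations should be counted.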

Python Spark: difference between .distinct().count() and countDistinct()

0 answers: