I recently noticed a difference between distinct() and countDistinct(). For example:
from pyspark.sql.functions import count, countDistinct
graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(3, None, "School of Information", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley")])\
.toDF("id", "degree", "department", "school")
graduateProgram.select("degree", "department").distinct().show()
returns four rows, including the row whose degree is null,
whereas
graduateProgram.select(countDistinct("degree", "department")).show()
returns 3.
Why is that? Is this expected behavior?
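One way to frame the discrepancy is as a plain-Python model that needs no Spark session. This is a minimal sketch, assuming that countDistinct follows SQL COUNT(DISTINCT ...) semantics and skips any row in which one of the listed columns is null, while distinct() treats null as an ordinary value. The helper names distinct_rows and count_distinct are hypothetical and are not part of the PySpark API.

```python
# Plain-Python model of the two operations; None stands in for a Spark null.
# Only the (degree, department) columns from the example are kept.
rows = [
    ("Masters", "School of Information"),
    ("Masters", "EECS"),
    (None, "School of Information"),
    ("Ph.D.", "EECS"),
]


def distinct_rows(rows):
    """Hypothetical model of DataFrame.distinct(): None counts as a value."""
    return set(rows)


def count_distinct(rows):
    """Hypothetical model of countDistinct(...).

    Assumption: a row is skipped when any of its columns is None,
    as in SQL COUNT(DISTINCT col1, col2).
    """
    return len({r for r in rows if all(v is not None for v in r)})


n_distinct = len(distinct_rows(rows))
n_count_distinct = count_distinct(rows)
print(n_distinct, n_count_distinct)
```

Under that assumption the model gives two different counts for the same data, and the only row responsible for the gap is the one with a null degree.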