Question

I am using the following code to aggregate students per year. The goal is to find the total number of students for each year.

from pyspark.sql.functions import col
import pyspark.sql.functions as fn
gr = Df2.groupby(['Year'])
df_grouped = gr.agg(fn.count(col('Student_ID')).alias('total_student_by_year'))

The result is:

[students by year][1]

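The problem is that many IDs are repeated, so the result is wrong and far too large.

I want to group the students by year and count the total number of students per year, counting each repeated ID only once.

I hope the question is clear. I am a new member. Thanks.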

Answer 1

Use the countDistinct function:

from pyspark.sql.functions import countDistinct
x = [("2001","id1"),("2002","id1"),("2002","id1"),("2001","id1"),("2001","id2"),("2001","id2"),("2002","id2")]
y = spark.createDataFrame(x,["year","id"])

gr = y.groupBy("year").agg(countDistinct("id"))
gr.show()

Output:

+----+------------------+
|year|count(DISTINCT id)|
+----+------------------+
|2002|                 2|
|2001|                 2|
+----+------------------+

Answer 2

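You can also do it this way:

# y is the DataFrame created in Answer 1
y.groupBy("year", "id").count().groupBy("year").count()

This query returns the number of unique students for each year.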
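
To see why grouping twice yields a distinct count, the same two steps can be traced in plain Python, without Spark. This is only an illustration of the logic, reusing the sample rows from Answer 1; the variable names are hypothetical.

```python
from collections import Counter

# Same sample rows as in Answer 1: (year, id) pairs with duplicates.
x = [("2001", "id1"), ("2002", "id1"), ("2002", "id1"), ("2001", "id1"),
     ("2001", "id2"), ("2001", "id2"), ("2002", "id2")]

# Step 1: grouping by (year, id) collapses duplicates to one entry per pair.
unique_pairs = set(x)

# Step 2: grouping those unique pairs by year counts distinct ids per year.
students_per_year = Counter(year for year, _ in unique_pairs)

print(dict(students_per_year))
```

The first grouping plays the role of a de-duplication step, so the second count only ever sees each student once per year.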

Answer 3

countDistinct() and multiple aggregations are not supported in streaming.

How to count unique IDs after groupBy in pyspark
