Question

I have a Spark DataFrame like this:

event_name | id
---------------
hello      | 1
hello      | 2
hello      | 1
world      | 1
hello      | 3
world      | 2

I want to count the unique "id" values for a specific event, "hello". The SQL should look something like this:

SELECT event_name, COUNT(DISTINCT id) AS count
FROM table_name
WHERE event_name = "hello"
GROUP BY event_name

event_name | count
------------------
hello      | 3

So my query should return 3 for "hello" rather than 4, because two of the "hello" rows have an id of 1.

How can I do this with PySpark SQL?

Answer 1

This should do the trick:

from pyspark.sql import functions as F
df.groupBy("event_name").agg(F.countDistinct("id")).show()