Question

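I have a DataFrame (testdf) and want to get the count and the distinct count of one column (memid), restricted to rows where another column (booking / rental) is not null and not blank (i.e. "").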

testdf:

memid   booking  rental
100        Y 
100
120        Y
100        Y       Y

Expected result (where the booking column is not null / not blank):

count(memid)  count(distinct memid)
      3                      2

In SQL:

Select count(memid), count(distinct memid) from mydf 
where booking is not null and booking!= ""

In PySpark:

mydf.filter("booking != ''").groupBy('booking').agg(count("memid"), countDistinct("memid"))

But I only want the overall counts, not counts grouped by ...

Answer 1

You can simply drop the groupBy and call agg directly on the filtered DataFrame, like this:

from pyspark.sql import functions as F
# Aggregate over the whole filtered DataFrame; no groupBy is needed.
mydf = mydf.filter("booking != ''").agg(F.count("memid"), F.countDistinct("memid"))