Question

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count? Is the difference the same as the one between select count(distinct member_id) and select distinct count(member_id)?

Answer 1

Why do I get different outputs for ..agg(countDistinct("member_id") as "count") and ..distinct.count?

Because .distinct.count is equivalent to:

SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM table)

while ..agg(countDistinct("member_id") as "count") is equivalent to:

SELECT COUNT(DISTINCT member_id) FROM table

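COUNT(*) and COUNT(column) follow different rules when nulls are encountered: COUNT(*) counts every row, including a row whose member_id is null, whereas COUNT(column) and COUNT(DISTINCT column) skip nulls. If member_id contains nulls, the first query therefore returns one more than the second.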
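The null handling can be checked with a small sketch. It runs the two SQL statements above through Python's built-in sqlite3 module, used here as a stand-in for Spark SQL (an assumption, chosen so the example runs without a Spark installation). The members table and its rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Spark SQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE members (member_id INTEGER)")
# Hypothetical data: two distinct non-null ids plus one null.
conn.executemany(
    "INSERT INTO members VALUES (?)",
    [(1,), (1,), (2,), (None,)],
)

# SQL equivalent of .distinct.count: DISTINCT keeps the null as one row,
# and COUNT(*) counts that row.
distinct_then_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT member_id FROM members)"
).fetchone()[0]

# SQL equivalent of agg(countDistinct("member_id")): nulls are skipped.
count_distinct = conn.execute(
    "SELECT COUNT(DISTINCT member_id) FROM members"
).fetchone()[0]

print(distinct_then_count, count_distinct)  # prints: 3 2
conn.close()
```

With no nulls in member_id and only that one column selected, the two numbers would match.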

Answer 2

df.agg(countDistinct("member_id") as "count")

returns the number of distinct values in the member_id column, ignoring all other columns, whereas

df.distinct.count

counts the number of distinct records in the DataFrame, where two records are treated as the same only if they have identical values in all columns.

So, for example, the DataFrame:

+-----------+---------+
|member_name|member_id|
+-----------+---------+
|          a|        1|
|          b|        1|
|          b|        1|
+-----------+---------+

has only one distinct member_id value but two distinct records, so the agg option returns 1 while the latter returns 2.
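The same three rows can be checked in plain SQL. This is a minimal sketch that again uses Python's built-in sqlite3 module in place of Spark (an assumption). The table name member_rows is hypothetical, and SELECT DISTINCT over both columns plays the role of df.distinct.

```python
import sqlite3

# In-memory SQLite database standing in for the DataFrame (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE member_rows (member_name TEXT, member_id INTEGER)")
# The three rows from the example table above.
conn.executemany(
    "INSERT INTO member_rows VALUES (?, ?)",
    [("a", 1), ("b", 1), ("b", 1)],
)

# Counterpart of df.agg(countDistinct("member_id")): only member_id matters.
distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT member_id) FROM member_rows"
).fetchone()[0]

# Counterpart of df.distinct.count: whole rows are deduplicated first.
distinct_records = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT member_name, member_id FROM member_rows)"
).fetchone()[0]

print(distinct_ids, distinct_records)  # prints: 1 2
conn.close()
```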

Answer 3

The first command:

DF.agg(countDistinct("member_id") as "count")

returns the same result as select count(distinct member_id) from DF.

The second command:

DF.distinct.count

first takes the distinct records of DF, that is, it drops duplicate rows, and then counts what remains.

countDistinct and distinct.count
