Apache Spark count of records per group with nullable values

Time: 2017-12-27 14:58:38

Tags: java apache-spark

When I try to count the number of records in each group, I see that the group with null values is reported as having no records, but that is not correct.

Input DataFrame:

+--------+
|    Name|
+--------+
|  Andrei|
|  Andrei|
|    null|
|    null|
|Grigorii|
+--------+

Code:

Dataset<Row> df = inputDf.groupBy("Name")
            .agg(functions.count("Name").as("Name_count"));

Actual DataFrame:

+--------+----------+
|    Name|Name_count|
+--------+----------+
|    null|         0|
|  Andrei|         2|
|Grigorii|         1|
+--------+----------+

Expected DataFrame:

+--------+----------+
|    Name|Name_count|
+--------+----------+
|    null|         2|
|  Andrei|         2|
|Grigorii|         1|
+--------+----------+

1 answer:

Answer 0 (score: 0)

This works:

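// count("*") counts all rows in each group, including rows where Name is null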
Dataset<Row> storageFrame = leftDataset.groupBy("Name")
            .agg(functions.count("*").as("Name_count"));
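count(columnName) follows SQL semantics: it counts only the rows where that column is non-null, so the group whose key is null reports 0. count("*") counts every row in the group, which is why this version returns the expected 2 for the null group. As a minimal alternative sketch (assuming the question's inputDf is in scope), counting a literal behaves the same way:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// count(lit(1)) adds one per row, so rows whose Name is null are still counted,
// matching the behaviour of count("*").
Dataset<Row> counted = inputDf.groupBy("Name")
        .agg(functions.count(functions.lit(1)).as("Name_count"));
counted.show();

Either variant produces the expected DataFrame above, with Name_count = 2 for the null group.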