Question

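I need to aggregate the values of the column articleId into an array. This has to happen within the groups I create beforehand with groupBy.

My table looks like this: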

| customerId | articleId | articleText | ...
|    1       |     1     |   ...       | ...
|    1       |     2     |   ...       | ...
|    2       |     1     |   ...       | ...
|    2       |     2     |   ...       | ...
|    2       |     3     |   ...       | ...

I would like to build something like this:

| customerId |  articleIds |
|    1       |  [1, 2]     |
|    2       |  [1, 2, 3]  |

My code so far:

DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));

But this gives me an AnalysisException:

Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;

Can someone help me build the correct statement?

Answer 1

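In SQL, if you group by something, that "something" has to appear in the select statement as well. Your Spark SQL code may not be doing that.

There is a similar question, which I think solves your problem: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function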

Answer 2

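This can be done with the collect_list function, but in Spark 1.x it is only available if you use a HiveContext (from Spark 2.0 onward it is a built-in aggregate and no longer needs Hive support):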

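import org.apache.spark.sql.functions._

// Group rows by customer and collect each group's articleId values into one array column.
df.groupBy("customerId").agg(collect_list("articleId").as("articleIds"))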
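To make the per-group result concrete, here is a minimal plain-Java sketch of the same idea, with no Spark involved. It only illustrates what "collect the articleId values of each customerId group into a list" produces for the sample table in the question. The class name GroupCollectSketch and the method collectArticleIds are hypothetical names made up for this illustration, and each row is modelled as an int[] {customerId, articleId} pair, which is an assumption rather than anything from the Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class: plain Java collections, not the Spark API.
class GroupCollectSketch {

    // Groups {customerId, articleId} pairs by customerId and appends each
    // articleId to that customer's list, in the order the rows are seen.
    static Map<Integer, List<Integer>> collectArticleIds(List<int[]> rows) {
        Map<Integer, List<Integer>> grouped = new LinkedHashMap<>();
        for (int[] row : rows) {
            int customerId = row[0];
            int articleId = row[1];
            grouped.computeIfAbsent(customerId, k -> new ArrayList<>()).add(articleId);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The sample rows from the question: (customerId, articleId).
        List<int[]> rows = Arrays.asList(
            new int[] {1, 1}, new int[] {1, 2},
            new int[] {2, 1}, new int[] {2, 2}, new int[] {2, 3});

        // prints {1=[1, 2], 2=[1, 2, 3]}
        System.out.println(collectArticleIds(rows));
    }
}
```

One difference worth keeping in mind: this sketch preserves input order, whereas Spark's collect_list does not guarantee the order of elements within a group after a shuffle, so sort the array explicitly if the order matters.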

Spark SQL: Aggregating column values within groups