Question

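I want to get customer counts for every combination of columns in a Spark DataFrame.

For example, suppose my DataFrame has 5 columns:

id, col1, col2, col3, cust_id

I need the customer count for all combinations: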

    id, col1, count(cust_id)
    id, col1, col2, count(cust_id)
    id, col1, col3, count(cust_id)
    id, col1, col2, col3, count(cust_id)
    id, col2, count(cust_id)
    id, col2, col3, count(cust_id)

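and so on, for all permutations and combinations.

It is hard to pass every combination to the DataFrame's groupBy function one at a time and then aggregate the customer count for each.

Is there a way to achieve this and then combine all the results into a single DataFrame, so that we can write the result to one output file?

This looks a bit complicated to me, and I would really appreciate any solution. Please let me know if more details are needed.

Thanks a lot.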

Answer 1

This is possible, and it is called cube:

    import org.apache.spark.sql.functions.count
    df.cube("id", "col1", "col2", "col3").agg(count("cust_id"))
      .na.drop(minNonNulls = 3)  // To exclude some combinations: keeps rows with at least 3 non-null values

The SQL version also provides GROUPING SETS, which is more efficient than .na.drop.
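
As a rough sketch of the GROUPING SETS route, the combinations from the question can be generated in plain Scala and turned into a query string. The table name `customers`, the alias `cust_count`, the helper names, and the choice to keep `id` in every grouping set are assumptions made for illustration; they are not part of the original answer.

```scala
// Sketch only: the table name `customers` and the choice to keep `id` in
// every grouping set are assumptions taken from the question's examples.
object GroupingSetsSketch {
  // Every non-empty combination of `optional`, each prefixed with `fixed`.
  def groupingSets(fixed: Seq[String], optional: Seq[String]): Seq[Seq[String]] =
    (1 to optional.size)
      .flatMap(n => optional.combinations(n))
      .map(combo => fixed ++ combo)

  // Builds a GROUP BY ... GROUPING SETS query counting cust_id per set.
  def query(table: String, fixed: Seq[String], optional: Seq[String]): String = {
    val all = (fixed ++ optional).mkString(", ")
    val sets = groupingSets(fixed, optional)
      .map(_.mkString("(", ", ", ")"))
      .mkString(", ")
    s"SELECT $all, COUNT(cust_id) AS cust_count FROM $table " +
      s"GROUP BY $all GROUPING SETS ($sets)"
  }

  def main(args: Array[String]): Unit = {
    val sql = query("customers", Seq("id"), Seq("col1", "col2", "col3"))
    println(sql)
    // With a SparkSession `spark` and df.createOrReplaceTempView("customers"),
    // the result could then be obtained with spark.sql(sql).
  }
}
```

Because only the wanted grouping sets are computed, no rows have to be filtered out afterwards. The resulting single DataFrame could then be written out with the usual DataFrame writer, for example as CSV.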
