Question

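I have seen a question along similar lines, but it does not serve my purpose, so I am posting here in the hope of an answer. I have 2 PySpark dataframes:

Customer weekly spend on the product

First and last week of each product's sales

For each product, I need the customer's spend in the corresponding category, restricted to the weeks between that product's first and last week of sales. For example, product W appears only in week 2, so for W I need to consider only the customer's category spend from week 2.

I have been trying to wrap my head around this, but with no luck. Looking for suggestions.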

Answer 1

I think I've got it. This is more complicated than I originally thought, but I think this does what you want:

select t1.*, t2w.category_spend
from table1 t1 join
     -- total spend within each product's first-to-last sales week window
     (select t2.product, sum(t1.spend) as category_spend
      from table1 t1 join
           table2 t2
           on t1.week between t2.weekstart and t2.weekend
      group by t2.product
     ) t2w
     on t2w.product = t1.product;
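
As a quick sanity check, the range join can be run against a tiny in-memory SQLite database from Python. This is only a sketch: the rows are made up for illustration, the column names (customer, product, week, spend, weekstart, weekend) are inferred from the query rather than taken from the asker's dataframes, and the query is written with `between ... and ...` and the `t2w` alias. Note that, as written, the subquery sums every row of table1 that falls inside the product's week window.

```python
import sqlite3

# Hypothetical sample data; table and column names are inferred from the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (customer text, product text, week integer, spend integer);
    create table table2 (product text, weekstart integer, weekend integer);
    insert into table1 values
        ('A', 'X', 1, 10), ('A', 'W', 2, 20), ('B', 'X', 2, 5), ('B', 'X', 3, 7);
    insert into table2 values ('X', 1, 3), ('W', 2, 2);
""")

query = """
select t1.*, t2w.category_spend
from table1 t1 join
     (select t2.product, sum(t1.spend) as category_spend
      from table1 t1 join
           table2 t2
           on t1.week between t2.weekstart and t2.weekend
      group by t2.product
     ) t2w
     on t2w.product = t1.product
order by t1.customer, t1.week
"""

# Each original row, plus the total spend inside its product's week window.
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With this sample, product W is sold only in week 2, so its total covers only the two week-2 rows (20 + 5), while product X spans weeks 1 to 3 and picks up all four rows.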

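Edit:

Based on your comment, the logic is basically the same. The customer does not really change the problem, so it can be added "everywhere" (to the select list, the group by, and the join condition):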

select t1.*, t2w.category_spend
from table1 t1 join
     -- per-customer spend within each product's first-to-last sales week window
     (select t1.customer, t2.product, sum(t1.spend) as category_spend
      from table1 t1 join
           table2 t2
           on t1.week between t2.weekstart and t2.weekend
      group by t2.product, t1.customer
     ) t2w
     on t2w.product = t1.product and t2w.customer = t1.customer;
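
The per-customer version can be checked the same way. Again, this is a minimal sketch under assumptions: the sample rows are invented and the schema is guessed from the column names used in the query, with the week range written as `between ... and ...`.

```python
import sqlite3

# Hypothetical sample data; the schema is an assumption based on the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table table1 (customer text, product text, week integer, spend integer);
    create table table2 (product text, weekstart integer, weekend integer);
    insert into table1 values
        ('A', 'X', 1, 10), ('A', 'W', 2, 20), ('B', 'X', 2, 5), ('B', 'X', 3, 7);
    insert into table2 values ('X', 1, 3), ('W', 2, 2);
""")

per_customer_query = """
select t1.*, t2w.category_spend
from table1 t1 join
     (select t1.customer, t2.product, sum(t1.spend) as category_spend
      from table1 t1 join
           table2 t2
           on t1.week between t2.weekstart and t2.weekend
      group by t2.product, t1.customer
     ) t2w
     on t2w.product = t1.product and t2w.customer = t1.customer
order by t1.customer, t1.week
"""

# Each original row, plus that customer's spend inside the product's window.
per_customer_rows = conn.execute(per_customer_query).fetchall()
for row in per_customer_rows:
    print(row)
```

Here customer A's row for product W is paired with A's week-2 spend only (20), which matches the example in the question. In Spark, the same SQL text could presumably be run through `spark.sql` after registering both dataframes as temporary views; that step is not shown here.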

Conditional aggregation in PySpark based on the value of another column
