Question

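I'm trying to use a window function to create a column that sums revenue for a specific year. For example, I need a column that holds each user's 2020 revenue.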

╔══════╦════════╦═════════╦═════════╗
║ year ║ userid ║ orderid ║ revenue ║
╠══════╬════════╬═════════╬═════════╣
║ 2019 ║      1 ║ a1      ║      50 ║
║ 2020 ║      1 ║ a2      ║     100 ║
║ 2020 ║      1 ║ a3      ║      50 ║
║ 2019 ║      2 ║ b1      ║     100 ║
║ 2020 ║      2 ║ b2      ║     100 ║
╚══════╩════════╩═════════╩═════════╝

I can get the same result with a subquery, but I'd like to know whether it can be done with a window function.

 select *, sum(revenue) over (partition by year, userid) as 2020_user_revenue
 from table

Current result:

╔══════╦════════╦═════════╦═════════╦═══════════════════╗
║ year ║ userid ║ orderid ║ revenue ║ 2020_user_revenue ║
╠══════╬════════╬═════════╬═════════╬═══════════════════╣
║ 2019 ║      1 ║ a1      ║      50 ║                50 ║
║ 2020 ║      1 ║ a2      ║     100 ║               150 ║
║ 2020 ║      1 ║ a3      ║      50 ║               150 ║
║ 2019 ║      2 ║ b1      ║     100 ║               100 ║
║ 2020 ║      2 ║ b2      ║     100 ║               100 ║
╚══════╩════════╩═════════╩═════════╩═══════════════════╝

Expected:

╔══════╦════════╦═════════╦═════════╦═══════════════════╗
║ year ║ userid ║ orderid ║ revenue ║ 2020_user_revenue ║
╠══════╬════════╬═════════╬═════════╬═══════════════════╣
║ 2019 ║      1 ║ a1      ║      50 ║               150 ║
║ 2020 ║      1 ║ a2      ║     100 ║               150 ║
║ 2020 ║      1 ║ a3      ║      50 ║               150 ║
║ 2019 ║      2 ║ b1      ║     100 ║               100 ║
║ 2020 ║      2 ║ b2      ║     100 ║               100 ║
╚══════╩════════╩═════════╩═════════╩═══════════════════╝

Answer 1

Could you try the script below?

import pandas as pd

df = pd.DataFrame([
    [1, 0, 2, 2],
    [1, 1, 0, 0],
    [0, 2, 3, 2],
    [2, 2, 1, 1]],
  columns=['col1', 'col2', 'col3', 'col4'])

# Copy every column except the last, then add a per-row threshold column
df1 = df.iloc[:, :-1].copy()
df1['threshold'] = 1

# True wherever a value is greater than its row's threshold
df2 = df1.drop(columns='threshold').gt(df1['threshold'], axis=0)
# Join the names of the columns that exceed the threshold
df2 = df2.apply(lambda x: ', '.join(x.index[x]), axis=1)

df['d'] = df2

print(df)

Output:

   col1  col2  col3  col4           d
0     1     0     2     2        col3
1     1     1     0     0            
2     0     2     3     2  col2, col3
3     2     2     1     1  col1, col2

Answer 2

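Just use a conditional aggregate inside the window function, partitioning by userid only so that every row of a user sees the same 2020 total: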

select t.*,
       sum(case when year = 2020 then revenue else 0
           end) over (partition by userid) as revenue_2020
from t;
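
To check the query above against the sample data from the question, here is a minimal sketch that runs it through Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or later (the release that added window functions); the table name `orders`, the helper `revenue_2020`, and the `order by` clause are hypothetical additions for illustration, not part of the original answer.

```python
import sqlite3

# Sample rows from the question: (year, userid, orderid, revenue)
rows = [
    (2019, 1, "a1", 50),
    (2020, 1, "a2", 100),
    (2020, 1, "a3", 50),
    (2019, 2, "b1", 100),
    (2020, 2, "b2", 100),
]


def revenue_2020(rows):
    """Return each row plus the user's total 2020 revenue (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table orders "
        "(year integer, userid integer, orderid text, revenue integer)"
    )
    conn.executemany("insert into orders values (?, ?, ?, ?)", rows)
    # Only 2020 rows contribute to the sum; partitioning by userid alone
    # spreads that total over all of the user's rows, including 2019 ones.
    result = conn.execute(
        """
        select t.*,
               sum(case when year = 2020 then revenue else 0 end)
                   over (partition by userid) as revenue_2020
        from orders t
        order by userid, orderid
        """
    ).fetchall()
    conn.close()
    return result


for row in revenue_2020(rows):
    print(row)
```

With the sample data, the last field should be 150 for every row of user 1 and 100 for every row of user 2, which matches the expected table in the question.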
