What is the Pandas DataFrame equivalent of ARRAY_AGG in SQL?

Time: 2019-08-01 03:35:25

Tags: python sql pandas

I have a pandas DataFrame with prices, and I want to create a column named priceLags, as shown below:

             price        priceLags
1.           1800
2.           1750          1800

3.           1500          1750
                           1800

4.           1240          1500
                           1750
                           1800

5.           1456          1240
                           1500
                           1750

6.           1302          1456
                           1240
                           1500

priceLags consists of the prices from the previous 3 rows. In SQL, this would be:

ARRAY_AGG(price) OVER (ORDER BY ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS priceLags

How can I do this in pandas?

Thank you very much!

2 Answers:

Answer 0 (score: 0)

One way to create the same structure is:

  1. Create the lagged variables
df['lagged1'] = df['price'].shift(1)
df['lagged2'] = df['price'].shift(2)
df['lagged3'] = df['price'].shift(3)

df
Out[1]
    price   lagged1 lagged2 lagged3
0   1800    NaN     NaN     NaN
1   1750    1800.0  NaN     NaN
2   1500    1750.0  1800.0  NaN
3   1240    1500.0  1750.0  1800.0
4   1456    1240.0  1500.0  1750.0
5   1302    1456.0  1240.0  1500.0
  2. Stack these new variables
df.set_index('price').stack(dropna=False)\
   .reset_index(1).drop('level_1', axis=1)\
   .reset_index().rename(columns={0:'priceLags'})

Out[2]:
    price   priceLags
0   1800    NaN
1   1800    NaN
2   1800    NaN
3   1750    1800.0
4   1750    NaN
5   1750    NaN
6   1500    1750.0
7   1500    1800.0
8   1500    NaN
9   1240    1500.0
10  1240    1750.0
11  1240    1800.0
12  1456    1240.0
13  1456    1500.0
14  1456    1750.0
15  1302    1456.0
16  1302    1240.0
17  1302    1500.0

You can also drop the null values in the process:

df.set_index('price').stack(dropna=True).reset_index(level=1, drop=True).reset_index().rename(columns={0:'priceLags'})

Out[3]:
    price   priceLags
0   1750    1800.0
1   1500    1750.0
2   1500    1800.0
3   1240    1500.0
...
10  1302    1240.0
11  1302    1500.0
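
The stacked result is in long format, with one row per lag. If the goal is a single list per row, which is closer to what ARRAY_AGG returns, one possible sketch is to group the stacked values by the original row index and collect them with list. The sample prices come from the question; the names lagged, stacked and lists are only illustrative.

```python
import pandas as pd

# Sample prices from the question.
df = pd.DataFrame({'price': [1800, 1750, 1500, 1240, 1456, 1302]})

# One column per lag, as in step 1.
lagged = pd.DataFrame({f'lagged{t}': df['price'].shift(t) for t in (1, 2, 3)})

# Long format; the explicit dropna() removes the missing lags
# regardless of how stack() treats NaN in the installed pandas version.
stacked = lagged.stack().dropna()

# Group on the original row index (level 0) rather than on price,
# so that rows with equal prices are kept apart.
lists = stacked.groupby(level=0).agg(list)

# Rows without any earlier price get an empty list.
df['priceLags'] = [lists.get(i, []) for i in df.index]
```

Because shift introduces NaN, the lagged values are floats here; cast them back to int inside the list comprehension if that matters.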

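Addendum

After looking around, I found this great answer on how to create lagged columns programmatically. We can then stack and reset the index a couple of times in a single chained call to get the final result: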

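# assumes df holds only the 'price' column at this point
lags = range(1, 4)  # shift by 1, 2 and 3 rows, matching the three lags above

(
    df.assign(**{
        f'{col}_{t}': df[col].shift(t)
        for t in lags
        for col in df
    })
    .set_index('price').stack(dropna=True)  # group into one column
    .reset_index(level=1, drop=True)  # remove the column names
    .reset_index().rename(columns={0: 'priceLags'})  # reinsert the correct col names
)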
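
If the intermediate lag columns are not needed, the list column can also be built directly from the price values. The following is only a sketch: lag_lists is a hypothetical helper name, not a pandas function, and it assumes the rows are already in the desired order.

```python
import pandas as pd


def lag_lists(series, n_lags):
    """Hypothetical helper: for each row, list up to n_lags previous
    values, most recent first."""
    values = series.tolist()
    return [values[max(0, i - n_lags):i][::-1] for i in range(len(values))]


# Sample prices from the question.
df = pd.DataFrame({'price': [1800, 1750, 1500, 1240, 1456, 1302]})
df['priceLags'] = lag_lists(df['price'], 3)
```

This keeps the values as integers, since no NaN is ever introduced, and gives the first row an empty list.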

Answer 1 (score: 0)

Another approach is to define a custom aggregation function. The code below is not the most elegant, but it may do what you need:

# import some packages
import pandas as pd
from functools import reduce 

# create a test dataframe
df = pd.DataFrame([
    {'a': 'hello', 'b': 1},
    {'a': 'hello', 'b': 5},
    {'a': 'hello', 'b': 6},
    {'a': 'bubye', 'b': 3},
    {'a': 'bubye', 'b': 2},
    {'a': 'bonus', 'b': 3}
])

# define custom aggregation function: collect a group's values into a list
# (starting reduce from an empty list avoids special-casing ints or single rows)
def create_list(series):
    return reduce(lambda acc, y: acc + [y], series, [])

# apply different aggregation functions, including your custom one
(
    df
    .groupby("a")
    .agg({
        "b": ['sum', 'max', create_list],
    })
)
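
Note that this grouped form corresponds to ARRAY_AGG(b) with GROUP BY a, not to the windowed OVER (...) form in the question. As a possible simplification, the built-in list can serve as the aggregation function directly; the sketch below uses named aggregation, and the output column names b_sum, b_max and b_list are chosen only for illustration.

```python
import pandas as pd

# Same test data as in the answer above.
df = pd.DataFrame({
    'a': ['hello', 'hello', 'hello', 'bubye', 'bubye', 'bonus'],
    'b': [1, 5, 6, 3, 2, 3],
})

# Named aggregation gives flat column names instead of a column MultiIndex.
# sort=False keeps the groups in order of first appearance.
result = df.groupby('a', sort=False).agg(
    b_sum=('b', 'sum'),
    b_max=('b', 'max'),
    b_list=('b', list),
)
```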