Question

I have a pandas DataFrame with prices, and I want to create a column named priceLags, as shown below:

             price        priceLags
1.           1800
2.           1750          1800

3.           1500          1750
                           1800

4.           1240          1500
                           1750
                           1800

5.           1456          1240
                           1500
                           1750

6.           1302          1456
                           1240
                           1500

priceLags consists of the prices from the previous 3 rows. In SQL, this would be

ARRAY_AGG(price) OVER (ORDER BY ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS priceLags

How can I do this in pandas?

Thank you very much!

Answer 1

One way to create the same structure is:

Create the lagged variables

df['lagged1'] = df['price'].shift(1)
df['lagged2'] = df['price'].shift(2)
df['lagged3'] = df['price'].shift(3)

df
Out[1]
    price   lagged1 lagged2 lagged3
0   1800    NaN     NaN     NaN
1   1750    1800.0  NaN     NaN
2   1500    1750.0  1800.0  NaN
3   1240    1500.0  1750.0  1800.0
4   1456    1240.0  1500.0  1750.0
5   1302    1456.0  1240.0  1500.0
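
For anyone who wants to reproduce the table above, here is a minimal self-contained sketch. It assumes the DataFrame holds just the six prices from the question, and it builds the three lag columns in a loop instead of writing each `shift` call by hand.

```python
import pandas as pd

# Sample data taken from the question (assumed to be the whole frame)
df = pd.DataFrame({"price": [1800, 1750, 1500, 1240, 1456, 1302]})

# shift(t) moves the column down by t rows, so row i sees the price from row i - t
for t in range(1, 4):
    df[f"lagged{t}"] = df["price"].shift(t)

print(df)
```

Rows without enough history get NaN, which is also why the lag columns come out as floats.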

Stack these new variables

df.set_index('price').stack(dropna=False)\
   .reset_index(1).drop('level_1', axis=1)\
   .reset_index().rename(columns={0:'priceLags'})

Out[2]:
    price   priceLags
0   1800    NaN
1   1800    NaN
2   1800    NaN
3   1750    1800.0
4   1750    NaN
5   1750    NaN
6   1500    1750.0
7   1500    1800.0
8   1500    NaN
9   1240    1500.0
10  1240    1750.0
11  1240    1800.0
12  1456    1240.0
13  1456    1500.0
14  1456    1750.0
15  1302    1456.0
16  1302    1240.0
17  1302    1500.0

You can also drop the null values in the process:

df.set_index('price').stack(dropna=True).reset_index(level=1, drop=True).reset_index().rename(columns={0:'priceLags'})

Out[3]:
    price   priceLags
0   1750    1800.0
1   1500    1750.0
2   1500    1800.0
3   1240    1500.0
...
10  1302    1240.0
11  1302    1500.0
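
If what you want is one array per row, which is closer to what ARRAY_AGG returns, the following sketch may be an option. It is not part of the approach above: it assumes the sample prices from the question and builds the lists with plain Python slicing, then uses `explode` to get back to the long format.

```python
import pandas as pd

# Sample data taken from the question
df = pd.DataFrame({"price": [1800, 1750, 1500, 1240, 1456, 1302]})

n_lags = 3  # how many previous rows to collect
prices = df["price"].tolist()

# For row i, take up to n_lags earlier prices, most recent first
df["priceLags"] = [
    prices[max(0, i - n_lags):i][::-1]
    for i in range(len(prices))
]

# One row per (price, lag value); the first row has no history and is dropped
long = df.explode("priceLags").dropna(subset=["priceLags"])
print(df)
print(long)
```

The first row gets an empty list here rather than NaN, so adjust that if you need nulls instead.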

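Added

After looking around, I found this great answer on how to create lagged columns programmatically. We can then stack and reset the index a few times in a single chained call to get the final result: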

lags = range(1, 4)  # lags of 1, 2 and 3 rows

# assumes df holds only the 'price' column at this point
(
    df.assign(**{
        f'{col}_{t}': df[col].shift(t)
        for t in lags
        for col in df
    })
    .set_index('price').stack(dropna=True)  # group into one column
    .reset_index(level=1, drop=True)  # remove the column names
    .reset_index().rename(columns={0: 'priceLags'})  # reinsert the correct col names
)
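
The behaviour of the `dropna` argument to `stack` has changed across pandas releases, so here is a rough alternative that gets the same long format without `stack`. The helper name `lags_long` and the extra `row` and `lag` columns are my own additions for illustration, not something from the linked answer.

```python
import pandas as pd

def lags_long(df, col="price", lags=(1, 2, 3)):
    """Return one row per (original row, lag) with the lagged value (hypothetical helper)."""
    # One small frame per lag: the original column plus its shifted copy
    pieces = [
        df[[col]].assign(lag=t, priceLags=df[col].shift(t))
        for t in lags
    ]
    return (
        pd.concat(pieces)
        .dropna(subset=["priceLags"])   # drop rows without enough history
        .rename_axis("row")             # keep the original row position
        .reset_index()
        .sort_values(["row", "lag"])    # restore row order, nearest lag first
        .reset_index(drop=True)
    )

# Sample data taken from the question
df = pd.DataFrame({"price": [1800, 1750, 1500, 1240, 1456, 1302]})
result = lags_long(df)
print(result)
```

Drop the `row` and `lag` columns at the end if you only need `price` and `priceLags`.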

Answer 2

Another approach is to define a custom aggregation function. The code below isn't the most elegant, but it may do what you need:

# import some packages
import pandas as pd

# create a test dataframe
df = pd.DataFrame([
    {'a': 'hello', 'b': 1},
    {'a': 'hello', 'b': 5},
    {'a': 'hello', 'b': 6},
    {'a': 'bubye', 'b': 3},
    {'a': 'bubye', 'b': 2},
    {'a': 'bonus', 'b': 3}
])

# define custom aggregation function: collect a group's values into a list
# (works for any dtype, not only Python ints)
def create_list(series):
    return list(series)

# apply different aggregation functions, including your custom one
(
    df
    .groupby("a")
    .agg({
        "b": ['sum', 'max', create_list],
    })
)
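
The same grouped aggregation can probably be written a bit more compactly with named aggregation and the built-in `list`. This is only a sketch on the same test data; the output column names `total`, `largest` and `b_list` are made up for the example.

```python
import pandas as pd

# Same test data as above
df = pd.DataFrame({
    "a": ["hello", "hello", "hello", "bubye", "bubye", "bonus"],
    "b": [1, 5, 6, 3, 2, 3],
})

# sort=False keeps the groups in order of first appearance
out = df.groupby("a", sort=False)["b"].agg(
    total="sum",
    largest="max",
    b_list=list,  # plays the role of ARRAY_AGG within each group
)
print(out)
```

Each cell of `b_list` is a regular Python list, so the column has object dtype.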

What is the pandas DataFrame equivalent of SQL's ARRAY_AGG?
