I have a pandas DataFrame with prices, and I want to create a column named priceLags, as shown below:
    price  priceLags
1.  1800
2.  1750   1800
3.  1500   1750
           1800
4.  1240   1500
           1750
           1800
5.  1456   1240
           1500
           1750
6.  1302   1456
           1240
           1500
priceLags consists of the prices from the previous 3 rows. In SQL, this is
ARRAY_AGG(price) OVER (ORDER BY ROWS BETWEEN 1 FOLLOWING AND 3 FOLLOWING) AS priceLags
How can I do this in pandas?
Thank you very much!
Answer 0 (score: 0)
One way to create the same structure is:
df['lagged1'] = df['price'].shift(1)
df['lagged2'] = df['price'].shift(2)
df['lagged3'] = df['price'].shift(3)
df
Out[1]:
   price  lagged1  lagged2  lagged3
0   1800      NaN      NaN      NaN
1   1750   1800.0      NaN      NaN
2   1500   1750.0   1800.0      NaN
3   1240   1500.0   1750.0   1800.0
4   1456   1240.0   1500.0   1750.0
5   1302   1456.0   1240.0   1500.0
df.set_index('price').stack(dropna=False)\
.reset_index(1).drop('level_1', axis=1)\
.reset_index().rename(columns={0:'priceLags'})
Out[2]:
price priceLags
0 1800 NaN
1 1800 NaN
2 1800 NaN
3 1750 1800.0
4 1750 NaN
5 1750 NaN
6 1500 1750.0
7 1500 1800.0
8 1500 NaN
9 1240 1500.0
10 1240 1750.0
11 1240 1800.0
12 1456 1240.0
13 1456 1500.0
14 1456 1750.0
15 1302 1456.0
16 1302 1240.0
17 1302 1500.0
You can also drop the null values in the process:
df.set_index('price').stack(dropna=True).reset_index(level=1, drop=True).reset_index().rename(columns={0:'priceLags'})
Out[3]:
price priceLags
0 1750 1800.0
1 1500 1750.0
2 1500 1800.0
3 1240 1500.0
...
10 1302 1240.0
11 1302 1500.0
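If what you want is a single list-valued column, closer to what ARRAY_AGG returns, the previous values can also be collected row by row. This is only a minimal sketch, assuming the sample prices from the question and a default row order; the helper name `lag_list` is made up for illustration and is not a pandas function.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({'price': [1800, 1750, 1500, 1240, 1456, 1302]})

def lag_list(prices, n_lags=3):
    """Hypothetical helper: for each row, list up to n_lags previous values, most recent first."""
    values = prices.tolist()
    lists = [
        # Slice the (at most) n_lags values before row i, then reverse them.
        values[max(i - n_lags, 0):i][::-1]
        for i in range(len(values))
    ]
    # Reuse the original index so the assignment aligns row by row.
    return pd.Series(lists, index=prices.index)

df['priceLags'] = lag_list(df['price'])
print(df)
```

The first row gets an empty list rather than NaN, which may or may not be what you need downstream.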
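Added
After looking around, I found this great answer on how to create lagged columns programmatically. We can then stack and reset the index a few times in a single chained call to get the final result: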
lags = range(1, 4)  # shift by 1, 2 and 3 rows; assumes df holds only the 'price' column
(
    df.assign(**{
        f'{col}_{t}': df[col].shift(t)
        for t in lags
        for col in df
    })
    .set_index('price').stack(dropna=True)  # group into one column
    .reset_index(level=1, drop=True)  # remove the column names
    .reset_index().rename(columns={0: 'priceLags'})  # reinsert the correct column names
)
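For reuse, the same long-format result can be wrapped in a small function. The sketch below is one possible variant rather than the approach from the linked answer: it builds one frame per lag with `pd.concat` and drops missing values with `dropna`, so it does not rely on the `dropna` argument of `stack`. The function name `lags_long` and the temporary `row` and `lag` columns are assumptions made for illustration.

```python
import pandas as pd

def lags_long(df, col='price', lags=(1, 2, 3)):
    """Hypothetical helper: one output row per (original row, lag) pair, missing lags dropped."""
    pieces = [
        # 'lag' records which shift produced the value so rows can be ordered later.
        pd.DataFrame({col: df[col], 'lag': t, 'priceLags': df[col].shift(t)})
        for t in lags
    ]
    return (
        pd.concat(pieces)
        .dropna(subset=['priceLags'])         # rows without enough history
        .rename_axis('row')
        .reset_index()                        # keep the original row position as a column
        .sort_values(['row', 'lag'], kind='stable')
        .drop(columns=['row', 'lag'])
        .reset_index(drop=True)
    )

df = pd.DataFrame({'price': [1800, 1750, 1500, 1240, 1456, 1302]})
result = lags_long(df)
print(result)
```

Sorting on the original row position instead of the price keeps the output correct even when the same price appears more than once.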
Answer 1 (score: 0)
Another approach is to define a custom aggregation function. The code below isn't the most elegant, but it may do what you need:
# import some packages
import pandas as pd
from functools import reduce

# create a test dataframe
df = pd.DataFrame([
    {'a': 'hello', 'b': 1},
    {'a': 'hello', 'b': 5},
    {'a': 'hello', 'b': 6},
    {'a': 'bubye', 'b': 3},
    {'a': 'bubye', 'b': 2},
    {'a': 'bonus', 'b': 3}
])

# define custom aggregation function
def create_list(series):
    # a single-element group is returned as a one-item list
    if len(series) == 1:
        return [x for x in series]
    # wrap the first element in a list, then append each following value
    return reduce(lambda x, y: (x if isinstance(x, list) else [x]) + [y], series)

# apply different aggregation functions, including your custom one
(
    df
    .groupby("a")
    .agg({
        "b": ['sum', 'max', create_list],
    })
)
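The custom function above collects each group's values into a list, and Python's built-in `list` can usually be passed as the aggregation directly. A minimal sketch using named aggregation on the same test data; the output column names `b_sum`, `b_max` and `b_list` are chosen here for illustration:

```python
import pandas as pd

# Same test data as above, written column-wise.
df = pd.DataFrame({
    'a': ['hello', 'hello', 'hello', 'bubye', 'bubye', 'bonus'],
    'b': [1, 5, 6, 3, 2, 3],
})

result = df.groupby('a').agg(
    b_sum=('b', 'sum'),
    b_max=('b', 'max'),
    b_list=('b', list),  # built-in list gathers each group's values in row order
)
print(result)
```

Groups are sorted by key by default, so the rows come out as bonus, bubye, hello.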