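I am looking for a faster, more elegant way to solve the following problem:
Given a pandas data frame, I want to merge the current row and the previous k (prev_len) rows into a new row of a new data frame. I want to do this for every valid old row, i.e. every row that has k previous rows. In other words, each new row consists of prev_len + 1 old rows placed horizontally next to each other. The resulting data frame will therefore have prev_len fewer rows than the old one, and its number of columns will be (prev_len + 1) * number_of_columns_in_old_data_frame. See the example below with prev_len=2. Thanks a lot in advance!
Given data frame: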
x1 x2 y
0 166 9 -2.426679
1 192 6 -0.428913
2 198 1 1.265936
3 117 0 -0.866740
4 183 1 -0.678886
Desired data frame:
00_x1 00_x2 00_y 01_x1 01_x2 01_y 02_x1 02_x2 02_y
0 166.0 9.0 -2.426679 192.0 6.0 -0.428913 198.0 1.0 1.265936
1 192.0 6.0 -0.428913 198.0 1.0 1.265936 117.0 0.0 -0.866740
2 198.0 1.0 1.265936 117.0 0.0 -0.866740 183.0 1.0 -0.678886
My solution:
import numpy as np
import pandas as pd

# given data ----------------------------------------------------------
np.random.seed(seed=123)
df = pd.DataFrame({'x1': np.random.randint(100, 200, 5),
                   'x2': np.random.randint(0, 10, 5),
                   'y': np.random.randn(5)})
print(df)

# desired data -------------------------------------------------------
prev_len = 2
# build the column names: lag prefix (00, 01, ...) plus the old column name
lag = []
for i in range(prev_len + 1):
    lag += [i] * len(df.columns.to_list())
col = df.columns.to_list() * (prev_len + 1)
colnames = ["{:02}_{}".format(lag_, col_) for lag_, col_ in zip(lag, col)]
df_new = pd.DataFrame(columns=colnames)
for i_new, i_old in zip(range(df.shape[0] - prev_len), range(prev_len, df.shape[0])):
    # collect the prev_len previous rows and the current row, oldest first
    # (a plain list replaces Series.append, which was removed in pandas 2.0)
    obs = []
    for j in range(i_old - prev_len, i_old + 1):
        obs += df.iloc[j, :].to_list()
    df_new.loc[i_new] = obs
print(df_new)
Answer 0 (score: 3)
Edit: as @user3483203 pointed out, I have generalized it:
prev_len = 2
key_cols = range(prev_len+1)
df_new = pd.concat([df.shift(-i) for i in key_cols], axis=1, keys=map(str, key_cols)).dropna()
df_new.columns = df_new.columns.map('_'.join)
Original:
For the prev_len = 2 you need, I think pd.concat, shift, and dropna are enough:
df_new = pd.concat([df, df.shift(-1), df.shift(-2)], axis=1, keys=['0', '1', '2']).dropna()
df_new.columns = df_new.columns.map('_'.join)
Out[556]:
0_x1 0_x2 0_y 1_x1 1_x2 1_y 2_x1 2_x2 2_y
0 166 9 -2.426679 192.0 6.0 -0.428913 198.0 1.0 1.265936
1 192 6 -0.428913 198.0 1.0 1.265936 117.0 0.0 -0.866740
2 198 1 1.265936 117.0 0.0 -0.866740 183.0 1.0 -0.678886
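The column names above come out as 0_x1 rather than the 00_x1 shown in the question. A minimal sketch of the same concat/shift idea with zero-padded prefixes follows. The function name stack_rows and the padding width are assumptions made for illustration, not part of the answer. It also trims the trailing rows by position instead of calling dropna, on the assumption that rows containing genuine NaN values in the input should be kept.

```python
import pandas as pd


def stack_rows(df, prev_len):
    """Put each row side by side with the prev_len rows that follow it."""
    # hypothetical choice: pad lag labels to at least two digits, as in the question
    width = max(2, len(str(prev_len)))
    shifted = {f"{i:0{width}d}": df.shift(-i) for i in range(prev_len + 1)}
    out = pd.concat(shifted, axis=1)
    # the last prev_len rows are incomplete after shifting, so drop them by position
    out = out.iloc[: max(len(df) - prev_len, 0)]
    out.columns = out.columns.map("_".join)
    return out


df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})
res = stack_rows(df, prev_len=2)
print(res)
```

Slicing by position behaves like dropna here because the example data has no missing values. With missing values in the input the two would differ.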
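Answer 1 (score: 1)
I would use skimage.util.view_as_windows followed by a reshape. In general, you want the first axis of window_shape to be one larger than k, so that each window includes the current row plus the k previous rows.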
from skimage.util import view_as_windows
k = 2
x, y = df.shape
u = df.values
w = view_as_windows(u, window_shape=(k+1, y)).reshape(-1, y*(k+1))
res = pd.DataFrame(
    w, columns=[f'{i:02}_{col}' for i in range(k+1) for col in df.columns]
)
00_x1 00_x2 00_y 01_x1 01_x2 01_y 02_x1 02_x2 02_y
0 166.0 9.0 -2.426679 192.0 6.0 -0.428913 198.0 1.0 1.265936
1 192.0 6.0 -0.428913 198.0 1.0 1.265936 117.0 0.0 -0.866740
2 198.0 1.0 1.265936 117.0 0.0 -0.866740 183.0 1.0 -0.678886
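If scikit-image is not available, the same windowing can probably be done with NumPy alone. The sketch below swaps in numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later) for view_as_windows. This is a substitute technique, not what the answer uses, and the variable names n_rows, n_cols, and windows are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

k = 2
n_rows, n_cols = df.shape
u = df.to_numpy(dtype=float)

# each window spans k + 1 consecutive rows and all columns;
# the result has shape (n_rows - k, 1, k + 1, n_cols)
windows = sliding_window_view(u, window_shape=(k + 1, n_cols))

# flatten every window into a single row (reshape copies the data here)
w = windows.reshape(-1, n_cols * (k + 1))

res = pd.DataFrame(
    w, columns=[f"{i:02}_{col}" for i in range(k + 1) for col in df.columns]
)
print(res)
```

As with the skimage version, every value is converted to float because the windows are taken over a single NumPy array.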
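Answer 2 (score: 0)
For maximum flexibility, I like to iterate over pandas objects. This approach can have mixed performance before it is optimized, so I encourage you to tinker with it until it is fast enough for your needs.
Initialize the data: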
import pandas as pd
data = {"index": [0, 1, 2, 3, 4],
        "x1": [166, 192, 198, 117, 183],
        "x2": [9, 6, 1, 0, 1],
        "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]}
df = pd.DataFrame(data)
df.set_index('index', inplace=True)
Iterate and build the new df:
lookahead = 2
records = []
for idx in df.index[:-lookahead]:
    # Create an empty record
    rec = {}
    # Lookahead + 1
    for i in range(0, lookahead+1):
        # Get the values
        x1, x2, y = df.iloc[idx+i, :]
        # Cycle through, then add
        for k, v in zip(['x1', 'x2', 'y'], [x1, x2, y]):
            rec[f"{i:02d}_{k}"] = v
    # Append
    records.append(rec)
# Write your df
df_end = pd.DataFrame(records)
# yields:
00_x1 00_x2 00_y 01_x1 01_x2 01_y 02_x1 02_x2 02_y
0 166.0 9.0 -2.426679 192.0 6.0 -0.428913 198.0 1.0 1.265936
1 192.0 6.0 -0.428913 198.0 1.0 1.265936 117.0 0.0 -0.866740
2 198.0 1.0 1.265936 117.0 0.0 -0.866740 183.0 1.0 -0.678886
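The loop above passes index labels to df.iloc, which works because the index happens to be 0 to 4. A possible variation that iterates over row positions instead is sketched below, so it should also work with a non-integer index. The string index "a" to "e" and the names pos, name, and value are assumptions made for illustration.

```python
import pandas as pd

# same values as above, but with a hypothetical non-integer index
df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]},
                  index=list("abcde"))

lookahead = 2
records = []
# iterate over positions, not labels, so iloc is always given valid integers
for pos in range(len(df) - lookahead):
    rec = {}
    for i in range(lookahead + 1):
        row = df.iloc[pos + i]
        # take the column names from the row itself instead of hard-coding them
        for name, value in row.items():
            rec[f"{i:02d}_{name}"] = value
    records.append(rec)

df_end = pd.DataFrame(records)
print(df_end)
```

Using range(len(df) - lookahead) also covers lookahead = 0, where the slice df.index[:-lookahead] would be empty.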