Question

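I am looking for a faster and more elegant way to solve the following problem:

Given a pandas DataFrame, I want to combine the current row and the previous k (prev_len) rows into a single new row (in a new DataFrame). I want to do this for every valid old row, i.e., every row that has k previous rows. Each new row therefore consists of prev_len + 1 old rows placed horizontally next to each other. The resulting DataFrame will have prev_len fewer rows than the old one, and its number of columns will be (prev_len + 1) * number_of_columns_in_old_data_frame. See the example below with prev_len=2. Thanks a lot in advance!

Given DataFrame: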

    x1  x2         y
0  166   9 -2.426679
1  192   6 -0.428913
2  198   1  1.265936
3  117   0 -0.866740
4  183   1 -0.678886

Desired DataFrame:

   00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886

My solution:

import numpy as np
import pandas as pd
import random 

# given data ----------------------------------------------------------
np.random.seed(seed=123)
df = pd.DataFrame({'x1': np.random.randint(100, 200, 5), 
                       'x2': np.random.randint(0,10,5), 
                       'y': np.random.randn(5)})
print(df)

# desired data  -------------------------------------------------------
prev_len = 2

lag = []

for i in range(prev_len + 1):
    lag += [i] * len(df.columns.to_list())

col = df.columns.to_list() * (prev_len + 1)
colnames = ["{:02}_{}".format(lag_, col_) for lag_, col_ in zip(lag, col)]

df_new = pd.DataFrame(columns = colnames)

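for i_new, i_old in zip(range(df.shape[0] - prev_len), range(prev_len, df.shape[0])):

    # Join the prev_len previous rows and the current row into one flat Series.
    # pd.concat replaces Series.append, which was removed in pandas 2.0, and
    # prev_len replaces the hardcoded 2 in the window start.
    obs = pd.concat([df.iloc[j, :] for j in range(i_old - prev_len, i_old + 1)])

    df_new.loc[i_new] = obs.to_list()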

print(df_new)

Answer 1

Edit: as @user3483203 suggested, I generalized it:

prev_len = 2
key_cols = range(prev_len+1)
df_new = pd.concat([df.shift(-i) for i in key_cols], axis=1, keys=map(str, key_cols)).dropna()
df_new.columns = df_new.columns.map('_'.join)

Original:
For prev_len = 2, as in your example, I think pd.concat, shift and dropna are enough:

df_new = pd.concat([df, df.shift(-1), df.shift(-2)], axis=1, keys=['0', '1', '2']).dropna()
df_new.columns = df_new.columns.map('_'.join)


Out[556]:
   0_x1  0_x2       0_y   1_x1  1_x2       1_y   2_x1  2_x2       2_y
0   166     9 -2.426679  192.0   6.0 -0.428913  198.0   1.0  1.265936
1   192     6 -0.428913  198.0   1.0  1.265936  117.0   0.0 -0.866740
2   198     1  1.265936  117.0   0.0 -0.866740  183.0   1.0 -0.678886
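
The shift-and-concat idea can also be wrapped in a small function. The following is only a sketch, not part of the original answer: the name `stack_lagged_rows` is hypothetical, the keys are zero-padded so that the column names follow the `00_x1` style from the question, and the trailing rows are cut off with `iloc` instead of `dropna()`, on the assumption that rows which already contain NaN in the input should be kept.

```python
import pandas as pd


def stack_lagged_rows(df, prev_len):
    """Hypothetical helper: put each window of prev_len + 1 consecutive rows side by side."""
    n_out = len(df) - prev_len
    if n_out <= 0:
        raise ValueError("prev_len must be smaller than the number of rows")

    # Zero-padded keys ("00", "01", ...) become the column-name prefixes
    keys = [f"{i:02}" for i in range(prev_len + 1)]

    # shift(-i) moves the row at position r + i up to position r
    shifted = [df.shift(-i) for i in range(prev_len + 1)]

    # Only the first n_out rows have a complete window; the rest are padded with NaN
    out = pd.concat(shifted, axis=1, keys=keys).iloc[:n_out]
    out.columns = out.columns.map("_".join)
    return out


df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

result = stack_lagged_rows(df, prev_len=2)
print(result)
```

As in the answer above, the shifted blocks are converted to float because `shift` fills the vacated rows with NaN.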

Answer 2

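I would use skimage.util.view_as_windows, followed by a reshape. In general, you want the first axis of window_shape to be one larger than k, so that each window includes the current row plus the k previous rows.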

from skimage.util import view_as_windows

k = 2
x, y = df.shape
u = df.values

w = view_as_windows(u, window_shape=(k+1, y)).reshape(-1, y*(k+1))

res = pd.DataFrame(
    w, columns=[f'{i:02}_{col}' for i in range(k+1) for col in df.columns]
)

   00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886
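
If scikit-image is not available, a similar windowing step can probably be done with NumPy alone. This sketch swaps in `numpy.lib.stride_tricks.sliding_window_view` (available from NumPy 1.20) for `view_as_windows`; it is an alternative to the answer's approach, not the same call. Note that `sliding_window_view` appends the window axis last, so a transpose is needed before the reshape.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

k = 2
u = df.to_numpy()  # mixed int/float columns are upcast to float64

# Shape (n_rows - k, n_cols, k + 1): element [r, c, j] is u[r + j, c]
windows = sliding_window_view(u, window_shape=k + 1, axis=0)

# Move the window axis in front of the column axis, then flatten each window
w = windows.transpose(0, 2, 1).reshape(len(df) - k, -1)

res = pd.DataFrame(
    w, columns=[f"{i:02}_{col}" for i in range(k + 1) for col in df.columns]
)
print(res)
```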

Answer 3

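For maximum flexibility, I like to iterate over pandas objects. This approach can have mixed performance before it is optimized, so I encourage you to tinker with it until the speed suits your needs.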


Initialize the data:

import pandas as pd

data = {"index": [0, 1, 2, 3, 4],
        "x1": [166, 192, 198, 117, 183],
        "x2": [9, 6, 1, 0, 1],
        "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]}
df = pd.DataFrame(data)
df.set_index('index', inplace=True)

Iterate and build the new df:

lookahead = 2
records = []
for idx in df.index[:-lookahead]:
    # Create an empty record
    rec = {}
    # Lookahead + 1
    for i in range(0, lookahead+1):
        # Get the values
        x1, x2, y = df.iloc[idx+i, :]
        # Cycle through, then add
        for k, v in zip(['x1', 'x2', 'y'], [x1, x2, y]):
            rec[f"{i:02d}_{k}"] = v
    # Append
    records.append(rec)

# Write your df
df_end = pd.DataFrame(records)

# yields:
#    00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
# 0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
# 1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
# 2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886
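
The loop above passes index labels to `iloc` and hardcodes the column names, so it relies on the index being 0..n-1 and on the columns being exactly x1, x2, y. A possible variant, sketched here under the assumption that only row positions should matter, iterates over positions and reads the column names from each row. The string index is made up purely to show that labels are not used.

```python
import pandas as pd

# Illustrative data with a non-numeric index (the labels are an assumption for this demo)
df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]},
                  index=["a", "b", "c", "d", "e"])

lookahead = 2
records = []

# Iterate over window start positions rather than index labels
for start in range(len(df) - lookahead):
    rec = {}
    for i in range(lookahead + 1):
        # Positional row lookup; row.items() yields (column name, value) pairs
        row = df.iloc[start + i]
        for col, value in row.items():
            rec[f"{i:02d}_{col}"] = value
    records.append(rec)

df_end = pd.DataFrame(records)
print(df_end)
```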

How can I quickly and elegantly convert multiple rows of a pandas DataFrame into one row?

3 answers: