How can I quickly convert multiple rows of a pandas DataFrame into a single row?

Asked: 2019-06-18 19:58:32

Tags: python python-3.x pandas

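I am looking for a faster and more elegant way to solve the following problem:

Given a pandas DataFrame, I want to merge the current row and the previous k (prev_len) rows into one new row of a new data frame. I want to do this for every valid old row, i.e. every row that has k previous rows. In other words, each new row consists of prev_len + 1 old rows placed horizontally next to each other. The resulting data frame therefore has prev_len fewer rows than the old one, and its number of columns is (prev_len + 1) * number_of_columns_in_old_data_frame. See the example below with prev_len=2. Thanks a lot in advance!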
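
As a quick sanity check on the shape arithmetic described above, here is a minimal sketch. The helper name `stacked_shape` is hypothetical and only computes the expected dimensions; it does not build the new data frame.

```python
import pandas as pd


def stacked_shape(df: pd.DataFrame, prev_len: int) -> tuple:
    """Expected (rows, columns) of the widened frame (hypothetical helper)."""
    n_rows, n_cols = df.shape
    # prev_len fewer rows, and (prev_len + 1) copies of every column
    return (n_rows - prev_len, (prev_len + 1) * n_cols)


df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

print(stacked_shape(df, prev_len=2))  # (3, 9)
```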


Given data frame:

    x1  x2         y
0  166   9 -2.426679
1  192   6 -0.428913
2  198   1  1.265936
3  117   0 -0.866740
4  183   1 -0.678886

Desired data frame:

   00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886

My solution:

import numpy as np
import pandas as pd
import random 

# given data ----------------------------------------------------------
np.random.seed(seed=123)
df = pd.DataFrame({'x1': np.random.randint(100, 200, 5), 
                       'x2': np.random.randint(0,10,5), 
                       'y': np.random.randn(5)})
print(df)

# desired data  -------------------------------------------------------
prev_len = 2

lag = []

for i in range(prev_len + 1):
    lag += [i] * len(df.columns.to_list())

col = df.columns.to_list() * (prev_len + 1)
colnames = ["{:02}_{}".format(lag_, col_) for lag_, col_ in zip(lag, col)]

df_new = pd.DataFrame(columns = colnames)

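for i_new, i_old in zip(range(df.shape[0] - prev_len), range(prev_len, df.shape[0])):

    # stack the current row and the prev_len rows before it into one Series
    # (Series.append was removed in pandas 2.0, so pd.concat is used instead)
    obs = pd.concat([df.iloc[j, :] for j in range(i_old - prev_len, i_old + 1)])

    df_new.loc[i_new] = obs.to_list()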

print(df_new)

3 Answers:

Answer 0 (score: 3)

Edit: as @user3483203 pointed out, I have generalized it

prev_len = 2
key_cols = range(prev_len+1)
df_new = pd.concat([df.shift(-i) for i in key_cols], axis=1, keys=map(str, key_cols)).dropna()
df_new.columns = df_new.columns.map('_'.join)

Original
For the prev_len = 2 you need, I think pd.concat, shift and dropna are enough

df_new = pd.concat([df, df.shift(-1), df.shift(-2)], axis=1, keys=['0', '1', '2']).dropna()
df_new.columns = df_new.columns.map('_'.join)


Out[556]:
   0_x1  0_x2       0_y   1_x1  1_x2       1_y   2_x1  2_x2       2_y
0   166     9 -2.426679  192.0   6.0 -0.428913  198.0   1.0  1.265936
1   192     6 -0.428913  198.0   1.0  1.265936  117.0   0.0 -0.866740
2   198     1  1.265936  117.0   0.0 -0.866740  183.0   1.0 -0.678886
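
If the zero-padded names from the question (`00_x1`, `01_x1`, ...) are wanted, the same shift-and-concat idea can be sketched as below. The function name `widen_with_shift` is hypothetical, and the sketch assumes string column labels. It also trims the trailing rows by position rather than with `dropna()`, on the assumption that rows containing genuine NaN values in the source data should be kept.

```python
import pandas as pd


def widen_with_shift(df: pd.DataFrame, prev_len: int) -> pd.DataFrame:
    """Place each row next to the prev_len rows that follow it (hypothetical helper)."""
    lags = range(prev_len + 1)
    keys = [f"{i:02}" for i in lags]  # "00", "01", "02", ...
    wide = pd.concat([df.shift(-i) for i in lags], axis=1, keys=keys)
    # flatten the (lag, column) MultiIndex into names like "01_x1"
    wide.columns = wide.columns.map("_".join)
    # the last prev_len rows are incomplete because of the shifts; cut them by position
    return wide.iloc[: len(df) - prev_len]


df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

res = widen_with_shift(df, prev_len=2)
print(res)
```

As in the output above, the unshifted `00_*` columns keep their integer dtype, while the shifted columns become float because `shift` introduces NaN before the trailing rows are cut.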

Answer 1 (score: 1)

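I would use skimage.util.view_as_windows, followed by a reshape. In general, you want the first axis of window_shape to be one larger than k, so that each window includes the current row plus the k adjacent rows.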


from skimage.util import view_as_windows

k = 2
x, y = df.shape
u = df.values

w = view_as_windows(u, window_shape=(k+1, y)).reshape(-1, y*(k+1))

res = pd.DataFrame(
    w, columns=[f'{i:02}_{col}' for i in range(k+1) for col in df.columns]
)

   00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886
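
If adding scikit-image as a dependency is not desirable, a similar windowing can be sketched with NumPy alone. This assumes NumPy 1.20 or later, where `numpy.lib.stride_tricks.sliding_window_view` is available; the function name `widen_with_windows` is hypothetical. Note that `sliding_window_view` appends the window axis at the end, so a transpose is needed before the reshape.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def widen_with_windows(df: pd.DataFrame, k: int) -> pd.DataFrame:
    """Flatten every run of k + 1 consecutive rows into one row (hypothetical helper)."""
    u = df.to_numpy()
    n_rows, n_cols = u.shape
    # shape (n_rows - k, n_cols, k + 1): the window axis comes last
    windows = sliding_window_view(u, window_shape=k + 1, axis=0)
    # reorder to (n_rows - k, k + 1, n_cols) so each old row stays contiguous
    flat = windows.transpose(0, 2, 1).reshape(n_rows - k, (k + 1) * n_cols)
    columns = [f"{i:02}_{col}" for i in range(k + 1) for col in df.columns]
    return pd.DataFrame(flat, columns=columns)


df = pd.DataFrame({"x1": [166, 192, 198, 117, 183],
                   "x2": [9, 6, 1, 0, 1],
                   "y": [-2.426679, -0.428913, 1.265936, -0.866740, -0.678886]})

res = widen_with_windows(df, k=2)
print(res)
```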

Answer 2 (score: 0)

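For maximum flexibility, I like to iterate over pandas objects. This approach can have mixed performance before it is optimized, so I encourage you to tinker with it until it is fast enough for your needs.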


Initialize the data:

import pandas as pd

data = {"index":[0,1,2,3,4],
       "x1":[166,192,198,117,183],
       "x2":[9,6,1,0,1],
       "y":[-2.426679,-0.428913,1.265936,-0.866740, -0.678886]}

df = pd.DataFrame(data)
df.set_index('index', inplace=True)

Iterate and build the new df


lookahead = 2

records = []

# Iterate by position so this also works when the index is not 0..n-1
for pos in range(len(df) - lookahead):
  # Create an empty record
  rec = {}
  # Lookahead + 1
  for i in range(lookahead + 1):
    # Get the row at this offset
    row = df.iloc[pos + i]
    # Cycle through the columns, then add
    for k, v in row.items():
      rec[f"{i:02d}_{k}"] = v
  # Append
  records.append(rec)

# Write your df
df_end = pd.DataFrame(records)

# yields:
   00_x1  00_x2      00_y  01_x1  01_x2      01_y  02_x1  02_x2      02_y
0  166.0    9.0 -2.426679  192.0    6.0 -0.428913  198.0    1.0  1.265936
1  192.0    6.0 -0.428913  198.0    1.0  1.265936  117.0    0.0 -0.866740
2  198.0    1.0  1.265936  117.0    0.0 -0.866740  183.0    1.0 -0.678886