Question

我想扩展数据框并定期复制该行。

import pandas as pd 
import numpy as np 
def expandData(data, timeStep=2, sampleLen= 5):
    dataEp = pd.DataFrame()
    for epoch in range(int(len(data)/sampleLen)):
        dataSample = data.iloc[epoch*sampleLen:(epoch+1)*sampleLen, :]
        for num in range(int(sampleLen-timeStep +1)):
            tempDf = dataSample.iloc[num:timeStep+num,:]
            dataEp = pd.concat([dataEp, tempDf],axis= 0)
    return dataEp

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
dfEp = expandData(df, 3, 5)

Output:

df
     a  other
0   0    100
1   1    101
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
8  18    108
9  19    109

dfEp
     a  other
0   0    100
1   1    101
2   2    102
1   1    101
2   2    102
3   3    103
2   2    102
3   3    103
4   4    104
5  15    105
6  16    106
7  17    107
6  16    106
7  17    107
8  18    108
7  17    107
8  18    108
9  19    109

Expected:

我希望有一个更好的方法来实现它，并获得良好的性能，就像数据框具有较大的行大小（例如4万行）一样，我的代码将运行约20分钟。

Edit:

实际上，我希望重复一个小序列，大小为timeStep。我将expandData(df, 2, 5)更改为expandData(df, 3, 5)。

Answer 1

如果您的a值均匀分布，则可以测试序列中的中断，然后根据this answer复制每个连续序列中的行：

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})
#equally spaced rows have value zero, start/stop rows not
df["start/stop"] = df.a.diff().shift(-1) - df.a.diff()
#repeat rows with value zero in the new column
repeat = [2 if val == 0 else 1 for val in df["start/stop"]]
df = df.loc[np.repeat(df.index.values, repeat)]
print(df)

示例输出：

    a  other  start/stop
0   0    100         NaN
1   1    101         0.0
1   1    101         0.0
2   2    102         0.0
2   2    102         0.0
3   3    103         0.0
3   3    103         0.0
4   4    104        10.0
5  15    105       -10.0
6  16    106         0.0
6  16    106         0.0
7  17    107         0.0
7  17    107         0.0
8  18    108         0.0
8  18    108         0.0
9  19    109         NaN

如果它只是一个纪元长度（您没有明确指定规则），那么它甚至更简单：

df = pd.DataFrame({'a':list(np.arange(5))+list(np.arange(15,20)),
'other':list(np.arange(100,110))})

sampleLen = 5
repeat = np.repeat([2], sampleLen)
repeat[0] = repeat[-1] = 1
repeat = np.tile(repeat, len(df)//sampleLen)

df = df.loc[np.repeat(df.index.values, repeat)]

如何提高大熊猫的抵制速度

1 个答案: