How to run a function on each row of a pandas DataFrame

Time: 2020-10-16 06:38:20

Tags: python pandas dataframe

I have a dataframe_1 like this:

Index   Time          Label
0       0.000 ns      Segment 1
1       2.749 sec     baseline
2       3.459 min     begin test
3       7.009 min     end of test

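I want to add several new rows between each pair of consecutive rows of dataframe_1, where each new row's "Time" column increases by one minute until the next row of dataframe_1 is reached (with a corresponding label). For example, the table above should end up looking like this: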

Index     Time               Label
0         0.000 ns           Segment 1
1         2.749 sec          baseline
2         00:01:02.749000    baseline + 1min
3         00:02:02.749000    baseline + 2min
4         00:03:02.749000    baseline + 3min
5         3.459 min          begin test
6         00:04:27.540000    begin test + 1min
7         00:05:27.540000    begin test + 2min
8         00:06:27.540000    begin test + 3min
9         7.009 min          end of test

Using the Timedelta type via pd.to_timedelta() is fine.

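I think the best approach is to split each row of dataframe_1 into its own dataframe, add rows for each one-minute increment, and then concat the dataframes back together. However, I'm not sure how to do this.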

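Should I use nested for loops, [first] iterating over the rows of dataframe_1 and [second] iterating over a counter so that I can increment the minutes to create the new rows?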

Until now I was not splitting the individual rows into new dataframes, and was doing the second iteration like this:

    baseline_row = df_legend[df_legend['Label'] == 'baseline']
    [baseline_index] = baseline_row.index
    baseline_time = baseline_row['Time']

    interval_mins = 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

    cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
    cutoff_time = pd.to_timedelta(cutoff_time_np)
    
    while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):

        new_row = baseline_row.copy()
        new_row['Label'] = f'minute {interval_mins}'
        new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
        new_row.index = [baseline_index + interval_mins - 0.5]

        df_legend = df_legend.append(new_row, ignore_index=False)
        df_legend = df_legend.sort_index().reset_index(drop=True)
        pdb.set_trace()

        interval_mins += 1
        new_time = baseline_time + pd.Timedelta(minutes=interval_mins)

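But since I want to do this for every row of the original dataframe_1, I was thinking of splitting it into separate dataframes and then putting them back together. I'm just not sure what the best way to do this is, especially since pandas is apparently very slow when iterating over rows.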

I would really appreciate some guidance.

2 answers:

Answer 0: (score: 1)

This may be faster than your solution.

import numpy as np
import pandas as pd

df.Time = pd.to_timedelta(df.Time)
# counts = whole minutes from each row to the next one (0 for the last row)
df['counts'] = df.Time.diff().dt.total_seconds() / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)

df

             Time        Label  counts
0        00:00:00    Segment 1       0
1 00:00:02.749000     baseline       3
2 00:03:27.540000   begin test       3
3 00:07:00.540000  end of test       0

Then use iterrows to get the desired output.

new_df = []
for _, row in df.iterrows():
    # Always keep the original row.
    new_df.append(row)
    new_row = row
    # Add one extra row per whole minute before the next original row.
    for i in range(row.counts):
        new_row = new_row.copy()
        new_row['Time'] += pd.Timedelta('1 min')
        new_row['Label'] = f'{row.Label} + {i+1}min'
        new_df.append(new_row)

new_df = pd.DataFrame(new_df)
new_df

             Time              Label  counts
0        00:00:00          Segment 1       0
1 00:00:02.749000           baseline       3
1 00:01:02.749000    baseline + 1min       3
1 00:02:02.749000    baseline + 2min       3
1 00:03:02.749000    baseline + 3min       3
2 00:03:27.540000         begin test       3
2 00:04:27.540000  begin test + 1min       3
2 00:05:27.540000  begin test + 2min       3
2 00:06:27.540000  begin test + 3min       3
3 00:07:00.540000        end of test       0
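
Since the question worries about row iteration being slow, the same idea can also be written without a Python-level loop. The following is only a rough sketch and not part of the answer above: it swaps iterrows for Index.repeat plus groupby().cumcount(), rebuilds the sample data itself, and uses a helper column named src that is purely an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (assumption: Time parses with pd.to_timedelta).
df = pd.DataFrame({
    "Time": pd.to_timedelta(["0.000 ns", "2.749 sec", "3.459 min", "7.009 min"]),
    "Label": ["Segment 1", "baseline", "begin test", "end of test"],
})

# Whole minutes between each row and the next one (0 for the last row).
gap_min = df["Time"].diff().shift(-1).dt.total_seconds() / 60
counts = np.floor(gap_min).fillna(0).astype(int)

# Repeat every row counts + 1 times, remembering which source row it came from.
# "src" is a hypothetical helper column, dropped again at the end.
out = df.loc[df.index.repeat((counts + 1).to_numpy())]
out = out.rename_axis("src").reset_index()

# step is 0 for the original row and 1..n for its copies.
step = out.groupby("src").cumcount()
out["Time"] = out["Time"] + pd.to_timedelta(step, unit="min")
suffix = " + " + step.astype(str) + "min"
out["Label"] = out["Label"].where(step == 0, out["Label"] + suffix)
out = out.drop(columns="src")
print(out)
```

For the sample data this should give the same ten labels as the table above, with a fresh 0..9 index and no counts column. Whether it is actually faster than iterrows depends on the size of the frame and has not been measured here.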

Answer 1: (score: 1)

I assume you have converted the Time column from the "number unit" format to a string representation of time, like this:

               Time        Label
Index                           
0      00:00:00.000    Segment 1
1      00:00:02.749     baseline
2      00:03:27.540   begin test
3      00:07:00.540  end of test
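
The answer does not show how that conversion is done. One possible way, sketched here under the assumption that every duration is shorter than 24 hours, is to parse the strings with pd.to_timedelta, add them to an arbitrary midnight timestamp, and format the result with strftime. The variable names raw, td and as_text are made up for this illustration.

```python
import pandas as pd

raw = pd.Series(["0.000 ns", "2.749 sec", "3.459 min", "7.009 min"])

# Parse "number unit" strings; round to milliseconds to absorb any
# floating-point noise from the parser.
td = pd.to_timedelta(raw).dt.round("ms")

# Assumption: all durations are under 24 hours, so adding them to an
# arbitrary midnight lets strftime render them as a time of day.
as_text = (td + pd.Timestamp("1970-01-01")).dt.strftime("%H:%M:%S.%f")

# %f always prints 6 digits; keep only the first 3 (milliseconds).
as_text = as_text.str[:-3]
print(as_text.tolist())
# ['00:00:00.000', '00:00:02.749', '00:03:27.540', '00:07:00.540']
```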

Then, to get your result:

  1. Compute timNxt: the Time column shifted by 1 position and converted to datetime

    timNxt = pd.to_datetime(df.Time.shift(-1))
    
  2. Define the following "replication" function:

    def myRepl(row):
        timCurr = pd.to_datetime(row.Time)
        timNext = timNxt[row.name]
        tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
        if pd.notna(timNext):
            n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
            tbl.extend([ [(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                row.Label + f' + {i}min'] for i in range(1, n)])
        return pd.DataFrame(tbl, columns=row.index)
    
  3. Apply it to each row of df and concatenate the results:

    result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
    

The result is:

              Time              Label
0  00:00:00.000000          Segment 1
1  00:00:02.749000           baseline
2  00:01:02.749000    baseline + 1min
3  00:02:02.749000    baseline + 2min
4  00:03:02.749000    baseline + 3min
5  00:03:27.540000         begin test
6  00:04:27.540000  begin test + 1min
7  00:05:27.540000  begin test + 2min
8  00:06:27.540000  begin test + 3min
9  00:07:00.540000        end of test

The Time column of the resulting DataFrame is also a string, but at least the fractional part of the seconds now has 6 digits everywhere.
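
Because the question says Timedelta values are acceptable, the string column can be converted back if arithmetic on it is needed later. A minimal sketch, using a hypothetical three-value sample in the format shown above rather than the full result:

```python
import pandas as pd

# Hypothetical sample of strings in the HH:MM:SS.ffffff format shown above.
times = pd.Series(["00:00:02.749000", "00:01:02.749000", "00:03:27.540000"])

# pd.to_timedelta parses this format directly.
as_td = pd.to_timedelta(times)

# With real Timedelta values, differences between rows are straightforward.
print(as_td.diff())
```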