Adding and filling rows in a df based on conditions

Time: 2020-08-17 07:37:18

Tags: python pandas dataframe

I have a df like this:

Timestamp                                 Time  Power    Total Energy              ID     Energy
2020-04-09 06:45:00 2020-04-09 06:40:40.559719   7500       5636690.0               1      140.0    
2020-04-09 06:46:00 2020-04-09 06:40:40.559719   7500       5636710.0               1      160.0    
2020-04-09 06:47:00                        NaT    NaN             NaN             NaN        NaN    
2020-04-09 06:48:00 2020-04-09 06:40:40.559719   7500       5636960.0               1      410.0
2020-04-09 06:49:00                        NaT    NaN             NaN             NaN        NaN
2020-04-09 06:50:00                        NaT    NaN             NaN             NaN        NaN
2020-04-09 06:51:00                        NaT    NaN             NaN             NaN        NaN
...                                        ...    ...             ...             ...        ...
2020-04-30 23:55:00 2020-04-29 16:30:38.559871   7500      18569270.0               5      100.0
2020-04-30 23:54:00                        NaT    NaN             NaN             NaN        NaN
2020-04-30 23:55:00 2020-04-29 16:30:38.559871   7500      18569370.0               5      180.0

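I have to adjust/add some values:

  1. Add rows for df['Time'] > df['Timestamp']: df['Timestamp'] at 1-minute intervals; df['Time'] = the existing df['Time'] entry; df['Power'] = df['Energy'] / delta t (the difference, in hours, between Time and the existing timestamp); df['Total Energy'], df['ID'] and df['Energy'] filled like df['Time']
  2. Fill the NaN/NaT values in regions where Time does not change (using bfill or ffill)
  3. Fill the NaN/NaT values between two different df['Time'] entries with 0, except for df['Total Energy'], which takes the last entry (ffill)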

Expected result:

Timestamp                                 Time  Power    Total Energy              ID     Energy
2020-04-09 06:41:00 2020-04-09 06:40:40.559719   2100       5636690.0               1      140.0    
2020-04-09 06:42:00 2020-04-09 06:40:40.559719   2100       5636690.0               1      140.0    
2020-04-09 06:43:00 2020-04-09 06:40:40.559719   2100       5636690.0               1      140.0    
2020-04-09 06:44:00 2020-04-09 06:40:40.559719   2100       5636690.0               1      140.0
2020-04-09 06:45:00 2020-04-09 06:40:40.559719   7500       5636690.0               1      140.0    
2020-04-09 06:46:00 2020-04-09 06:40:40.559719   7500       5636710.0               1      160.0    
2020-04-09 06:47:00 2020-04-09 06:40:40.559719   7500       5636710.0               1      160.0    
2020-04-09 06:48:00 2020-04-09 06:40:40.559719   7500       5636960.0               1      410.0
2020-04-09 06:49:00                         -       0       5636960.0               -          0
2020-04-09 06:50:00                         -       0       5636960.0               -          0
2020-04-09 06:51:00                         -       0       5636960.0               -          0
...                                        ...    ...             ...             ...        ...
2020-04-30 23:55:00 2020-04-29 16:30:38.559871   7500      18569270.0               5      100.0
2020-04-30 23:54:00 2020-04-29 16:30:38.559871   7500      18569270.0               5      100.0
2020-04-30 23:55:00 2020-04-29 16:30:38.559871   7500      18569370.0               5      180.0

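I think the solution has something to do with ffill() in some cases, but unfortunately I don't know how to formulate it.

Edit: here is a sample of my code: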

import pandas as pd

# Missing entries are given as the strings 'NaT'/'NaN' and converted below
df = pd.DataFrame({"Time": ["2020-04-09 06:40:40.559719","2020-04-09 06:40:40.559719", 'NaT', "2020-04-09 06:40:40.559719", 'NaT', 'NaT', 'NaT', '2020-04-09 16:50:38.559871', 'NaT', '2020-04-29 16:50:38.559871'],
              "Power": [7500, 6000, 'NaN', 6000, 'NaN', 'NaN', 'NaN', 3600, 'NaN', 4200],
              "Total Energy": [5000, 5100, 'NaN', 5300, 'NaN', 'NaN', 'NaN', 5360, 'NaN', 5500],
              "ID": [1, 1, 'NaN', 1, 'NaN', 'NaN', 'NaN', 2, 'NaN', 2],
              "Energy": [500, 600, 'NaN', 800, 'NaN', 'NaN', 'NaN', 60, 'NaN', 200]},
              index=pd.date_range(start = "2020-04-09 6:45", periods = 10, freq = 'min'))

df['Time'] = pd.to_datetime(df['Time'])
df['Power'] = pd.to_numeric(df['Power'], errors = 'coerce')
df['Total Energy'] = pd.to_numeric(df['Total Energy'], errors = 'coerce')
df['ID'] = pd.to_numeric(df['ID'], errors = 'coerce')
df['Energy'] = pd.to_numeric(df['Energy'], errors = 'coerce')

df

Expected result:

                    Time                       Power    Total Energy    ID  Energy
2020-04-09 06:41:00 2020-04-09 06:40:40.559719   0      4500.0          1.0 0
2020-04-09 06:42:00 2020-04-09 06:40:40.559719   7500.0 4625.0          1.0 125.0
2020-04-09 06:43:00 2020-04-09 06:40:40.559719   7500.0 4750.0          1.0 250.0
2020-04-09 06:44:00 2020-04-09 06:40:40.559719   7500.0 4875.0          1.0 375.0
2020-04-09 06:45:00 2020-04-09 06:40:40.559719   7500.0 5000.0          1.0 500.0
2020-04-09 06:46:00 2020-04-09 06:40:40.559719   6000.0 5100.0          1.0 600.0
2020-04-09 06:47:00 2020-04-09 06:40:40.559719   6000.0 5200.0          1.0 700.0
2020-04-09 06:48:00 2020-04-09 06:40:40.559719   6000.0 5300.0          1.0 800.0
2020-04-09 06:49:00 -                           0       5300.0          -   0
2020-04-09 06:50:00 -                           0       5300.0          -   0
2020-04-09 06:51:00 2020-04-09 16:50:38.559871  0       5300.0          2.0 0
2020-04-09 06:52:00 2020-04-09 16:50:38.559871  3600.0  5360.0          2.0 60.0
2020-04-09 06:53:00 2020-04-09 16:50:38.559871  4200.0  5430.0          2.0 130.0
2020-04-09 06:54:00 2020-04-29 16:50:38.559871  4200.0  5500.0          2.0 200.0
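
  1. df['Time']: create new rows until df['Timestamp'] = df['Time']
  2. Fill the new rows: df['Energy'] = 0 in the first row, then fill linearly; df['Power'] = 0 in the first row, then df['Power'] = df['Energy'] / (1/60); df['Time'] and df['ID'] filled with bfill(); df['Total Energy'] = the sum of df['Energy']
  3. Rows between two different Times: fill as shown in the expected result
  4. NaN values inside a time series (e.g. at 2020-04-09 06:47:00): df['Time'] and df['ID'] with ffill(); df['Energy'] = the difference between the existing rows (if there are several NaN rows, interpolate linearly); df['Total Energy'] = old value + df['Energy']; df['Power'] = df['Energy'] / (1/60)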

Thanks for your help

1 answer:

Answer 0: (score: 0)

It seems to me that a few different functions may be needed:

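  1. Delta t: you can use shift() to get the lead or lag value, and then compute the difference.
  2. To fill NaN/NaT values, you can use fillna(val) for a constant value, or bfill()/ffill() to propagate neighbouring values:
    Backfill: df['Column'].bfill()
    Forward fill: df['Column'].ffill()
  3. The fill methods can be used as described above. The column can then be overwritten with new values based on a condition: np.where(condition, value if condition met, value if condition not met)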
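
The three tools listed above can be tried on a small frame. This is only a minimal sketch on made-up data: the frame `demo` and the helper column names (`energy_delta`, `ID_ffill`, `ID_bfill`, `Energy_filled`) are assumptions for illustration, not part of the question's df.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration, one row per minute
idx = pd.date_range("2020-04-09 06:45", periods=5, freq="min")
demo = pd.DataFrame(
    {"Energy": [500.0, 600.0, np.nan, 800.0, np.nan],
     "ID": [1.0, 1.0, np.nan, 1.0, np.nan]},
    index=idx,
)

# 1. Difference to the previous row via shift()
demo["energy_delta"] = demo["Energy"] - demo["Energy"].shift(1)

# 2. Propagate the last known value forward / the next known value backward
demo["ID_ffill"] = demo["ID"].ffill()
demo["ID_bfill"] = demo["ID"].bfill()

# 3. Overwrite values based on a condition: 0 where Energy is missing
demo["Energy_filled"] = np.where(demo["Energy"].isna(), 0.0, demo["Energy"])

print(demo)
```

Note that a delta is NaN whenever either of the two rows involved is NaN, so in practice the fill step usually comes before the delta step.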

For example, to build the Total Energy column once the Energy column is complete, you can use:

import numpy as np

# 1. First forward-fill the missing Total Energy values
df['Total Energy'] = df['Total Energy'].ffill()
# 2. Find the row-to-row deltas
df['energy_delta'] = df['Energy'] - df['Energy'].shift(1)
df['t_energy_delta'] = df['Total Energy'] - df['Total Energy'].shift(1)
# 3. Correct the Total Energy column to take the delta into account
df['Total Energy'] = np.where(df['energy_delta']>df['t_energy_delta'], df['Total Energy']+df['energy_delta'], df['Total Energy'])

It's a bit verbose, but I think it will get the job done. There may be a better way.
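
One possible alternative for the gap rows (rules 3 and 4 of the question) is sketched below. It is an assumption-laden sketch, not a full solution: it rebuilds the question's sample values without the Time column, decides whether a gap lies inside one run by comparing the forward-filled and backward-filled ID, and does not create the extra leading rows from rule 1. The mask names `missing`, `inside` and `between` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question (Time column left out for brevity)
nan = np.nan
idx = pd.date_range("2020-04-09 06:45", periods=10, freq="min")
df = pd.DataFrame(
    {"Power": [7500, 6000, nan, 6000, nan, nan, nan, 3600, nan, 4200],
     "Total Energy": [5000, 5100, nan, 5300, nan, nan, nan, 5360, nan, 5500],
     "ID": [1, 1, nan, 1, nan, nan, nan, 2, nan, 2],
     "Energy": [500, 600, nan, 800, nan, nan, nan, 60, nan, 200]},
    index=idx,
)

missing = df["ID"].isna()
# Assumption: a gap is inside one run when the nearest known IDs
# before and after it are the same
inside = missing & (df["ID"].ffill() == df["ID"].bfill())
between = missing & ~inside

# Inside a run: carry the ID forward, interpolate the energy columns linearly
df["ID"] = df["ID"].where(~inside, df["ID"].ffill())
interp = df[["Energy", "Total Energy"]].interpolate(method="linear")
df.loc[inside, ["Energy", "Total Energy"]] = interp.loc[inside]

# Between runs: Energy is 0 and Total Energy keeps its last value
df.loc[between, "Energy"] = 0.0
df.loc[between, "Total Energy"] = df["Total Energy"].ffill()[between]

# Power from the per-minute energy difference, Energy / (1/60), as in the question
df.loc[inside, "Power"] = df["Energy"].diff()[inside] * 60
df.loc[between, "Power"] = 0.0

print(df)
```

Comparing IDs rather than Time values is a design choice made here because the last two Time entries in the sample differ while the expected result treats them as one run; if Time is the reliable key in the real data, the same masks could be built from df['Time'] instead.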