I have a DataFrame with the following columns:
DeviceId | Timestamp | Total_Data
001 08/12/2014 500
001 08/13/2014 600
001 08/14/2014 750
001 08/15/2014 150 (device rebooted here) (correct value: 750 + 150)
001 08/16/2014 300 (correct value: 750 + 150 + 300)
002 10/01/2014 98
...
..
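For a number of different devices, I have the data they consumed on different occasions (indicated by the timestamp).
The Total_Data column is cumulative: for a given device, it records the total data consumed over time. For example, if device A uses 3 KB on 12 August 2012 and 5 KB on 14 August 2012, the DataFrame will have two entries, and the second entry's Total_Data value will be 8 KB.
The glitch, however, is that the cumulative value resets to 0 (and starts counting again) whenever the device reboots, so the values need to be corrected. What is the best way in Pandas to modify the current DataFrame to fix this?
So far I have considered iterating over the DataFrame row by row, but that seems too complicated.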
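To make the sample reproducible, the rows above could be loaded roughly like this. It is only a sketch: the column names follow the table header, the dates are assumed to be month/day/year, and the rows hidden behind the ellipsis are left out.

```python
import pandas as pd

# Sample rows copied from the table above (elided rows are not reconstructed).
df = pd.DataFrame({
    "DeviceId": ["001", "001", "001", "001", "001", "002"],
    "Timestamp": ["08/12/2014", "08/13/2014", "08/14/2014",
                  "08/15/2014", "08/16/2014", "10/01/2014"],
    "Total_Data": [500, 600, 750, 150, 300, 98],
})
# Assumption: timestamps are month/day/year.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%m/%d/%Y")

# The reboot shows up as a drop in the counter within a single device.
drop = df.groupby("DeviceId")["Total_Data"].diff() < 0
print(df[drop])
```

Only the 08/15/2014 row of device 001 should be selected; the first row of device 002 is not compared against device 001.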
Answer 0 (score: 0)
The code is as follows:
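# Assumes `import pandas as pd` and a DataFrame `df` holding one device's rows
# in time order, with the cumulative counter in a TotalData column.
# Start a new group each time the counter fails to increase (a reboot).
grouped = df.groupby((df.TotalData.diff() <= 0).cumsum())
parts = [g.reset_index(drop=True) for k, g in grouped]
for i in range(1, len(parts)):
    # Accumulate the readings after a reboot and add the previous part's maximum.
    parts[i]['TotalData'] = parts[i]['TotalData'].cumsum().add(parts[i-1]['TotalData'].max())
DF = pd.concat(parts)
print(DF)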
Result:
        Date  TotalData
0 2014-08-12        500
1 2014-08-13        600
2 2014-08-14        750
0 2014-08-15        900
1 2014-08-16       1200
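The snippet above handles the rows of a single device. A possible per-device variant of the same idea is sketched below. It keeps this answer's reading of the data, in which each value after a reboot is treated as an increment (hence 750 + 150 + 300 = 1200); whether that is the right reading depends on what the counter actually reports after a reboot. The function name and the DeviceId column are assumptions made for illustration.

```python
import pandas as pd

def accumulate_per_device(df, key="DeviceId", value="TotalData"):
    """Hypothetical helper; rows must be in time order within each device."""
    delta = df.groupby(key)[value].diff()
    # A reboot is assumed wherever the reading does not increase.
    reboot = delta <= 0
    # Segment number per device: 0 before the first reboot, 1 after it, ...
    segment = reboot.astype(int).groupby(df[key]).cumsum()
    # Before the first reboot the increment is the change since the previous
    # reading (the reading itself on a device's first row); after a reboot
    # every reading is taken as an increment.
    increments = delta.fillna(df[value]).where(segment == 0, df[value])
    return increments.groupby(df[key]).cumsum()

df = pd.DataFrame({
    "DeviceId": ["001"] * 5 + ["002"],
    "TotalData": [500, 600, 750, 150, 300, 98],
})
df["TotalDataFixed"] = accumulate_per_device(df)
print(df)
```

Because everything is expressed with groupby, diff and cumsum, no explicit loop over the parts is needed and the original index is kept.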
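Answer 1 (score: -1)
Here is a code sample that solves the problem. I assume that if there was a reboot, the TotalData value is smaller than the previous one.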
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date': pd.Series([datetime(2014, 8, 12),
                                      datetime(2014, 8, 13),
                                      datetime(2014, 8, 14),
                                      datetime(2014, 8, 15),
                                      datetime(2014, 8, 16)]),
                   'TotalData': pd.Series([500, 600, 750, 150, 300])
                   })
# Previous reading for each row; the first row has none, so fill it with 0.
df['PrevTotalData'] = df['TotalData'].shift(1, fill_value=0)
# Assuming here that the amount of data on the day after a reboot is always
# less than the total amount of data on the previous day
rebooted = df['PrevTotalData'] > df['TotalData']
df['DataBeforeLastReboot'] = 0
df.loc[rebooted, 'DataBeforeLastReboot'] = df.loc[rebooted, 'PrevTotalData']
# Running total of everything counted before each reboot.
df['DataBeforeLastReboot'] = df['DataBeforeLastReboot'].cumsum()
df['TotalDataFixed'] = df['TotalData'] + df['DataBeforeLastReboot']
print(df)
The DataFrame before:
        Date  TotalData
0 2014-08-12        500
1 2014-08-13        600
2 2014-08-14        750
3 2014-08-15        150
4 2014-08-16        300
After:
        Date  TotalData  PrevTotalData  DataBeforeLastReboot  TotalDataFixed
0 2014-08-12        500              0                     0             500
1 2014-08-13        600            500                     0             600
2 2014-08-14        750            600                     0             750
3 2014-08-15        150            750                   750             900
4 2014-08-16        300            150                   750            1050
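The reboot rule and the resulting offset can also be looked at in isolation. The following is a condensed sketch of the same logic on a bare Series, not a different method; the variable names are chosen here for illustration.

```python
import pandas as pd

total = pd.Series([500, 600, 750, 150, 300], name="TotalData")

# A reboot is assumed wherever a reading is lower than the one before it.
rebooted = total.diff() < 0

# Offset to add: the last reading seen before each reboot, accumulated.
offset = total.shift(1).where(rebooted, 0).cumsum()
fixed = (total + offset).astype(int)
print(fixed.tolist())
```

This should reproduce the TotalDataFixed column above. Note the limit of the assumption: if a device reboots and its counter climbs past the previous reading before the next sample is taken, the reboot cannot be detected from the values alone.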
If the DataFrame contains several machines, the solution is a bit more complicated.
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Machine': pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]),
                   'Date': pd.Series([datetime(2014, 8, 12),
                                      datetime(2014, 8, 13),
                                      datetime(2014, 8, 14),
                                      datetime(2014, 8, 15),
                                      datetime(2014, 8, 16),
                                      datetime(2014, 8, 12),
                                      datetime(2014, 8, 13),
                                      datetime(2014, 8, 14),
                                      datetime(2014, 8, 15),
                                      datetime(2014, 8, 16)]),
                   'TotalData': pd.Series([500, 600, 750, 150, 300,
                                           100, 200, 300, 100, 200])
                   })
df = df.reset_index(drop=True)
df = df.set_index(['Machine', 'Date'])
grouped_by_machine = df.groupby(level=[0])
# Previous reading within the same machine; 0 for each machine's first row.
df['PrevTotalData'] = grouped_by_machine['TotalData'].shift(1)
df['PrevTotalData'] = df['PrevTotalData'].fillna(value=0).astype(int)
# Assuming here that the amount of data on the day after a reboot is always
# less than the total amount of data on the previous day
rebooted = df['PrevTotalData'] > df['TotalData']
df['DataBeforeLastReboot'] = 0
df.loc[rebooted, 'DataBeforeLastReboot'] = df.loc[rebooted, 'PrevTotalData']
# Group again so that the newly added column is part of the grouped data.
df['DataBeforeLastReboot'] = df.groupby(level=0)['DataBeforeLastReboot'].cumsum()
df['TotalDataFixed'] = df['TotalData'] + df['DataBeforeLastReboot']
print(df)
The DataFrame before:
                    TotalData
Machine Date
0       2014-08-12        500
        2014-08-13        600
        2014-08-14        750
        2014-08-15        150
        2014-08-16        300
1       2014-08-12        100
        2014-08-13        200
        2014-08-14        300
        2014-08-15        100
        2014-08-16        200
After:
                    TotalData  PrevTotalData  DataBeforeLastReboot  TotalDataFixed
Machine Date
0       2014-08-12        500              0                     0             500
        2014-08-13        600            500                     0             600
        2014-08-14        750            600                     0             750
        2014-08-15        150            750                   750             900
        2014-08-16        300            150                   750            1050
1       2014-08-12        100              0                     0             100
        2014-08-13        200            100                     0             200
        2014-08-14        300            200                     0             300
        2014-08-15        100            300                   300             400
        2014-08-16        200            100                   300             500
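A more compact way to get the same numbers, sketched here under the same assumption (a reboot is any drop in the counter), is to turn the readings into per-row increments and re-accumulate them per machine. The function name is hypothetical, Machine is kept as an ordinary column instead of an index level for brevity, and the rows are assumed to be in time order within each machine.

```python
import pandas as pd

def fix_counter_resets(df, key="Machine", value="TotalData"):
    """Hypothetical helper: rebuild a cumulative counter across resets."""
    # Change since the previous reading of the same machine
    # (NaN on each machine's first row).
    delta = df.groupby(key)[value].diff()
    # On a first row or after a reset the counter started from zero,
    # so the reading itself is the increment.
    increments = delta.where(delta >= 0, df[value])
    return increments.groupby(df[key]).cumsum().astype(df[value].dtype)

df = pd.DataFrame({
    "Machine": [0] * 5 + [1] * 5,
    "TotalData": [500, 600, 750, 150, 300, 100, 200, 300, 100, 200],
})
df["TotalDataFixed"] = fix_counter_resets(df)
print(df)
```

On the sample data this should agree with the TotalDataFixed column above, without the intermediate PrevTotalData and DataBeforeLastReboot columns.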