I have a data frame containing cumulative count data. An example can be generated as follows (feel free to skip):
import numpy as np
import pandas as pd

cols = ['Start', 'End', 'Count']
# Mixing strings and ints in one NumPy array turns every value into a string.
data = np.array([
    '2020-1-1', '2020-1-2', 4,
    '2020-1-1', '2020-1-3', 6,
    '2020-1-1', '2020-1-4', 8,
    '2020-2-1', '2020-2-2', 3,
    '2020-2-1', '2020-2-3', 4,
    '2020-2-1', '2020-2-4', 4])
data = data.reshape((6, 3))
df = pd.DataFrame(columns=cols, data=data)
df['Start'] = pd.to_datetime(df.Start)
df['End'] = pd.to_datetime(df.End)
df['Count'] = df['Count'].astype(int)  # restore the numeric dtype
This gives the following data frame:
Start End Count
2020-1-1 2020-1-2 4
2020-1-1 2020-1-3 6
2020-1-1 2020-1-4 8
2020-2-1 2020-2-2 3
2020-2-1 2020-2-3 4
2020-2-1 2020-2-4 4
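The counts are cumulative (the accumulation begins at Start), and I want to undo the accumulation to get the following (note the change in the dates):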
Start End Count
2020-1-1 2020-1-2 4
2020-1-2 2020-1-3 2
2020-1-3 2020-1-4 2
2020-2-1 2020-2-2 3
2020-2-2 2020-2-3 1
2020-2-3 2020-2-4 0
I want to do this per grouping variable. It can be done naively as follows:
lst = []
# 'grouping_variable' stands for a further column that the example above does not include
for start, data in df.groupby(['Start', 'grouping_variable']):
    data = data.sort_values('End')
    diff = data.Count.diff()
    diff.iloc[0] = data.Count.iloc[0]
    # each row starts where the previous row ended
    start_dates = [data.Start.iloc[0]] + list(data.End.iloc[:-1])
    data = data.assign(Start=start_dates,
                       Count=diff)
    lst.append(data)
df = pd.concat(lst)
This doesn't feel "right", "pythonic", or "clean" in any way. Is there a better approach? Perhaps pandas has a specific method for this?
Answer 0 (score: 1)
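IIUC, we can combine cumcount with a boolean comparison to flag the first row of each unique Start date group, then use np.where together with shift to keep the original values on those first rows and take the difference from the previous row everywhere else.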
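import numpy as np
# Count must be numeric for the subtraction below
df['Count'] = df['Count'].astype(int)
# True on the first row of each Start group (rows are assumed sorted by Start, then End)
s = df.groupby(['Start']).cumcount() == 0
df['Count'] = np.where(s, df['Count'], df['Count'] - df['Count'].shift())
df['Start'] = np.where(s, df['Start'], df['End'].shift(1))
print(df)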
Start End Count
0 2020-01-01 2020-01-02 4.0
1 2020-01-02 2020-01-03 2.0
2 2020-01-03 2020-01-04 2.0
3 2020-02-01 2020-02-02 3.0
4 2020-02-02 2020-02-03 1.0
5 2020-02-03 2020-02-04 0.0
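The question also asks about a grouping variable, which the snippet above does not use. A minimal sketch of a group-wise variant, assuming a hypothetical constant column named grouping_variable (the question mentions the name but never defines the column), could rely on groupby(...).diff() and groupby(...).shift() so that nothing leaks across group boundaries and Count stays an integer:

```python
import pandas as pd

# Same example data as in the question, built from a dict so Count stays numeric.
df = pd.DataFrame({
    'Start': pd.to_datetime(['2020-1-1'] * 3 + ['2020-2-1'] * 3),
    'End': pd.to_datetime(['2020-1-2', '2020-1-3', '2020-1-4',
                           '2020-2-2', '2020-2-3', '2020-2-4']),
    'Count': [4, 6, 8, 3, 4, 4],
})
# Hypothetical grouping column: an assumption made only for illustration.
df['grouping_variable'] = 'a'

keys = ['Start', 'grouping_variable']
df = df.sort_values(keys + ['End']).reset_index(drop=True)
grouped = df.groupby(keys)

# diff() within each group leaves NaN on the first row; keep the original count there.
counts = grouped['Count'].diff().fillna(df['Count']).astype(int)
# shift() within each group leaves NaT on the first row; keep the original Start there.
starts = grouped['End'].shift().fillna(df['Start'])

result = df.assign(Start=starts, Count=counts)
print(result)
```

Because diff and shift are evaluated per group, this version does not depend on masking the first row of each group by hand, although it still sorts by End within each group first.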