I have a df that contains information about various places.
import pandas as pd
d = ({
    'C' : ['08:00:00','XX','08:10:00','XX','08:41:42','XX','08:50:00','XX', '09:00:00', 'XX','09:15:00','XX','09:21:00','XX','09:30:00','XX','09:40:00','XX'],
    'D' : ['Home','','Home','','Away','','Home','','Away','','Home','','Home','','Away','','Home',''],
    'E' : ['Num:','','Num:','','Num:','','Num:','','Num:', '','Num:','','Num:','','Num:', '','Num:', ''],
    'F' : ['1','','1','','1','','1','','1', '','2','','2','','1', '','2',''],
    'A' : ['A','','A','','A','','A','','A','','A','','A','','A','','A',''],
    'B' : ['Stop','','Res','','Stop','','Start','','Res','','Stop','','Res','','Start','','Start','']
})
df = pd.DataFrame(data=d)
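I want to export this data separately for each place, where the places are labelled in Column D. I also want to add new columns based on the functions labelled in Column B.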
df['C'] = pd.to_timedelta(df['C'], errors="coerce").dt.total_seconds()
incl = ['Home', 'Away']
for k, g in df[df.D.isin(incl)].groupby('D'):
    Stop = g.loc[df['B'] == 'Stop'].reset_index()['C']
    Start = g.loc[df['B'] == 'Start'].reset_index()['C']
    Res = g.loc[df['B'] == 'Res'].reset_index()['C']
    g['Start_diff'] = Start - Stop
    g['Res_diff'] = Start - Res
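The problem is that these functions occur more than once, and each occurrence is numbered in Column F. So if we look at the export for Home, we only get the differences for the first occurrence in Column F.
Output: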
    A      B      C     D     E  F  Start_diff  Res_diff
0   A   Stop  28800  Home  Num:  1        3000      2400
2   A    Res  29400  Home  Num:  1
6   A  Start  31800  Home  Num:  1
10  A   Stop  33300  Home  Num:  2
12  A    Res  33660  Home  Num:  2
16  A  Start  34800  Home  Num:  2
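A likely reason that only row 0 is filled is index alignment. After reset_index() the selections are labelled 0, 1, ..., while g still carries the original row labels, so only label 0 matches on assignment. The sketch below reproduces this on the Home rows alone. The frame is rebuilt by hand for illustration, and the lowercase names stop, start and diff are hypothetical.

```python
import pandas as pd

# Times (in seconds) for the 'Home' group, keeping the original row labels.
g = pd.DataFrame(
    {"B": ["Stop", "Res", "Start", "Stop", "Res", "Start"],
     "C": [28800.0, 29400.0, 31800.0, 33300.0, 33660.0, 34800.0]},
    index=[0, 2, 6, 10, 12, 16],
)

# reset_index relabels each selection as 0, 1, ...
stop = g.loc[g["B"] == "Stop", "C"].reset_index(drop=True)
start = g.loc[g["B"] == "Start", "C"].reset_index(drop=True)

diff = start - stop      # both differences exist here, under labels 0 and 1
g["Start_diff"] = diff   # assignment aligns on labels; g has a row 0 but no row 1
print(diff)
print(g)
```

Both differences are computed, but the second one has no matching row label in g and is dropped on assignment.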
Whereas my expected output is:
    A      B      C     D     E  F  Start_diff  Res_diff
0   A   Stop  28800  Home  Num:  1        3000      2400
2   A    Res  29400  Home  Num:  1
6   A  Start  31800  Home  Num:  1
10  A   Stop  33300  Home  Num:  2        1500      1140
12  A    Res  33660  Home  Num:  2
16  A  Start  34800  Home  Num:  2
I tried changing
for k, g in df[df.D.isin(incl)].groupby('D'):
to
for k, g in df[df.D.isin(incl)].groupby('D').F.nunique():
but I get the error TypeError: 'numpy.int64' object is not iterable
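The error is probably explained by what nunique() returns: a Series with one integer count per value of D. Iterating over that Series yields bare integers, not (key, group) pairs, so unpacking into k, g fails. A minimal sketch of iterating over both columns instead, assuming each B label occurs exactly once per (D, F) pair. The cut-down frame and the names counts, results, place, num and times are hypothetical.

```python
import pandas as pd

# A cut-down frame with the same column names as above (C already in seconds).
df = pd.DataFrame({
    "D": ["Home", "Home", "Home", "Home", "Home", "Home"],
    "F": ["1", "1", "1", "2", "2", "2"],
    "B": ["Stop", "Res", "Start", "Stop", "Res", "Start"],
    "C": [28800.0, 29400.0, 31800.0, 33300.0, 33660.0, 34800.0],
})

# A Series of integers, one per value of D; iterating it yields counts only.
counts = df.groupby("D")["F"].nunique()
print(counts)

# Grouping by a list of columns yields ((D, F), sub-frame) pairs instead.
results = {}
for (place, num), g in df.groupby(["D", "F"]):
    times = g.set_index("B")["C"]  # assumes one row per B label in each group
    results[(place, num)] = (times["Start"] - times["Stop"],
                             times["Start"] - times["Res"])
print(results)
```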
Answer 0 (score: 0)
I think you need groupby on columns D and F with a custom function, and then mask to replace the duplicated values:
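def f(g):
    # Pick the time of each event within this (D, F) group.
    Stop = g.loc[g['B'] == 'Stop', 'C']
    Start = g.loc[g['B'] == 'Start', 'C']
    Res = g.loc[g['B'] == 'Res', 'C']
    g['Start_diff'] = Start.values[0] - Stop.values[0]
    g['Res_diff'] = Start.values[0] - Res.values[0]
    return g

# group_keys=False keeps the original row index; on pandas 2.0+ the group keys are prepended otherwise.
df = df[df.D.isin(incl)].groupby(['D', 'F'], group_keys=False).apply(f)
# Keep the differences only on the first row of each (D, F) group.
df[['Start_diff', 'Res_diff']] = df[['Start_diff', 'Res_diff']].mask(df.duplicated(['D','F']))
print(df)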
          C     D     E  F  A      B  Start_diff  Res_diff
0   28800.0  Home  Num:  1  A   Stop      3000.0    2400.0
2   29400.0  Home  Num:  1  A    Res         NaN       NaN
4   31302.0  Away  Num:  1  A   Stop      2898.0    1800.0
6   31800.0  Home  Num:  1  A  Start         NaN       NaN
8   32400.0  Away  Num:  1  A    Res         NaN       NaN
10  33300.0  Home  Num:  2  A   Stop      1500.0    1140.0
12  33660.0  Home  Num:  2  A    Res         NaN       NaN
14  34200.0  Away  Num:  1  A  Start         NaN       NaN
16  34800.0  Home  Num:  2  A  Start         NaN       NaN
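As an alternative that avoids groupby.apply, the same result can probably be obtained by pivoting the B labels into columns and joining the differences back. This is only a sketch, assuming each B label occurs exactly once per (D, F) pair. The frame below is a hand-built copy of the non-blank rows, and the names wide, diffs, out and repeat are hypothetical.

```python
import pandas as pd

# Only the non-blank rows of the frame above, with C already in seconds.
df = pd.DataFrame(
    {
        "C": [28800.0, 29400.0, 31302.0, 31800.0, 32400.0,
              33300.0, 33660.0, 34200.0, 34800.0],
        "D": ["Home", "Home", "Away", "Home", "Away",
              "Home", "Home", "Away", "Home"],
        "F": ["1", "1", "1", "1", "1", "2", "2", "1", "2"],
        "B": ["Stop", "Res", "Stop", "Start", "Res",
              "Stop", "Res", "Start", "Start"],
    },
    index=[0, 2, 4, 6, 8, 10, 12, 14, 16],
)

# One row per (D, F) pair and one column per B label.
wide = df.pivot(index=["D", "F"], columns="B", values="C")
diffs = pd.DataFrame({
    "Start_diff": wide["Start"] - wide["Stop"],
    "Res_diff": wide["Start"] - wide["Res"],
})

# Attach the differences to every row, then blank all but the first row of each pair.
out = df.join(diffs, on=["D", "F"])
repeat = out.duplicated(["D", "F"])
out[["Start_diff", "Res_diff"]] = out[["Start_diff", "Res_diff"]].mask(repeat)
print(out)
```

Because the join is on columns, the original row labels are kept, so the result lines up with the output above.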