Hi, I have a dataset:
BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
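I want to write a loop that matches rows on two conditions: OID is the same as VID, and ArrTime is matched to the closest TTime.
If both conditions are met, the desired result would look like this: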
BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:04:33,7208,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:21:07,7216,1/1/2018,06:21:17 AM,7216
1/1/2018,72,06:30:54,7218,1/1/2018,06:31:04 AM,7218
1/1/2018,72,13:39:45,7209,1/1/2018,1:40:38 PM,7209
Otherwise, the row should be written to another file:
BDate,Snum,ArrTime,OID
1/1/2018,80,06:29:01,8026
1/1/2018,80,06:35:26,8018
1/1/2018,72,09:38:34,7211
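I would like to know whether I need pandas and a DataFrame for this, or whether I can do it without those libraries. I need a direction to get started! Thanks; I will update the question if I start writing any code.
Edit: the two extra rows of data have empty fields.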
Answer 0 (score: 1)
Use merge_asof.
First, use to_datetime with the format parameter, together with mask, so that the AM/PM times are parsed correctly:
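import numpy as np
import pandas as pd
df['Date1'] = pd.to_datetime(df['BDate'] + ' ' + df['ArrTime'], format='%d/%m/%Y %H:%M:%S')
# Parse TTime twice: %I applies the AM/PM marker, %H keeps the hour as written.
dates_12h = pd.to_datetime(df['TDate'] + ' ' + df['TTime'], format='%d/%m/%Y %I:%M:%S %p')
dates_24h = pd.to_datetime(df['TDate'] + ' ' + df['TTime'], format='%d/%m/%Y %H:%M:%S %p')
# Rows ending in 'AM' take the %H parse; the remaining (PM) rows keep the %I parse.
df['Date2'] = dates_12h.mask(df['TTime'].str.endswith('AM', na=False), dates_24h)
print(df)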
BDate Snum ArrTime OID TDate TTime VID \
0 1/1/2018 72 05:59:01 7214 1/1/2018 12:06:20 AM 7206.0
1 1/1/2018 72 06:04:33 7208 1/1/2018 12:36:31 AM 7205.0
2 1/1/2018 72 06:21:07 7216 1/1/2018 5:53:49 AM 7220.0
3 1/1/2018 80 06:29:01 8026 1/1/2018 5:59:10 AM 7214.0
4 1/1/2018 72 06:30:54 7218 1/1/2018 6:04:55 AM 7208.0
5 1/1/2018 72 06:33:54 7221 1/1/2018 06:21:17 AM 7216.0
6 1/1/2018 80 06:35:26 8018 1/1/2018 06:31:04 AM 7218.0
7 1/1/2018 72 09:38:34 7211 1/1/2018 1:40:38 PM 7209.0
8 1/1/2018 72 13:39:45 7209 NaN NaN NaN
Date1 Date2
0 2018-01-01 05:59:01 2018-01-01 12:06:20
1 2018-01-01 06:04:33 2018-01-01 12:36:31
2 2018-01-01 06:21:07 2018-01-01 05:53:49
3 2018-01-01 06:29:01 2018-01-01 05:59:10
4 2018-01-01 06:30:54 2018-01-01 06:04:55
5 2018-01-01 06:33:54 2018-01-01 06:21:17
6 2018-01-01 06:35:26 2018-01-01 06:31:04
7 2018-01-01 09:38:34 2018-01-01 13:40:38
8 2018-01-01 13:39:45 NaT
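A side note on the two formats above: as far as I know, %p only changes the hour when it is paired with %I, which is why 12:06:20 AM shows up as 12:06:20 in Date2 (the AM rows come from the %H parse). A minimal sketch on three of the sample values; the names times, with_I and with_H are made up for illustration:

```python
import pandas as pd

# Three TDate + TTime values taken from the sample data (whitespace stripped).
times = pd.Series([
    '1/1/2018 12:06:20 AM',
    '1/1/2018 5:59:10 AM',
    '1/1/2018 1:40:38 PM',
])

# %I reads a 12-hour clock and applies %p to it.
with_I = pd.to_datetime(times, format='%d/%m/%Y %I:%M:%S %p')

# %H reads the hour as written; %p still has to match the text,
# but it does not shift the hour.
with_H = pd.to_datetime(times, format='%d/%m/%Y %H:%M:%S %p')

print(with_I.dt.hour.tolist())
print(with_H.dt.hour.tolist())
```

If 12:06:20 AM should be read as 00:06:20, the %I parse on its own should be enough. For this data set it does not change which rows match, since VIDs 7206 and 7205 never appear as an OID.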
Then select the required columns, drop rows with missing values, and sort:
df1 = df[['Date1','Snum', 'OID']].sort_values('Date1').dropna(subset=['OID'])
df1['OID'] = df1['OID'].astype(np.int64)
print (df1)
Date1 Snum OID
0 2018-01-01 05:59:01 72 7214
1 2018-01-01 06:04:33 72 7208
2 2018-01-01 06:21:07 72 7216
3 2018-01-01 06:29:01 80 8026
4 2018-01-01 06:30:54 72 7218
5 2018-01-01 06:33:54 72 7221
6 2018-01-01 06:35:26 80 8018
7 2018-01-01 09:38:34 72 7211
8 2018-01-01 13:39:45 72 7209
df2 = df[['Date2','VID']].sort_values('Date2').dropna(subset=['VID'])
df2['VID'] = df2['VID'].astype(np.int64)
print (df2)
Date2 VID
2 2018-01-01 05:53:49 7220
3 2018-01-01 05:59:10 7214
4 2018-01-01 06:04:55 7208
5 2018-01-01 06:21:17 7216
6 2018-01-01 06:31:04 7218
0 2018-01-01 12:06:20 7206
1 2018-01-01 12:36:31 7205
7 2018-01-01 13:40:38 7209
df3 = pd.merge_asof(df1,
                    df2,
                    left_on='Date1',
                    right_on='Date2',
                    left_by='OID',
                    right_by='VID',
                    direction='forward')
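The question asks for the closest TTime, while direction='forward' only considers times at or after ArrTime. A small sketch of the difference, using made-up rows (not the question's data); the one-minute tolerance is just an assumption about how large a gap should still count as a match:

```python
import pandas as pd

# Toy data for illustration only; both frames must be sorted by their time key.
left = pd.DataFrame({
    'Date1': pd.to_datetime(['2018-01-01 05:59:01',
                             '2018-01-01 06:29:01',
                             '2018-01-01 09:38:34']),
    'OID': [7214, 8026, 7209],
})
right = pd.DataFrame({
    'Date2': pd.to_datetime(['2018-01-01 05:59:10',
                             '2018-01-01 09:30:00',
                             '2018-01-01 13:40:38']),
    'VID': [7214, 7209, 7209],
})

keys = dict(left_on='Date1', right_on='Date2', left_by='OID', right_by='VID')

# 'forward': first Date2 at or after Date1 within the same OID/VID group.
forward = pd.merge_asof(left, right, direction='forward', **keys)

# 'nearest': closest Date2 on either side of Date1.
nearest = pd.merge_asof(left, right, direction='nearest', **keys)

# 'forward' again, but rejecting matches more than one minute away.
forward_tol = pd.merge_asof(left, right, direction='forward',
                            tolerance=pd.Timedelta('1min'), **keys)

print(forward)
print(nearest)
print(forward_tol)
```

In this toy data OID 7209 is paired with 13:40:38 by 'forward' and with 09:30:00 by 'nearest', and it gets no match once the tolerance is applied. OID 8026 has no VID counterpart, so it stays unmatched in all three results.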
Finally, drop the rows with no match and convert the VID column to integers:
df3 = df3.dropna(subset=['VID'])
df3['VID'] = df3['VID'].astype(int)
print (df3)
Date1 Snum OID Date2 VID
0 2018-01-01 05:59:01 72 7214 2018-01-01 05:59:10 7214
1 2018-01-01 06:04:33 72 7208 2018-01-01 06:04:55 7208
2 2018-01-01 06:21:07 72 7216 2018-01-01 06:21:17 7216
4 2018-01-01 06:30:54 72 7218 2018-01-01 06:31:04 7218
8 2018-01-01 13:39:45 72 7209 2018-01-01 13:40:38 7209
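The question also asks for the unmatched rows in a separate file. One way this might be done, as a sketch rather than a definitive solution: keep the merge_asof result before dropping anything, then split it on whether VID is missing. The file names matched.csv and unmatched.csv, the temporary output directory, and parsing TTime with %I alone (after stripping the stray trailing space in the first row) are my assumptions:

```python
import io
import tempfile
from pathlib import Path

import pandas as pd

# Sample data copied from the question.
raw = """BDate,Snum,ArrTime,OID,TDate,TTime,VID
1/1/2018,72,05:59:01,7214,1/1/2018,12:06:20 AM ,7206
1/1/2018,72,06:04:33,7208,1/1/2018,12:36:31 AM,7205
1/1/2018,72,06:21:07,7216,1/1/2018,5:53:49 AM,7220
1/1/2018,80,06:29:01,8026,1/1/2018,5:59:10 AM,7214
1/1/2018,72,06:30:54,7218,1/1/2018,6:04:55 AM,7208
1/1/2018,72,06:33:54,7221,1/1/2018,06:21:17 AM,7216
1/1/2018,80,06:35:26,8018,1/1/2018,06:31:04 AM,7218
1/1/2018,72,09:38:34,7211,1/1/2018,1:40:38 PM,7209
1/1/2018,72,13:39:45,7209,,,
"""
df = pd.read_csv(io.StringIO(raw))

# Build full timestamps for both sides; missing TDate/TTime become NaT.
df['Date1'] = pd.to_datetime(df['BDate'] + ' ' + df['ArrTime'],
                             format='%d/%m/%Y %H:%M:%S')
df['Date2'] = pd.to_datetime(df['TDate'] + ' ' + df['TTime'].str.strip(),
                             format='%d/%m/%Y %I:%M:%S %p')

# Left side: arrivals. Right side: the T* records that actually have a VID.
left = df[['BDate', 'Snum', 'ArrTime', 'OID', 'Date1']].sort_values('Date1')
right = (df[['TDate', 'TTime', 'VID', 'Date2']]
         .dropna(subset=['VID'])
         .assign(VID=lambda d: d['VID'].astype('int64'))
         .sort_values('Date2'))

merged = pd.merge_asof(left, right,
                       left_on='Date1', right_on='Date2',
                       left_by='OID', right_by='VID',
                       direction='forward')

# Rows without a partner have a missing VID after the merge.
has_match = merged['VID'].notna()
matched = merged[has_match].assign(VID=lambda d: d['VID'].astype('int64'))
unmatched = merged[~has_match]

# Hypothetical output location and file names.
out_dir = Path(tempfile.mkdtemp())
matched[['BDate', 'Snum', 'ArrTime', 'OID', 'TDate', 'TTime', 'VID']].to_csv(
    out_dir / 'matched.csv', index=False)
unmatched[['BDate', 'Snum', 'ArrTime', 'OID']].to_csv(
    out_dir / 'unmatched.csv', index=False)

print(matched['OID'].tolist())
print(unmatched['OID'].tolist())
```

On the sample data this should leave OIDs 8026, 7221, 8018 and 7211 in the unmatched file, because none of them has a VID counterpart.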