I have two DataFrames that I need to merge, and I need to add a column indicating whether each record was accepted.
This is what I have:
dfa[dfa.CONTROL.isin([334030860978638])]
Out[107]:
CONTROL A B DATE_HOUR
1629136 334030860978638 525562414612 52447860015000 2015-08-02 16:32:00
1629137 334030860978638 525562414612 52447860015000 2015-08-02 16:42:32
1629138 334030860978638 525562414612 52447860015000 2015-08-02 18:33:12
1629139 334030860978638 525562414612 52447860015000 2015-08-03 19:40:19
dfb[dfb.control.isin([334030860978638])]
Out[108]:
control a b date_hour
id
299366338 334030860978638 525562414612 447860015000 2015-08-02 16:33:08
299392621 334030860978638 525562414612 447860015000 2015-08-02 16:43:40
299665465 334030860978638 525562414612 447860015000 2015-08-02 18:34:21
view = dfa.merge(dfb, left_on=['CONTROL', 'A', 'B'],
                 right_on=['control', 'a', 'b'], how='outer')
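I need to compare DATE_HOUR with date_hour and treat a record as a match if it falls within a time window, for example 3600 seconds. If several records fall inside the window, I want to keep only the closest one. The result should be flagged in a new column, accepted, set to True when a match was found and False otherwise.
My expected output: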
CONTROL A B DATE_HOUR control a b date_hour accepted
334030860978638 525562414612 52447860015000 2015-08-02 16:32:00 334030860978638 525562414612 52447860015000 2015-08-02 16:32:08 True
334030860978638 525562414612 52447860015000 2015-08-02 16:42:32 334030860978638 525562414612 52447860015000 2015-08-02 16:43:40 True
334030860978638 525562414612 52447860015000 2015-08-02 18:33:12 334030860978638 525562414612 52447860015000 2015-08-02 18:34:21 True
334030860978638 525562414612 52447860015000 2015-08-03 19:40:19 NaN NaN NaN NaT False
Can I use the apply method for this task? Could someone help me do this the right way with pandas?
Answer 0 (score: 0)
This similar problem helped me solve mine.
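import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def nearest(group, match, groupname, lname, rname, name_field_diff='diff_minutes'):
    # Keep only the right-hand rows that share this group's key.
    match = match[match[groupname] == group.name]
    group = group.copy()
    try:
        # Fit on the right-hand timestamps (as int64 nanoseconds), then look up
        # the single nearest one for every left-hand timestamp.
        nbrs = NearestNeighbors(n_neighbors=1).fit(match[rname].values.astype('int64')[:, None])
        dist, ind = nbrs.kneighbors(group[lname].values.astype('int64')[:, None])
        group[rname] = match[rname].values[ind.ravel()]
        time_diff = (group[rname] - group[lname]) / np.timedelta64(1, 'm')
        group[name_field_diff] = time_diff.abs()
    except ValueError:
        # No candidate rows for this group: return it without the extra columns.
        pass
    return group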
d1 = [{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:32:00'},
      {'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:42:32'},
      {'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 18:33:12'},
      {'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 19:40:19'}]
d2 = [{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:33:08'},
      {'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:43:40'},
      {'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 18:34:21'}]
df1 = pd.DataFrame(d1)
df1.DATE_HOUR = pd.to_datetime(df1.DATE_HOUR, format='%Y-%m-%d %H:%M:%S')
df2 = pd.DataFrame(d2)
df2.date_hour = pd.to_datetime(df2.date_hour, format='%Y-%m-%d %H:%M:%S')
view = df1.groupby('CONTROL').apply(nearest, df2, 'control', 'DATE_HOUR', 'date_hour')
view
A B CONTROL DATE_HOUR date_hour diff_minutes
0 525562414612 52447860015000 334030860978638 2015-08-02 16:32:00 2015-08-02 16:33:08 1.133333
1 525562414612 52447860015000 334030860978638 2015-08-02 16:42:32 2015-08-02 16:43:40 1.133333
2 525562414612 52447860015000 334030860978638 2015-08-02 18:33:12 2015-08-02 18:34:21 1.150000
3 525562414612 52447860015000 334030860978638 2015-08-02 19:40:19 2015-08-02 18:34:21 65.966667
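The question asked for an accepted column, and the diff_minutes values above are enough to build one. The following is only a minimal sketch: the small view frame is a hypothetical stand-in that reproduces the timestamps from the output above, and max_gap_minutes is an assumed name for the 3600-second window mentioned in the question.

```python
import pandas as pd

# Hypothetical stand-in for the grouped result shown above; only the two
# timestamp columns are needed to reproduce diff_minutes.
view = pd.DataFrame({
    'DATE_HOUR': pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 16:42:32',
                                 '2015-08-02 18:33:12', '2015-08-02 19:40:19']),
    'date_hour': pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 16:43:40',
                                 '2015-08-02 18:34:21', '2015-08-02 18:34:21']),
})
view['diff_minutes'] = (view['date_hour'] - view['DATE_HOUR']).abs() / pd.Timedelta(minutes=1)

max_gap_minutes = 60  # assumed threshold: the 3600 seconds from the question

# A row is accepted when its nearest candidate lies inside the window.
view['accepted'] = view['diff_minutes'] < max_gap_minutes

# Blank out the right-hand timestamp where the nearest candidate is too far away.
view.loc[~view['accepted'], 'date_hour'] = pd.NaT
print(view)
```

With this sample data the first three rows come out as accepted and the last one does not, since its nearest candidate is about 66 minutes away.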
Now I filter on the gap to find which records do not fit.
df1[df1.index.isin(view[(view.diff_minutes >= 60)].index)]
A B CONTROL DATE_HOUR
3 525562414612 52447860015000 334030860978638 2015-08-02 19:40:19
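As an alternative that stays inside pandas, the same nearest-in-time match can probably be sketched with pd.merge_asof, which is not what I used above. This is a sketch under assumptions: the frames are rebuilt from the sample data in this answer, the 3600-second window from the question is passed as tolerance, and the accepted column is derived from whether a match was found.

```python
import pandas as pd

df1 = pd.DataFrame({
    'CONTROL': [334030860978638] * 4,
    'A': [525562414612] * 4,
    'B': [52447860015000] * 4,
    'DATE_HOUR': pd.to_datetime(['2015-08-02 16:32:00', '2015-08-02 16:42:32',
                                 '2015-08-02 18:33:12', '2015-08-02 19:40:19']),
})
df2 = pd.DataFrame({
    'control': [334030860978638] * 3,
    'a': [525562414612] * 3,
    'b': [52447860015000] * 3,
    'date_hour': pd.to_datetime(['2015-08-02 16:33:08', '2015-08-02 16:43:40',
                                 '2015-08-02 18:34:21']),
})

# merge_asof requires both frames to be sorted by their time key.
left = df1.sort_values('DATE_HOUR')
right = df2.sort_values('date_hour')

result = pd.merge_asof(
    left, right,
    left_on='DATE_HOUR', right_on='date_hour',
    left_by=['CONTROL', 'A', 'B'], right_by=['control', 'a', 'b'],
    direction='nearest',                   # closest record before or after
    tolerance=pd.Timedelta(seconds=3600),  # ignore candidates outside the window
)

# Rows without a candidate inside the window keep NaT in date_hour.
result['accepted'] = result['date_hour'].notna()
print(result)
```

One caveat with this approach: each left row is matched independently, so a single row of df2 can be picked as the nearest record for more than one row of df1.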