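Pandas merge only when a condition is true

Time: 2015-11-24 20:02:55

Tags: python pandas

I have two DataFrames that I need to merge, and I need to add a column that says whether each record was accepted.

I have this: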

dfa[dfa.CONTROL.isin([334030860978638])]

Out[107]:
            CONTROL             A               B               DATE_HOUR
1629136     334030860978638     525562414612    52447860015000  2015-08-02 16:32:00
1629137     334030860978638     525562414612    52447860015000  2015-08-02 16:42:32
1629138     334030860978638     525562414612    52447860015000  2015-08-02 18:33:12
1629139     334030860978638     525562414612    52447860015000  2015-08-03 19:40:19


dfb[dfb.control.isin([334030860978638])]

Out[108]:
            control             a               b               date_hour
id              
299366338   334030860978638     525562414612    447860015000    2015-08-02 16:33:08
299392621   334030860978638     525562414612    447860015000    2015-08-02 16:43:40
299665465   334030860978638     525562414612    447860015000    2015-08-02 18:34:21

view = dfa.merge(dfb, left_on=['CONTROL', 'A', 'B'],
                right_on=['control', 'a', 'b'], how='outer')

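I need to compare DATE_HOUR with date_hour and check whether the records fall within a time window, for example 3600 seconds. If more than one record falls inside the window, I want to keep the nearest one. The result should go in a new column, accepted, set to True when a match is found and False otherwise.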
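The rule described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample frames are rebuilt by hand from the rows shown earlier (with b set equal to B so the keys match), and the names left_id, gap_s, pairs, closest and result are hypothetical helpers, not part of the original data.

```python
import pandas as pd

# Hypothetical sample data modelled on the frames shown above.
dfa = pd.DataFrame({
    'CONTROL': [334030860978638] * 4,
    'A': [525562414612] * 4,
    'B': [52447860015000] * 4,
    'DATE_HOUR': pd.to_datetime([
        '2015-08-02 16:32:00', '2015-08-02 16:42:32',
        '2015-08-02 18:33:12', '2015-08-03 19:40:19']),
})
dfb = pd.DataFrame({
    'control': [334030860978638] * 3,
    'a': [525562414612] * 3,
    'b': [52447860015000] * 3,
    'date_hour': pd.to_datetime([
        '2015-08-02 16:33:08', '2015-08-02 16:43:40',
        '2015-08-02 18:34:21']),
})

# Tag each left row so it can be identified after the merge.
left = dfa.reset_index().rename(columns={'index': 'left_id'})

# Pair every left row with every right row that shares the same keys.
pairs = left.merge(dfb, left_on=['CONTROL', 'A', 'B'],
                   right_on=['control', 'a', 'b'], how='left')

# Absolute gap in seconds between the two timestamps (NaN when unmatched).
pairs['gap_s'] = (pairs['date_hour'] - pairs['DATE_HOUR']).abs().dt.total_seconds()

# Keep only the closest candidate for each left row.
closest = (pairs.sort_values('gap_s')
                .drop_duplicates('left_id')
                .sort_values('left_id'))

# A candidate is accepted only if it lies inside the 3600-second window.
closest['accepted'] = closest['gap_s'] <= 3600
result = closest.drop(columns=['left_id']).reset_index(drop=True)
print(result[['DATE_HOUR', 'date_hour', 'gap_s', 'accepted']])
```

Note that this pairs every left row with every right row sharing the same keys, so the intermediate frame can grow large when a key has many records on both sides.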

My expected output:

 CONTROL            A               B               DATE_HOUR           control             a               b               date_hour           accepted
 334030860978638    525562414612    52447860015000  2015-08-02 16:32:00 334030860978638     525562414612    52447860015000  2015-08-02 16:32:08 True
 334030860978638    525562414612    52447860015000  2015-08-02 16:42:32 334030860978638     525562414612    52447860015000  2015-08-02 16:43:40 True
 334030860978638    525562414612    52447860015000  2015-08-02 18:33:12 334030860978638     525562414612    52447860015000  2015-08-02 18:34:21 True
 334030860978638    525562414612    52447860015000  2015-08-03 19:40:19 NaN                 NaN             NaN             NaT                 False

Can I use the apply method for this task? Could someone help me do this the right way with pandas?

1 Answer:

Answer 0: (score: 0)

A similar problem helped me solve mine.

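import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def nearest(group, match, groupname, lname, rname, name_field_diff='diff_minutes'):
    # Keep only the right-hand rows that belong to the current group key.
    match = match[match[groupname] == group.name]
    if match.empty:
        return group  # nothing to match against; leave the group unchanged
    group = group.copy()
    # scikit-learn needs numeric input, so express both time columns as
    # seconds elapsed since a common origin.
    origin = match[rname].min()
    right = ((match[rname] - origin) / np.timedelta64(1, 's')).values[:, None]
    left = ((group[lname] - origin) / np.timedelta64(1, 's')).values[:, None]
    nbrs = NearestNeighbors(n_neighbors=1).fit(right)
    dist, ind = nbrs.kneighbors(left)
    # Attach the nearest right-hand timestamp and the absolute gap in minutes.
    group[rname] = match[rname].values[ind.ravel()]
    time_diff = (group[rname] - group[lname]) / np.timedelta64(1, 'm')
    group[name_field_diff] = time_diff.abs()
    return group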

d1 = [{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:32:00'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 16:42:32'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 18:33:12'},
{'CONTROL':334030860978638, 'A': 525562414612, 'B': 52447860015000, 'DATE_HOUR': '2015-08-02 19:40:19'}]

d2 = [{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:33:08'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 16:43:40'},
{'control':334030860978638, 'a': 525562414612, 'b': 52447860015000, 'date_hour': '2015-08-02 18:34:21'}]

df1 = pd.DataFrame(d1)
df1.DATE_HOUR = pd.to_datetime(df1.DATE_HOUR, format='%Y-%m-%d %H:%M:%S')

df2 = pd.DataFrame(d2)
df2.date_hour = pd.to_datetime(df2.date_hour, format='%Y-%m-%d %H:%M:%S')

view = df1.groupby('CONTROL', group_keys=False).apply(nearest, df2, 'control', 'DATE_HOUR', 'date_hour')
view

    A               B               CONTROL             DATE_HOUR               date_hour               diff_minutes
0   525562414612    52447860015000  334030860978638     2015-08-02 16:32:00     2015-08-02 16:33:08     1.133333
1   525562414612    52447860015000  334030860978638     2015-08-02 16:42:32     2015-08-02 16:43:40     1.133333
2   525562414612    52447860015000  334030860978638     2015-08-02 18:33:12     2015-08-02 18:34:21     1.150000
3   525562414612    52447860015000  334030860978638     2015-08-02 19:40:19     2015-08-02 18:34:21     65.966667

Now I filter on my gap threshold (60 minutes) to find the records that do not fit.

df1[df1.index.isin(view[(view.diff_minutes >= 60)].index)]

    A               B               CONTROL             DATE_HOUR
3   525562414612    52447860015000  334030860978638     2015-08-02 19:40:19
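
As an alternative to the nearest-neighbour approach, a similar result might be obtained with pandas alone through pd.merge_asof, which is available in pandas 0.20 and later. The sketch below is only an illustration: the frames are rebuilt by hand from d1 and d2 above, and the names left, right and matched are hypothetical. It assumes the 3600-second window from the question, passed as tolerance, and treats a missing date_hour as "not accepted".

```python
import pandas as pd

# Hypothetical sample data mirroring d1 and d2 above.
df1 = pd.DataFrame({
    'CONTROL': [334030860978638] * 4,
    'A': [525562414612] * 4,
    'B': [52447860015000] * 4,
    'DATE_HOUR': pd.to_datetime([
        '2015-08-02 16:32:00', '2015-08-02 16:42:32',
        '2015-08-02 18:33:12', '2015-08-02 19:40:19']),
})
df2 = pd.DataFrame({
    'control': [334030860978638] * 3,
    'a': [525562414612] * 3,
    'b': [52447860015000] * 3,
    'date_hour': pd.to_datetime([
        '2015-08-02 16:33:08', '2015-08-02 16:43:40',
        '2015-08-02 18:34:21']),
})

# merge_asof requires both frames to be sorted by their time keys.
left = df1.sort_values('DATE_HOUR')
right = df2.sort_values('date_hour')

# For each left row, take the nearest right row with the same keys,
# but only if it is at most 3600 seconds away.
matched = pd.merge_asof(
    left, right,
    left_on='DATE_HOUR', right_on='date_hour',
    left_by=['CONTROL', 'A', 'B'], right_by=['control', 'a', 'b'],
    direction='nearest',
    tolerance=pd.Timedelta(seconds=3600),
)

# Rows without a match inside the window have NaT in date_hour.
matched['accepted'] = matched['date_hour'].notna()
print(matched[['DATE_HOUR', 'date_hour', 'accepted']])
```

With this data the last row (19:40:19) is about 66 minutes from its nearest candidate, so it falls outside the window and is flagged as not accepted.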