Compare dates of values in two DataFrames to determine the output

Time: 2017-01-21 05:40:32

Tags: pandas dataframe

This is DataFrame 1:

    Date    Serial Number   Type
0   2014-12-17  1N4AL2EP8DC270200   New
1   2015-10-28  1N4AL2EP8DC270200   Used
2   2015-01-22  1N4AL3AP1EN239307   New
3   2015-11-22  1N4AL3AP1EN239307   Used
4   2015-05-22  1N4AL3AP1FC235402   New
5   2016-12-02  1N4AL3AP1FC235402   Used
6   2015-01-22  1N4AL3AP2FC213098   New
7   2016-05-13  1N4AL3AP2FC213098   Used
8   2014-05-14  1N4AL3AP3EC132416   New
9   2016-04-07  1N4AL3AP3EC132416   Used
10  2014-05-24  1N4AL3AP5EC316644   New
11  2014-12-18  1N4AL3AP5EC316644   Used
12  2014-12-11  1N4AL3AP6EC322517   New
13  2015-10-04  1N4AL3AP6EC322517   Used
14  2016-06-06  1N4AL3AP6EC322517   Used
...

This is DataFrame 2:

    Date    Serial Number
0   2014-03-12  5N1AA08C78N611573
1   2014-03-12  JN8AS5MT3EW604277
2   2014-03-12  1N6AF0LX5DN114710
3   2014-03-12  1N4AL3AP8DN447876
4   2014-03-12  JN8AZ1MU8AW021145
5   2014-03-12  JN1AZ4EH0AM500138
6   2014-03-12  JN8AF5MR3BT013548
7   2014-03-12  3N1AB61E17L629049
8   2014-03-12  3N1BC13E87L368844
9   2014-03-13  1N6AD07W95C431183
10  2014-03-13  1N6AA07A25N543180
11  2014-03-13  1N4CL2AP1BC110185
12  2014-03-13  JN8AZ1MW1BW181306
13  2014-03-13  5N1BV28U46N116791
...

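These are only samples of the DataFrames, not the whole DataFrames. For each Serial Number in DataFrame 1, I need to retrieve the first date whose Type is Used (for example, for Serial Number '1N4AL3AP6EC322517', 2015-10-04 is the date I am looking for). I then need to compare this date with the date recorded for the same Serial Number in DataFrame 2: if the date in DataFrame 2 is earlier than the date in DataFrame 1, label it 'A'; otherwise label it 'B'.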

I have to do this for more than 2000 Serial Numbers. Is there an efficient way to do it?

1 Answer:

Answer 0: (score: 0)

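I think you can use merge_asof. By default it matches each row of the left frame with the most recent row of the right frame whose Date is on or before it, and with by it does so within the same Serial Number: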

print (df2)
         Date      Serial Number
0  2016-03-12  1N4AL3AP6EC322517
1  2013-03-12  1N4AL3AP5EC316644
2  2014-03-12  1N4AL3AP3EC132416
3  2016-08-12  1N4AL3AP2FC213098
4  2014-03-12  JN8AZ1MU8AW021145
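
To run the steps that follow, df1 and df2 have to exist first. A minimal sketch that rebuilds them from the rows shown so far (df1 from the DataFrame 1 sample in the question, df2 from the print just above); building them from dicts is only an assumption about how the data gets loaded:

```python
import pandas as pd

# Rows copied from the DataFrame 1 sample in the question
df1 = pd.DataFrame({
    'Date': ['2014-12-17', '2015-10-28', '2015-01-22', '2015-11-22',
             '2015-05-22', '2016-12-02', '2015-01-22', '2016-05-13',
             '2014-05-14', '2016-04-07', '2014-05-24', '2014-12-18',
             '2014-12-11', '2015-10-04', '2016-06-06'],
    'Serial Number': ['1N4AL2EP8DC270200', '1N4AL2EP8DC270200',
                      '1N4AL3AP1EN239307', '1N4AL3AP1EN239307',
                      '1N4AL3AP1FC235402', '1N4AL3AP1FC235402',
                      '1N4AL3AP2FC213098', '1N4AL3AP2FC213098',
                      '1N4AL3AP3EC132416', '1N4AL3AP3EC132416',
                      '1N4AL3AP5EC316644', '1N4AL3AP5EC316644',
                      '1N4AL3AP6EC322517', '1N4AL3AP6EC322517',
                      '1N4AL3AP6EC322517'],
    # six New/Used pairs, then New, Used, Used for the last serial
    'Type': ['New', 'Used'] * 6 + ['New', 'Used', 'Used'],
})

# Rows copied from the df2 print in this answer
df2 = pd.DataFrame({
    'Date': ['2016-03-12', '2013-03-12', '2014-03-12',
             '2016-08-12', '2014-03-12'],
    'Serial Number': ['1N4AL3AP6EC322517', '1N4AL3AP5EC316644',
                      '1N4AL3AP3EC132416', '1N4AL3AP2FC213098',
                      'JN8AZ1MU8AW021145'],
})
```

The dates are left as strings here, so the to_datetime cast in the next step is still needed.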

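import pandas as pd

# cast the Date columns to datetime if they are not datetime already
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
# keep the first 'Used' row per Serial Number
# (this is the earliest date only if df1 is ordered by Date within each Serial Number)
df = df1[df1['Type'] == 'Used'].drop_duplicates(['Serial Number'])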
print (df)
         Date      Serial Number  Type
1  2015-10-28  1N4AL2EP8DC270200  Used
3  2015-11-22  1N4AL3AP1EN239307  Used
5  2016-12-02  1N4AL3AP1FC235402  Used
7  2016-05-13  1N4AL3AP2FC213098  Used
9  2016-04-07  1N4AL3AP3EC132416  Used
11 2014-12-18  1N4AL3AP5EC316644  Used
13 2015-10-04  1N4AL3AP6EC322517  Used
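
drop_duplicates keeps the first row in the frame's current order, so the step above returns the earliest Used date only when df1 is already ordered by Date within each Serial Number. A sketch that does not depend on row order, shown on a small hypothetical frame (`demo`, with deliberately shuffled rows) rather than the real data:

```python
import pandas as pd

# Hypothetical three-row frame; the later 'Used' row comes first on purpose
demo = pd.DataFrame({
    'Date': pd.to_datetime(['2016-06-06', '2014-12-11', '2015-10-04']),
    'Serial Number': ['1N4AL3AP6EC322517'] * 3,
    'Type': ['Used', 'New', 'Used'],
})

first_used = (demo[demo['Type'] == 'Used']
              .sort_values('Date')                # earliest date first
              .drop_duplicates('Serial Number')   # keep='first' is the default
              .reset_index(drop=True))
print(first_used)
```

Sorting before dropping duplicates costs one extra sort, which should be negligible for a few thousand Serial Numbers.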

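# tag every df2 row with 'B'; merge_asof searches backward by default, so the tag is
# attached only where df2 has a Date on or before the df1 Date for the same Serial Number
df2['Mark'] = 'B'
df = pd.merge_asof(df.sort_values(['Date']),
                   df2.sort_values(['Date']), on='Date', by='Serial Number')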
print (df)
        Date      Serial Number  Type Mark
0 2014-12-18  1N4AL3AP5EC316644  Used    B
1 2015-10-04  1N4AL3AP6EC322517  Used  NaN
2 2015-10-28  1N4AL2EP8DC270200  Used  NaN
3 2015-11-22  1N4AL3AP1EN239307  Used  NaN
4 2016-04-07  1N4AL3AP3EC132416  Used    B
5 2016-05-13  1N4AL3AP2FC213098  Used  NaN
6 2016-12-02  1N4AL3AP1FC235402  Used  NaN
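
Which rows receive a Mark here depends on the direction parameter of merge_asof, which defaults to 'backward'. A small sketch of the two directions on hypothetical data (the serials 'X' and 'Y' and the 'hit' tag are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({
    'Date': pd.to_datetime(['2015-10-04', '2014-12-18']),
    'Serial Number': ['X', 'Y'],
})
right = pd.DataFrame({
    'Date': pd.to_datetime(['2016-03-12', '2013-03-12']),
    'Serial Number': ['X', 'Y'],
    'Mark': ['hit', 'hit'],
})

# both frames must be sorted by the 'on' key
left_sorted = left.sort_values('Date')
right_sorted = right.sort_values('Date')

# backward (default): right Date on or before the left Date
back = pd.merge_asof(left_sorted, right_sorted,
                     on='Date', by='Serial Number')
# forward: right Date on or after the left Date
fwd = pd.merge_asof(left_sorted, right_sorted,
                    on='Date', by='Serial Number', direction='forward')
print(back)
print(fwd)
```

In the backward merge only 'Y' is matched (its right-hand date is earlier); in the forward merge only 'X' is matched (its right-hand date is later).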
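# Serial Numbers present in df2 but still unmatched have a later df2 Date: mark them 'A'
# (Serial Numbers that do not appear in df2 at all stay NaN)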
mask = df['Serial Number'].isin(df2['Serial Number'])
df.loc[mask, 'Mark'] = df.loc[mask, 'Mark'].fillna('A')
print (df)
        Date      Serial Number  Type Mark
0 2014-12-18  1N4AL3AP5EC316644  Used    B
1 2015-10-04  1N4AL3AP6EC322517  Used    A
2 2015-10-28  1N4AL2EP8DC270200  Used  NaN
3 2015-11-22  1N4AL3AP1EN239307  Used  NaN
4 2016-04-07  1N4AL3AP3EC132416  Used    B
5 2016-05-13  1N4AL3AP2FC213098  Used    A
6 2016-12-02  1N4AL3AP1FC235402  Used  NaN
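
The result above assigns 'B' where DataFrame 2 holds a date on or before the first Used date, whereas the question as worded asks for 'A' in that case. A sketch that follows the question's wording with a plain merge and numpy.where. The names `label_serials` and `Date_df2` are hypothetical, leaving Serial Numbers missing from DataFrame 2 unlabeled is an assumption, and it assumes df2 has at most one row per Serial Number:

```python
import numpy as np
import pandas as pd

def label_serials(df1, df2):
    """Mark each first 'Used' date 'A' if df2's date is earlier, else 'B'."""
    first_used = (df1[df1['Type'] == 'Used']
                  .sort_values('Date')
                  .drop_duplicates('Serial Number'))
    # bring the df2 date alongside as a separate column (assumed name)
    out = first_used.merge(df2.rename(columns={'Date': 'Date_df2'}),
                           on='Serial Number', how='left')
    out['Mark'] = np.where(out['Date_df2'] < out['Date'], 'A', 'B')
    # no df2 row means nothing to compare against: leave the mark missing
    out['Mark'] = out['Mark'].where(out['Date_df2'].notna())
    return out

# Three serials taken from the samples above
df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2014-12-11', '2015-10-04', '2014-05-24',
                            '2014-12-18', '2014-12-17', '2015-10-28']),
    'Serial Number': (['1N4AL3AP6EC322517'] * 2 + ['1N4AL3AP5EC316644'] * 2
                      + ['1N4AL2EP8DC270200'] * 2),
    'Type': ['New', 'Used'] * 3,
})
df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2016-03-12', '2013-03-12']),
    'Serial Number': ['1N4AL3AP6EC322517', '1N4AL3AP5EC316644'],
})

result = label_serials(df1, df2)
print(result[['Serial Number', 'Date', 'Date_df2', 'Mark']])
```

With these rows, '1N4AL3AP5EC316644' gets 'A' (2013-03-12 is earlier than 2014-12-18), '1N4AL3AP6EC322517' gets 'B', and '1N4AL2EP8DC270200' stays unlabeled because it has no row in df2. Everything is vectorized, so a few thousand Serial Numbers should not be a problem.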