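I have two CSV files, shown below. I need a Difference column for rows whose Occur_Date falls within the date range from Book_Date to App_Date, where Difference = the difference in days between App_Date and Occur_Date.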
csv_1
Mobile_Number Book_Date App_Date
503477334 2018-10-12 2018-10-18
506002884 2018-10-12 2018-10-19
501022162 2018-10-12 2018-10-16
503487338 2018-10-13 2018-10-13
506012887 2018-10-13 2018-10-21
503427339 2018-10-14 2018-10-17
csv_2
Mobile_Number Occur_Date
503477334 2018-10-16
506002884 2018-10-21
501022162 2018-10-15
503487338 2018-10-13
501428449 2018-10-18
506012887 2018-10-14
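I want to add a new column to csv_1. If the mobile number appears in csv_2 with an Occur_Date inside the range from Book_Date to App_Date, the value should be the difference between App_Date and Occur_Date; if it does not occur within that range, the value should be NaN. The output should be:
Output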
Mobile_Number Book_Date App_Date Difference
503477334 2018-10-12 2018-10-18 2
506002884 2018-10-12 2018-10-19 -2
501022162 2018-10-12 2018-10-16 1
503487338 2018-10-13 2018-10-13 0
506012887 2018-10-13 2018-10-21 7
503427339 2018-10-14 2018-10-17 NaN
Edit:
What if I want to match on both the category and the mobile_number in the two CSV files above? How can I do that?
csv_1
Category Mobile_Number Book_Date App_Date
A 503477334 2018-10-12 2018-10-18
B 503477334 2018-10-07 2018-10-16
C 501022162 2018-10-12 2018-10-16
A 503487338 2018-10-13 2018-10-13
C 506012887 2018-10-13 2018-10-21
E 503427339 2018-10-14 2018-10-17
csv_2
Category Mobile_Number Occur_Date
A 503477334 2018-10-16
B 503477334 2018-10-13
A 501022162 2018-10-15
A 503487338 2018-10-13
F 501428449 2018-10-18
C 506012887 2018-10-14
I want the output matched on both Mobile_Number and Category.
Output
Category Mobile_Number Book_Date App_Date Difference
A 503477334 2018-10-12 2018-10-18 2
B 503477334 2018-10-07 2018-10-16 3
C 501022162 2018-10-12 2018-10-16 NaN
A 503487338 2018-10-13 2018-10-13 0
C 506012887 2018-10-13 2018-10-21 7
E 503427339 2018-10-14 2018-10-17 NaN
Answer 0 (score: 2)
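Use Series.map to build a new Series of Occur_Date values matched by Mobile_Number, use Series.between to test whether those values lie between the two date columns, then use numpy.where to assign values by the resulting mask: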
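import numpy as np
import pandas as pd

# parse the date columns so they can be compared and subtracted
df1['Book_Date'] = pd.to_datetime(df1['Book_Date'])
df1['App_Date'] = pd.to_datetime(df1['App_Date'])
df2['Occur_Date'] = pd.to_datetime(df2['Occur_Date'])
# lookup Series: Mobile_Number -> Occur_Date (first occurrence per number)
s1 = df2.drop_duplicates('Mobile_Number').set_index('Mobile_Number')['Occur_Date']
# Occur_Date aligned to the rows of df1; NaT where the number is missing from df2
s2 = df1['Mobile_Number'].map(s1)
# True where Occur_Date lies within [Book_Date, App_Date], both ends inclusive
m = s2.between(df1['Book_Date'], df1['App_Date'])
# solution without the range check
df1['Difference1'] = df1['App_Date'].sub(s2).dt.days
# solution with the range check
df1['Difference2'] = np.where(m, df1['App_Date'].sub(s2).dt.days, np.nan)
print(df1)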
Mobile_Number Book_Date App_Date Difference Difference1 Difference2
0 503477334 2018-10-12 2018-10-18 2018-10-16 2.0 2.0
1 506002884 2018-10-12 2018-10-19 2018-10-21 -2.0 NaN
2 501022162 2018-10-12 2018-10-16 2018-10-15 1.0 1.0
3 503487338 2018-10-13 2018-10-13 2018-10-13 0.0 0.0
4 506012887 2018-10-13 2018-10-21 2018-10-14 7.0 7.0
5 503427339 2018-10-14 2018-10-17 NaT NaN NaN
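For anyone who wants to run this without the original files, here is a minimal self-contained sketch of the same map / between / numpy.where approach. The inline CSV strings just reproduce the sample rows from the question, and the variable names (occur_by_number, occur, in_range, days) are my own choices for illustration, not anything required by pandas.

```python
import io

import numpy as np
import pandas as pd

# Sample rows copied from the question; real code would call pd.read_csv on the files.
CSV_1 = """Mobile_Number,Book_Date,App_Date
503477334,2018-10-12,2018-10-18
506002884,2018-10-12,2018-10-19
501022162,2018-10-12,2018-10-16
503487338,2018-10-13,2018-10-13
506012887,2018-10-13,2018-10-21
503427339,2018-10-14,2018-10-17
"""
CSV_2 = """Mobile_Number,Occur_Date
503477334,2018-10-16
506002884,2018-10-21
501022162,2018-10-15
503487338,2018-10-13
501428449,2018-10-18
506012887,2018-10-14
"""

df1 = pd.read_csv(io.StringIO(CSV_1), parse_dates=["Book_Date", "App_Date"])
df2 = pd.read_csv(io.StringIO(CSV_2), parse_dates=["Occur_Date"])

# One Occur_Date per number, indexed by Mobile_Number, so it can be used with map.
occur_by_number = (
    df2.drop_duplicates("Mobile_Number").set_index("Mobile_Number")["Occur_Date"]
)
# Occur_Date aligned to df1 rows; NaT for numbers that are absent from df2.
occur = df1["Mobile_Number"].map(occur_by_number)
# Inclusive range test; rows with NaT evaluate to False.
in_range = occur.between(df1["Book_Date"], df1["App_Date"])
# Day difference, kept only where the range test passed.
days = (df1["App_Date"] - occur).dt.days
df1["Difference"] = np.where(in_range, days, np.nan)

print(df1)
```

With the sample data, the row for 506002884 comes out as NaN because its Occur_Date (2018-10-21) is after App_Date (2018-10-19). If you want -2 there instead, skip the range test and use the plain difference.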
Edit:
You can use merge instead of map to join on 2 columns:
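# parse the date columns so they can be compared and subtracted
df1['Book_Date'] = pd.to_datetime(df1['Book_Date'])
df1['App_Date'] = pd.to_datetime(df1['App_Date'])
df2['Occur_Date'] = pd.to_datetime(df2['Occur_Date'])
# left join keeps every row of df1; unmatched rows get NaT in Occur_Date
df3 = df1.merge(df2, on=['Category','Mobile_Number'], how='left')
print(df3)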
Category Mobile_Number Book_Date App_Date Occur_Date
0 A 503477334 2018-10-12 2018-10-18 2018-10-16
1 B 503477334 2018-10-07 2018-10-16 2018-10-13
2 C 501022162 2018-10-12 2018-10-16 NaT
3 A 503487338 2018-10-13 2018-10-13 2018-10-13
4 C 506012887 2018-10-13 2018-10-21 2018-10-14
5 E 503427339 2018-10-14 2018-10-17 NaT
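# True where Occur_Date lies within [Book_Date, App_Date]; NaT rows give False
m = df3['Occur_Date'].between(df3['Book_Date'], df3['App_Date'])
#print (m)
# difference in days where the mask is True, NaN elsewhere
df3['Difference2'] = np.where(m, df3['App_Date'].sub(df3['Occur_Date']).dt.days, np.nan)
print(df3)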
Category Mobile_Number Book_Date App_Date Occur_Date Difference2
0 A 503477334 2018-10-12 2018-10-18 2018-10-16 2.0
1 B 503477334 2018-10-07 2018-10-16 2018-10-13 3.0
2 C 501022162 2018-10-12 2018-10-16 NaT NaN
3 A 503487338 2018-10-13 2018-10-13 2018-10-13 0.0
4 C 506012887 2018-10-13 2018-10-21 2018-10-14 7.0
5 E 503427339 2018-10-14 2018-10-17 NaT NaN
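One thing to watch with the merge version: if csv_2 can contain several rows for the same Category and Mobile_Number, a left merge will repeat the matching df1 row once per match. Below is a rough sketch that wraps the steps in a helper and drops duplicate keys first, mirroring the drop_duplicates used in the map solution. The function name add_difference is hypothetical, and keeping the first Occur_Date per key is an assumption you may want to change.

```python
import numpy as np
import pandas as pd


def add_difference(df1, df2, keys):
    """Hypothetical helper: left-join df2 onto df1 by `keys`, add a Difference column."""
    # Assumption: keep only the first Occur_Date per key so that the
    # left merge cannot multiply the rows of df1.
    lookup = df2.drop_duplicates(keys)
    out = df1.merge(lookup, on=keys, how="left")
    # Inclusive range test; rows without a match (NaT) evaluate to False.
    in_range = out["Occur_Date"].between(out["Book_Date"], out["App_Date"])
    days = (out["App_Date"] - out["Occur_Date"]).dt.days
    out["Difference"] = np.where(in_range, days, np.nan)
    return out.drop(columns="Occur_Date")


# Sample rows copied from the edited question.
df1 = pd.DataFrame({
    "Category": ["A", "B", "C", "A", "C", "E"],
    "Mobile_Number": [503477334, 503477334, 501022162,
                      503487338, 506012887, 503427339],
    "Book_Date": pd.to_datetime(["2018-10-12", "2018-10-07", "2018-10-12",
                                 "2018-10-13", "2018-10-13", "2018-10-14"]),
    "App_Date": pd.to_datetime(["2018-10-18", "2018-10-16", "2018-10-16",
                                "2018-10-13", "2018-10-21", "2018-10-17"]),
})
df2 = pd.DataFrame({
    "Category": ["A", "B", "A", "A", "F", "C"],
    "Mobile_Number": [503477334, 503477334, 501022162,
                      503487338, 501428449, 506012887],
    "Occur_Date": pd.to_datetime(["2018-10-16", "2018-10-13", "2018-10-15",
                                  "2018-10-13", "2018-10-18", "2018-10-14"]),
})

result = add_difference(df1, df2, keys=["Category", "Mobile_Number"])
print(result)
```

The row for category C and number 501022162 ends up as NaN because csv_2 only has that number under category A, so the two-column join finds no match.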