Question

首先，解决方案在我的代码pandas merge on date column issue

中无效

我有两个来自mysql查询结果的数据框，它们都有＆＃39; captureDate＆＃39;柱。在mysql表中，数据类型是＆＃39; date＆＃39;。在数据框中，数据类型是对象。

df1 [＆＃39; captureDate＆＃39;]数据

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

df2 [＆＃39; captureDate＆＃39;]数据

0    2017-06-28
1    2017-06-28
2    2017-06-28
3    2017-06-28
4    2017-06-28
5    2017-06-28
6    2017-06-28
Name: captureDate, dtype: object

当我比较df1和df2的列时，它返回True

print df1['captureDate'].equals(df2['captureDate'])

我的合并代码

inner = pd.merge(df1, df2,  on='captureDate', how='inner')

但是，结果是错误的，它返回了49行。内部信息是打击：

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 20 columns):
rule_id_x          49 non-null int64
monitor_sites_x    49 non-null object
rule_type_x        49 non-null int64
lower_limit_x      49 non-null int64
upper_limit_x      49 non-null int64
actual_x           49 non-null int64
predict_x          49 non-null int64
captureDate        49 non-null object
deviation_x        49 non-null float32
create_time_x      49 non-null int64
actual_y           49 non-null int64
create_time_y      49 non-null int64
deviation_y        49 non-null object
id                 49 non-null int64
lower_limit_y      49 non-null int64
monitor_sites_y    49 non-null object
predict_y          49 non-null int64
rule_id_y          49 non-null object
rule_type_y        49 non-null int64
upper_limit_y      49 non-null int64

那么，为什么会发生以及如何处理这个问题？

Answer 1

样品：

df1 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
                   'rule_id':[40,10,20,30,70,10,60,10]})
print (df1)
  captureDate  rule_id
0  2017-06-22       40
1  2017-06-22       10
2  2017-06-22       20
3  2017-06-25       30
4  2017-06-25       70
5  2017-06-25       10
6  2017-06-28       60
7  2017-06-28       10
df2 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
                   'rule_id':[1,2,3,4,5,6,7,8]})
print (df2)
  captureDate  rule_id
0  2017-06-22        1
1  2017-06-22        2
2  2017-06-22        3
3  2017-06-25        4
4  2017-06-25        5
5  2017-06-25        6
6  2017-06-28        7
7  2017-06-28        8

首先按to_datetime转换为datetime：

df1['captureDate'] = pd.to_datetime(df1['captureDate'])
df2['captureDate']  = pd.to_datetime(df2['captureDate'])

两列中的问题都是重复的：

print (df1['captureDate'].equals(df2['captureDate']))
True

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
   captureDate  rule_id_x  rule_id_y
0   2017-06-22         40          1
1   2017-06-22         40          2
2   2017-06-22         40          3
3   2017-06-22         10          1
4   2017-06-22         10          2
5   2017-06-22         10          3
6   2017-06-22         20          1
7   2017-06-22         20          2
8   2017-06-22         20          3
9   2017-06-25         30          4
10  2017-06-25         30          5
11  2017-06-25         30          6
12  2017-06-25         70          4
13  2017-06-25         70          5
14  2017-06-25         70          6
15  2017-06-25         10          4
16  2017-06-25         10          5
17  2017-06-25         10          6
18  2017-06-28         60          7
19  2017-06-28         60          8
20  2017-06-28         10          7
21  2017-06-28         10          8

可能的解决方案

将concat与set_index一起使用，然后MultiIndex和map展开join：

df3 = pd.concat([df1.set_index('captureDate'), 
                 df2.set_index('captureDate')], 
                 axis=1, 
                 keys=('a', 'b'))
df3.columns = df3.columns.map('_'.join)
print (df3)
             a_rule_id  b_rule_id
captureDate                      
2017-06-22          40          1
2017-06-22          10          2
2017-06-22          20          3
2017-06-25          30          4
2017-06-25          70          5
2017-06-25          10          6
2017-06-28          60          7
2017-06-28          10          8

或删除drop_duplicates的重复项或captureDate中df的数据汇总：

df1 = df1.drop_duplicates('captureDate')
df2 = df2.drop_duplicates('captureDate')
print (df1)
  captureDate  rule_id
0  2017-06-22       40
3  2017-06-25       30
6  2017-06-28       60

print (df2)
  captureDate  rule_id
0  2017-06-22        1
3  2017-06-25        4
6  2017-06-28        7

inner = pd.merge(df1, df2,  on='captureDate', how='inner')
print (inner)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-25         30          4
2  2017-06-28         60          7

EDIT1：

您可以使用cumcount按captureDate列然后merge来计算重复次数。最后按drop删除了帮助列new：

df1 = pd.DataFrame({'captureDate':['2017-06-22']* 3 + ['2017-06-25']* 3 + ['2017-06-28'] * 2,
                   'rule_id':[40,10,20,30,70,10,60,10]})

df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3,
                   'rule_id':[1,2,3,4,5,6]})


df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()
print (df1)
  captureDate  rule_id  new
0  2017-06-22       40    0
1  2017-06-22       10    1
2  2017-06-22       20    2
3  2017-06-25       30    0
4  2017-06-25       70    1
5  2017-06-25       10    2
6  2017-06-28       60    0
7  2017-06-28       10    1

print (df2)
  captureDate  rule_id  new
0  2017-06-22        1    0
1  2017-06-22        2    1
2  2017-06-22        3    2
3  2017-06-25        4    0
4  2017-06-25        5    1
5  2017-06-25        6    2

df3 = pd.merge(df1, df2, on=['captureDate','new']).drop('new', axis=1)
print (df3)
  captureDate  rule_id_x  rule_id_y
0  2017-06-22         40          1
1  2017-06-22         10          2
2  2017-06-22         20          3
3  2017-06-25         30          4
4  2017-06-25         70          5
5  2017-06-25         10          6

pandas合并日期，如列不起作用

1 个答案: