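First of all, the solution from "pandas merge on date column issue" did not work in my code.
I have two data frames built from MySQL query results, and both have a 'captureDate' column. In the MySQL table the data type is 'date'. In the data frames the dtype is object.
df1['captureDate'] data: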
0 2017-06-28
1 2017-06-28
2 2017-06-28
3 2017-06-28
4 2017-06-28
5 2017-06-28
6 2017-06-28
Name: captureDate, dtype: object
df2['captureDate'] data:
0 2017-06-28
1 2017-06-28
2 2017-06-28
3 2017-06-28
4 2017-06-28
5 2017-06-28
6 2017-06-28
Name: captureDate, dtype: object
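When I compare the columns of df1 and df2, it returns True:
print(df1['captureDate'].equals(df2['captureDate']))
My merge code:
inner = pd.merge(df1, df2, on='captureDate', how='inner')
However, the result is wrong: it returns 49 rows. The output of inner.info() is below: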
<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 48
Data columns (total 20 columns):
rule_id_x 49 non-null int64
monitor_sites_x 49 non-null object
rule_type_x 49 non-null int64
lower_limit_x 49 non-null int64
upper_limit_x 49 non-null int64
actual_x 49 non-null int64
predict_x 49 non-null int64
captureDate 49 non-null object
deviation_x 49 non-null float32
create_time_x 49 non-null int64
actual_y 49 non-null int64
create_time_y 49 non-null int64
deviation_y 49 non-null object
id 49 non-null int64
lower_limit_y 49 non-null int64
monitor_sites_y 49 non-null object
predict_y 49 non-null int64
rule_id_y 49 non-null object
rule_type_y 49 non-null int64
upper_limit_y 49 non-null int64
So why does this happen, and how should I handle it?
Answer 0 (score: 2)
Sample:
df1 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
'rule_id':[40,10,20,30,70,10,60,10]})
print (df1)
captureDate rule_id
0 2017-06-22 40
1 2017-06-22 10
2 2017-06-22 20
3 2017-06-25 30
4 2017-06-25 70
5 2017-06-25 10
6 2017-06-28 60
7 2017-06-28 10
df2 = pd.DataFrame({'captureDate':['2017-06-22'] *3 +['2017-06-25'] * 3 +['2017-06-28'] * 2,
'rule_id':[1,2,3,4,5,6,7,8]})
print (df2)
captureDate rule_id
0 2017-06-22 1
1 2017-06-22 2
2 2017-06-22 3
3 2017-06-25 4
4 2017-06-25 5
5 2017-06-25 6
6 2017-06-28 7
7 2017-06-28 8
First, convert the column to datetime with to_datetime:
df1['captureDate'] = pd.to_datetime(df1['captureDate'])
df2['captureDate'] = pd.to_datetime(df2['captureDate'])
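As a small illustration of the conversion step, the sketch below assumes the MySQL driver hands back DATE values as Python datetime.date objects (an assumption, since the question only shows dtype object). It only demonstrates the dtype change and does not by itself change the merge result.

```python
import datetime

import pandas as pd

# Assumption: DATE values arrive as datetime.date objects,
# which pandas stores in a column with dtype object.
s = pd.Series([datetime.date(2017, 6, 28)] * 3, name='captureDate')
print(s.dtype)  # object

# to_datetime turns the column into a proper datetime64 column.
converted = pd.to_datetime(s)
print(converted.dtype)
```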
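The problem is the duplicate values in both columns. A merge pairs every row of df1 with every row of df2 that has the same captureDate, so in the question 7 identical dates on each side produce 7 * 7 = 49 rows: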
print (df1['captureDate'].equals(df2['captureDate']))
True
inner = pd.merge(df1, df2, on='captureDate', how='inner')
print (inner)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-22 40 2
2 2017-06-22 40 3
3 2017-06-22 10 1
4 2017-06-22 10 2
5 2017-06-22 10 3
6 2017-06-22 20 1
7 2017-06-22 20 2
8 2017-06-22 20 3
9 2017-06-25 30 4
10 2017-06-25 30 5
11 2017-06-25 30 6
12 2017-06-25 70 4
13 2017-06-25 70 5
14 2017-06-25 70 6
15 2017-06-25 10 4
16 2017-06-25 10 5
17 2017-06-25 10 6
18 2017-06-28 60 7
19 2017-06-28 60 8
20 2017-06-28 10 7
21 2017-06-28 10 8
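The 22 rows above can be predicted before merging. The following is a minimal sketch, using the same sample data, that multiplies the per-date row counts of both frames; the variable names are only illustrative.

```python
import pandas as pd

dates = ['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2
df1 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [40, 10, 20, 30, 70, 10, 60, 10]})
df2 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [1, 2, 3, 4, 5, 6, 7, 8]})

# Number of rows per date on each side.
left_counts = df1['captureDate'].value_counts()
right_counts = df2['captureDate'].value_counts()

# An inner merge yields count_left * count_right rows per key;
# a key missing on one side contributes 0 rows.
expected_rows = int(left_counts.mul(right_counts, fill_value=0).sum())

inner = pd.merge(df1, df2, on='captureDate', how='inner')
print(expected_rows, len(inner))  # 3*3 + 3*3 + 2*2 = 22 for both
```

If the predicted count is larger than the number of rows in either input, the key is duplicated on both sides and one of the approaches below is probably needed.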
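Possible solutions
Use concat with set_index, then flatten the resulting MultiIndex columns with map and join. This pairs rows by position and relies on both frames having the same captureDate values in the same order: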
df3 = pd.concat([df1.set_index('captureDate'),
df2.set_index('captureDate')],
axis=1,
keys=('a', 'b'))
df3.columns = df3.columns.map('_'.join)
print (df3)
a_rule_id b_rule_id
captureDate
2017-06-22 40 1
2017-06-22 10 2
2017-06-22 20 3
2017-06-25 30 4
2017-06-25 70 5
2017-06-25 10 6
2017-06-28 60 7
2017-06-28 10 8
Or remove the duplicates with drop_duplicates, or aggregate the data by captureDate in each DataFrame:
df1 = df1.drop_duplicates('captureDate')
df2 = df2.drop_duplicates('captureDate')
print (df1)
captureDate rule_id
0 2017-06-22 40
3 2017-06-25 30
6 2017-06-28 60
print (df2)
captureDate rule_id
0 2017-06-22 1
3 2017-06-25 4
6 2017-06-28 7
inner = pd.merge(df1, df2, on='captureDate', how='inner')
print (inner)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-25 30 4
2 2017-06-28 60 7
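The aggregation alternative mentioned above is not shown in the answer, so here is a hedged sketch of what it could look like. Taking the maximum of rule_id is purely an illustrative choice; with the real data a numeric column such as actual or predict would be a more natural thing to aggregate.

```python
import pandas as pd

dates = ['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2
df1 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [40, 10, 20, 30, 70, 10, 60, 10]})
df2 = pd.DataFrame({'captureDate': dates,
                    'rule_id': [1, 2, 3, 4, 5, 6, 7, 8]})

# Collapse each frame to one row per captureDate.
# max() is an assumption for illustration; use whatever summary fits the data.
agg1 = df1.groupby('captureDate', as_index=False)['rule_id'].max()
agg2 = df2.groupby('captureDate', as_index=False)['rule_id'].max()

# The key is now unique on both sides, so the merge returns one row per date.
inner = pd.merge(agg1, agg2, on='captureDate', how='inner')
print(inner)
```

Unlike drop_duplicates, which keeps only the first row of each date, aggregation lets every row contribute to the result.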
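EDIT1:
You can use cumcount to number the duplicates within each captureDate group, then merge on both captureDate and that counter. Finally, remove the helper column new with drop: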
df1 = pd.DataFrame({'captureDate':['2017-06-22']* 3 + ['2017-06-25']* 3 + ['2017-06-28'] * 2,
'rule_id':[40,10,20,30,70,10,60,10]})
df2 = pd.DataFrame({'captureDate':['2017-06-22'] * 3 + ['2017-06-25'] * 3,
'rule_id':[1,2,3,4,5,6]})
df1['new'] = df1.groupby('captureDate').cumcount()
df2['new'] = df2.groupby('captureDate').cumcount()
print (df1)
captureDate rule_id new
0 2017-06-22 40 0
1 2017-06-22 10 1
2 2017-06-22 20 2
3 2017-06-25 30 0
4 2017-06-25 70 1
5 2017-06-25 10 2
6 2017-06-28 60 0
7 2017-06-28 10 1
print (df2)
captureDate rule_id new
0 2017-06-22 1 0
1 2017-06-22 2 1
2 2017-06-22 3 2
3 2017-06-25 4 0
4 2017-06-25 5 1
5 2017-06-25 6 2
df3 = pd.merge(df1, df2, on=['captureDate','new']).drop('new', axis=1)
print (df3)
captureDate rule_id_x rule_id_y
0 2017-06-22 40 1
1 2017-06-22 10 2
2 2017-06-22 20 3
3 2017-06-25 30 4
4 2017-06-25 70 5
5 2017-06-25 10 6
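The cumcount approach can be wrapped in a small function so the input frames are not modified. This is only a sketch: merge_by_occurrence and the '_occurrence' column name are hypothetical names introduced here, not part of pandas or of the original answer.

```python
import pandas as pd


def merge_by_occurrence(left, right, on, how='inner'):
    """Pair the n-th row of each key in `left` with the n-th row in `right`."""
    helper = '_occurrence'  # hypothetical helper column name
    # assign returns copies, so the callers' frames stay unchanged.
    left = left.assign(**{helper: left.groupby(on).cumcount()})
    right = right.assign(**{helper: right.groupby(on).cumcount()})
    # validate raises MergeError if (key, counter) is not unique on a side.
    merged = pd.merge(left, right, on=[on, helper], how=how,
                      validate='one_to_one')
    return merged.drop(columns=helper)


df1 = pd.DataFrame({'captureDate': ['2017-06-22'] * 3 + ['2017-06-25'] * 3 + ['2017-06-28'] * 2,
                    'rule_id': [40, 10, 20, 30, 70, 10, 60, 10]})
df2 = pd.DataFrame({'captureDate': ['2017-06-22'] * 3 + ['2017-06-25'] * 3,
                    'rule_id': [1, 2, 3, 4, 5, 6]})

result = merge_by_occurrence(df1, df2, 'captureDate')
# how='left' keeps the unmatched 2017-06-28 rows, with NaN in rule_id_y.
left_result = merge_by_occurrence(df1, df2, 'captureDate', how='left')
print(result)
print(left_result)
```

Passing how='left' is a design choice for cases where rows of df1 without a partner in df2 should be kept rather than dropped.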