I have the following two DataFrames:
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
'value':[4, 5, 6]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
'value':[14, 55, 36]})
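The problem with df1 is that its ['date'] column contains a value that cannot be parsed directly as a timestamp, so I used the following function: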
def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res
Then I created the new column as:
df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)
I want to merge the two DataFrames on ['date_new'], but no values match.
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
However,
df1['date_new'][0] == df2['date_new'][1]
returns True. The full code is as follows:
import pandas as pd
def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
'value':[4, 5, 6]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
'value':[14, 55, 36]})
df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
Please tell me why this happens. Thanks!
Answer 0 (score: 1)
pd.to_datetime has a convenient errors parameter that you can set to 'coerce'. With that, your code appears to work:
df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
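Note that because your dates are coerced, any value that does not fit a date format becomes NaT, and these NaT values will match each other in the merge. For example: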
df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31','xyz'],
'value':[4, 5, 6, 14]})
df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31','sss'],
'value':[14, 55, 36, 12]})
df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')
df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])
This results in:
>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
2  9999-12-31        6        NaT         sss       12
3         xyz       14        NaT         sss       12
To avoid these rows, you can merge only the subsets of the DataFrames where date_new is not null:
df3 = pd.merge(df1.loc[df1.date_new.notnull()], df2.loc[df2.date_new.notnull()], how = 'inner', on = ['date_new'])
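The same filter can also be written with dropna on the key column. The following is only a minimal sketch on made-up data: the names left, right, and merged are illustrative and not from the question, and 'xyz' and 'sss' stand in for values that cannot be parsed as dates.

```python
import pandas as pd

# Hypothetical sample data; 'xyz' and 'sss' are deliberately not dates.
left = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', 'xyz'],
                     'value': [4, 5, 14]})
right = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', 'sss'],
                      'value': [14, 55, 12]})

# Unparseable strings become NaT instead of raising.
left['date_new'] = pd.to_datetime(left['date'], errors='coerce')
right['date_new'] = pd.to_datetime(right['date'], errors='coerce')

# Drop rows whose key is NaT on both sides before merging,
# so the NaT keys cannot pair up with each other.
merged = pd.merge(left.dropna(subset=['date_new']),
                  right.dropna(subset=['date_new']),
                  how='inner', on='date_new')
print(merged)
```

Only the rows whose dates parsed on both sides should remain in merged.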
As for why this happens with your code: if an unparseable date is found, your function ends up producing a Series with dtype: object:
df1['date_new'] = df1['date'].apply(to_datetime)
>>> df1['date_new']
0 2012-12-31 00:00:00
1 2013-12-31 00:00:00
2 9999-12-31
Name: date_new, dtype: object
But when all the dates are fine, the Series has dtype: datetime64[ns]:
df2['date_new'] = df2['date'].apply(to_datetime)
>>> df2['date_new']
0 2013-12-31
1 2012-12-31
2 2010-12-31
Name: date_new, dtype: datetime64[ns]
So the two key columns have different dtypes (object vs. datetime64[ns]), and they do not merge correctly.
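One way to catch this kind of mismatch before merging is to compare the dtypes of the two key columns. This is a rough sketch on made-up frames (left, right, and the column name key are assumptions for illustration, not taken from the question), where one key column mixes a Timestamp with a plain string:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical data: a Timestamp next to a raw string forces dtype object.
left = pd.DataFrame({'key': [pd.Timestamp('2012-12-31'), 'not a date']})
# All values parse here, so this column gets a datetime64 dtype.
right = pd.DataFrame({'key': pd.to_datetime(['2012-12-31', '2013-12-31'])})

left_is_datetime = is_datetime64_any_dtype(left['key'])
right_is_datetime = is_datetime64_any_dtype(right['key'])

# Flag the mismatch instead of letting the merge silently misbehave.
if left_is_datetime != right_is_datetime:
    print('Key dtypes differ:', left['key'].dtype, 'vs', right['key'].dtype)
```

If the check fires, converting both columns with pd.to_datetime(..., errors='coerce') as shown earlier gives them a common datetime dtype.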