Problem with the pandas merge function after parsing a date column

Time: 2018-03-11 03:06:34

Tags: python pandas merge

I have the following two DataFrames:

df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
                    'value':[4, 5, 6]})

df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
                    'value':[14, 55, 36]}) 

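The problem with df1 is that its ['date'] column contains a value that cannot be parsed directly as a timestamp. So I used the following function: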

def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res

Then I created the new column as:

df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)

I want to merge the two DataFrames on ['date_new'], but no values match.

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

However,

df1['date_new'][0] == df2['date_new'][1]

returns True. The full code is below:

import pandas as pd

def to_datetime(x):
    try:
        res = pd.to_datetime(x)
    except:
        res = x
    return res

df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31'],
                    'value':[4, 5, 6]})

df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31'],
                    'value':[14, 55, 36]})

df1['date_new'] = df1['date'].apply(to_datetime)
df2['date_new'] = df2['date'].apply(to_datetime)

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

Please tell me why this happens. Thanks!

1 answer:

Answer 0: (score: 1)

pd.to_datetime has a convenient errors parameter, which you can set to 'coerce'. With that, your code seems to work:

df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])


>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14

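Note that because your dates are coerced, any value that cannot be converted to a timestamp (such as 9999-12-31 or a non-date string here) shows up as NaT, and these NaT values will match each other in the merge. For example: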

df1 = pd.DataFrame({'date':['2012-12-31', '2013-12-31', '9999-12-31','xyz'],
                    'value':[4, 5, 6, 14]})

df2 = pd.DataFrame({'date':['2013-12-31', '2012-12-31', '2010-12-31','sss'],
                    'value':[14, 55, 36, 12]})

df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')

df3 = pd.merge(df1, df2, how = 'inner', on = ['date_new'])

This results in:

>>> df3
       date_x  value_x   date_new      date_y  value_y
0  2012-12-31        4 2012-12-31  2012-12-31       55
1  2013-12-31        5 2013-12-31  2013-12-31       14
2  9999-12-31        6        NaT         sss       12
3         xyz       14        NaT         sss       12
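To see which original strings were turned into NaT, one option is to filter on the coerced column. This is only a small sketch with made-up data; 'xyz' stands in for any value that cannot be parsed, and the frame name df is chosen just for illustration.

```python
import pandas as pd

# Illustrative data: 'xyz' is an assumed unparseable value.
df = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', 'xyz'],
                   'value': [4, 5, 14]})

# Unparseable entries become NaT instead of raising an error.
df['date_new'] = pd.to_datetime(df['date'], errors='coerce')

# Select the original strings whose parsed value is missing.
failed = df.loc[df['date_new'].isna(), 'date']
print(failed.tolist())
```

Checking this list before merging makes it easier to decide whether the coerced rows should be dropped or handled separately.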

To avoid these rows, you can merge only the subsets of the DataFrames where date_new is not null:

df3 = pd.merge(df1.loc[df1.date_new.notnull()], df2.loc[df2.date_new.notnull()], how = 'inner', on = ['date_new'])
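The same filter can probably be written with DataFrame.dropna(subset=...), which some readers find easier to scan. The sketch below uses shortened, made-up data ('xyz' and 'sss' are assumed unparseable values) and the hypothetical names left and right for the filtered frames.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2012-12-31', '2013-12-31', 'xyz'],
                    'value': [4, 5, 14]})
df2 = pd.DataFrame({'date': ['2013-12-31', '2012-12-31', 'sss'],
                    'value': [14, 55, 12]})

# Unparseable strings become NaT instead of raising.
df1['date_new'] = pd.to_datetime(df1['date'], errors='coerce')
df2['date_new'] = pd.to_datetime(df2['date'], errors='coerce')

# Drop rows with a missing key so NaT cannot pair with NaT in the merge.
left = df1.dropna(subset=['date_new'])
right = df2.dropna(subset=['date_new'])

df3 = pd.merge(left, right, how='inner', on='date_new')
print(df3)
```

Only the rows with real, matching dates should remain in df3.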

As for why this happens in your code: if an unparseable date is found, your function ends up returning a Series with dtype: object

df1['date_new'] = df1['date'].apply(to_datetime)
>>> df1['date_new']
0    2012-12-31 00:00:00
1    2013-12-31 00:00:00
2             9999-12-31
Name: date_new, dtype: object

But when all the dates are fine, the Series has dtype: datetime64[ns]

df2['date_new'] = df2['date'].apply(to_datetime)
>>> df2['date_new']
0   2013-12-31
1   2012-12-31
2   2010-12-31
Name: date_new, dtype: datetime64[ns]

So the two key columns have different dtypes (object vs. datetime64[ns]), and they do not merge correctly.
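A quick way to spot this kind of mismatch is to compare the key dtypes before merging. The sketch below reuses the idea of the to_datetime helper from the question, with 'xyz' as an assumed unparseable value and s1, s2 as illustrative names. It only inspects dtypes and does not attempt the merge, since what happens when an object column is merged with a datetime column may differ between pandas versions.

```python
import pandas as pd

def to_datetime(x):
    # Same fallback idea as in the question: keep the raw value on failure.
    try:
        return pd.to_datetime(x)
    except Exception:
        return x

# One unparseable string mixed in with a valid date.
s1 = pd.Series(['2012-12-31', 'xyz']).apply(to_datetime)
# All values parse cleanly.
s2 = pd.Series(['2013-12-31', '2012-12-31']).apply(to_datetime)

print(s1.dtype)  # mixed Timestamp and str values are stored as object
print(pd.api.types.is_datetime64_any_dtype(s1))
print(pd.api.types.is_datetime64_any_dtype(s2))
```

If the two checks disagree, converting both columns with pd.to_datetime(..., errors='coerce') before merging gives them a common datetime dtype.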