Compare two dataframes and remove the dates that differ

Time: 2017-07-04 10:19:27

Tags: python pandas dataframe

I have two dataframes and want to compare them and remove the days in df2 that are not in df1. I tried:

df2[~df2.Date.isin(df1.Date)]

But this does not work and returns an empty dataframe. df2 should end up looking like df1. The dataframes look like this:

df1
        Date
0    20-12-16
1    21-12-16
2    22-12-16
3    23-12-16
4    27-12-16
5    28-12-16
6    29-12-16
7    30-12-16
8    02-01-17
9    03-01-17
10   04-01-17
11   05-01-17
12   06-01-17

df2

         Date
0    20-12-16
1    21-12-16
2    22-12-16
3    23-12-16
4    24-12-16
5    25-12-16
6    26-12-16
7    27-12-16
8    28-12-16
9    29-12-16
10   30-12-16
11   31-12-16
12   01-01-17
13   02-01-17
14   03-01-17
15   04-01-17
16   05-01-17
17   06-01-17

2 Answers:

Answer 0: (score: 3)

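It seems the dtypes of the two Date columns differ; they need to be the same for the comparison. You can check them with print (df1.Date.dtype) and print (df2.Date.dtype).

If necessary, convert both columns: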

df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')

Then select the rows of df2 whose dates appear in df1:

df = df2[np.in1d(df2.Date, df1.Date)]
print (df)
         Date
0  2016-12-20
1  2016-12-21
2  2016-12-22
3  2016-12-23
7  2016-12-27
8  2016-12-28
9  2016-12-29
10 2016-12-30
13 2017-01-02
14 2017-01-03
15 2017-01-04
16 2017-01-05
17 2017-01-06
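
The check-convert-filter steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: the two frames are shortened stand-ins for the question's data, and np.isin is used in place of np.in1d (it performs the same membership test here).

```python
import numpy as np
import pandas as pd

# Shortened, illustrative versions of the two frames from the question.
df1 = pd.DataFrame({"Date": ["23-12-16", "27-12-16", "02-01-17"]})
df2 = pd.DataFrame(
    {"Date": ["23-12-16", "24-12-16", "27-12-16", "01-01-17", "02-01-17"]}
)

# Parse both columns with the same explicit day-month-year format,
# so that both end up with the same datetime dtype.
for frame in (df1, df2):
    frame["Date"] = pd.to_datetime(frame["Date"], format="%d-%m-%y")

same_dtype = df1["Date"].dtype == df2["Date"].dtype

# Boolean mask: True where a df2 date also occurs in df1.
mask = np.isin(df2["Date"].to_numpy(), df1["Date"].to_numpy())
filtered = df2[mask]
print(filtered)
```

Because the mask is applied to df2, the surviving rows keep their original df2 index labels.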

I added two more solutions: the first uses numpy.in1d and the second uses merge, because the default inner join is what is needed:

df = df1.merge(df2, on='Date')
print (df)
         Date
0  2016-12-20
1  2016-12-21
2  2016-12-22
3  2016-12-23
4  2016-12-27
5  2016-12-28
6  2016-12-29
7  2016-12-30
8  2017-01-02
9  2017-01-03
10 2017-01-04
11 2017-01-05
12 2017-01-06
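
One difference between the merge and the boolean-mask approaches is the index of the result. The sketch below, on shortened illustrative data (not the question's full frames), compares the two; how="inner" is spelled out only for clarity, since it is the default.

```python
import pandas as pd

fmt = "%d-%m-%y"
# Shortened, illustrative stand-ins for df1 and df2.
df1 = pd.DataFrame(
    {"Date": pd.to_datetime(["23-12-16", "27-12-16", "02-01-17"], format=fmt)}
)
df2 = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["23-12-16", "24-12-16", "27-12-16", "01-01-17", "02-01-17"],
            format=fmt,
        )
    }
)

# Inner join on Date: only dates present in both frames survive,
# and the result gets a fresh 0..n-1 index.
merged = df1.merge(df2, on="Date", how="inner")

# Boolean mask on df2: same dates, but df2's index labels are kept.
masked = df2[df2["Date"].isin(df1["Date"])]

print(merged.index.tolist())
print(masked.index.tolist())
```

Note that merge repeats rows when a date occurs more than once in either frame, whereas the mask never adds rows to df2.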

Sample:

import numpy as np
import pandas as pd

d1 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
d2 = {'Date': ['20-12-16', '21-12-16', '22-12-16', '23-12-16', '24-12-16', '25-12-16', '26-12-16', '27-12-16', '28-12-16', '29-12-16', '30-12-16', '31-12-16', '01-01-17', '02-01-17', '03-01-17', '04-01-17', '05-01-17', '06-01-17']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)

print (df1.Date.dtype)
object

print (df2.Date.dtype)
object

df1['Date'] = pd.to_datetime(df1['Date'], format='%d-%m-%y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d-%m-%y')
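
As a small illustration of why the explicit format matters: a string such as '02-01-17' is ambiguous between 2 January and 1 February, and format='%d-%m-%y' pins it to day-month-year. The two-element series below is made up for the example.

```python
import pandas as pd

# Two illustrative strings in day-month-year order.
raw = pd.Series(["02-01-17", "20-12-16"])

# With an explicit format there is no day/month guessing.
parsed = pd.to_datetime(raw, format="%d-%m-%y")

print(parsed.dtype)
print(parsed.iloc[0])  # 2 January 2017, not 1 February
```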

Answer 1: (score: 0)

Your error is in the logic. You want to select the rows of df2 whose dates are in df1, so you should write

df2[df2.Date.isin(df1.Date)]

rather than negating (with ~) the boolean mask that is True where a date is contained in df1.
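
To make the effect of the ~ operator concrete, here is a minimal sketch on made-up, shortened data: the plain mask keeps the dates shared with df1, and the negated mask keeps exactly the ones that should be removed.

```python
import pandas as pd

# Shortened, illustrative stand-ins for df1 and df2 (plain strings).
df1 = pd.DataFrame({"Date": ["23-12-16", "27-12-16"]})
df2 = pd.DataFrame({"Date": ["23-12-16", "24-12-16", "25-12-16", "27-12-16"]})

# True where the df2 date also occurs in df1.
in_df1 = df2["Date"].isin(df1["Date"])

kept = df2[in_df1]      # rows whose dates are in df1
dropped = df2[~in_df1]  # the negation: rows whose dates are NOT in df1

print(kept)
print(dropped)
```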

You can also get the same dates with

set(df2.Date)-(set(df2.Date)-set(df1.Date))

which should then be used as follows:

pd.DataFrame(sorted(set(df2.Date)-(set(df2.Date)-set(df1.Date))), columns=["Date"])

Sorting this way is not optimal, though; you can do it with better logic in pandas.

df = pd.DataFrame(list(set(df2.Date)-(set(df2.Date)-set(df1.Date))), columns=["Date"])
df.Date = [date.date() for date in df.Date]

or df.Date.dt.date

(see How do I convert dates in a Pandas data frame to a 'date' data type?)
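
The set expression in this answer is a set intersection written the long way round. A short sketch, assuming both Date columns have already been parsed to datetimes and using shortened illustrative data, checks that equivalence and then applies the .dt.date conversion mentioned above:

```python
import pandas as pd

fmt = "%d-%m-%y"
# Shortened, illustrative stand-ins for df1 and df2.
df1 = pd.DataFrame(
    {"Date": pd.to_datetime(["23-12-16", "27-12-16", "02-01-17"], format=fmt)}
)
df2 = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["23-12-16", "24-12-16", "27-12-16", "01-01-17", "02-01-17"],
            format=fmt,
        )
    }
)

# B - (B - A) is the same set as the intersection of B and A.
long_form = set(df2["Date"]) - (set(df2["Date"]) - set(df1["Date"]))
common = set(df2["Date"]) & set(df1["Date"])
equivalent = long_form == common

# Sets are unordered, so sort before building the frame.
result = pd.DataFrame(sorted(common), columns=["Date"])

# Convert the timestamps to plain datetime.date objects.
result["Date"] = result["Date"].dt.date
print(result)
```

Unlike the boolean mask, this route discards df2's index and any other columns df2 might have, so it fits best when only the dates themselves are needed.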