Pandas df.drop_duplicates() does not remove duplicates; df.sort() does not sort

Time: 2015-06-23 23:39:10

Tags: python pandas dataframe

I'm not sure what is going on, but neither sorting by date nor drop_duplicates() works on my dataset.

I am concatenating two datasets like this:

df = df.reset_index(drop=True)
test_file = pd.read_csv('test_df.csv',index_col=0,encoding="utf_8")
test_file = test_file.reset_index(drop=True)

#compare csv with new results
df3 = pd.concat([df,test_file])
df3 = df3.reset_index(drop=True)
df3.groupby(list(df3.columns)).filter(lambda df3:df3.shape[0] == 1)
df3 = df3.reset_index(drop=True)
df3 = df3.sort('_date')
df3 = df3.drop_duplicates()

This does not seem to remove duplicates at all, or to sort by date.

For example, later in the file:

462,,,,,,51.0,,,,,,,,,,,,,,,37.0,,,2015-06-22 00:00:00,General Election: Walker vs. Clinton,NBC News/Wall St. Jrnl
463,,40.0,,,,48.0,,,,,,,,,,,,,,,,,,2015-06-22 00:00:00,General Election: Bush vs. Clinton,NBC News/Wall St. Jrnl

And toward the top:

222,,,,,,51.0,,,,,,,,,,,,,,,37.0,,,2015-06-22 00:00:00,General Election: Walker vs. Clinton,NBC News/Wall St. Jrnl
223,,40.0,,,,48.0,,,,,,,,,,,,,,,,,,2015-06-22 00:00:00,General Election: Bush vs. Clinton,NBC News/Wall St. Jrnl

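As you can see, these rows are identical apart from the index. The dates are the same too, yet the rows are not sorted by date. Could this be a type problem?

Column names: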

,Biden,Bush,Carson,Chafee,Christie,Clinton,Cruz,Fiorina,Graham,Huckabee,Jindal,Kasich,O'Malley,Pataki,Paul,Perry,Rubio,Sanders,Santorum,Trump,Walker,Warren,Webb,_date,_poll,_pollname

Any ideas?

1 Answer:

Answer 0: (score: 1)

As noted above, the dates were actually date strings rather than datetime objects.
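A quick way to confirm this kind of type problem, as a minimal sketch: the two small frames below are hypothetical stand-ins for df and test_file, under the assumption that one holds real Timestamp values and the other holds strings read back from CSV. The Clinton values and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames in the question.
fresh = pd.DataFrame({"_date": [pd.Timestamp("2015-06-22")], "Clinton": [51.0]})
from_csv = pd.DataFrame({"_date": ["2015-06-22 00:00:00"], "Clinton": [51.0]})

combined = pd.concat([fresh, from_csv]).reset_index(drop=True)

# An object dtype on a date column hints that it holds strings or mixed types.
print(combined["_date"].dtype)

# List the Python types actually stored in the column.
type_names = sorted(combined["_date"].map(lambda v: type(v).__name__).unique())
print(type_names)

# A Timestamp and a string never compare equal, so neither row is dropped.
rows_after_dedup = len(combined.drop_duplicates())
print(rows_after_dedup)
```

If the type listing shows more than one type, rows that look alike when printed can still differ as values, which would be consistent with the behaviour described in the question.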

The simple fix is:

df3['_date'] = pd.to_datetime(df3['_date'])

After this change, drop_duplicates() also started working properly.
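To illustrate the effect of the conversion, here is a small self-contained sketch. The frames and their values are made up for the example, and sort_values() is used in place of the older sort() call, which is not available in current pandas.

```python
import pandas as pd

# Hypothetical data: one frame with a real Timestamp, one with CSV-style strings.
fresh = pd.DataFrame({
    "_date": [pd.Timestamp("2015-06-22")],
    "_poll": ["General Election: Bush vs. Clinton"],
    "Clinton": [48.0],
})
from_csv = pd.DataFrame({
    "_date": ["2015-06-20 00:00:00", "2015-06-22 00:00:00"],
    "_poll": ["General Election: Bush vs. Clinton"] * 2,
    "Clinton": [47.0, 48.0],
})

combined = pd.concat([fresh, from_csv]).reset_index(drop=True)

# Normalize the column to a single datetime dtype before comparing rows.
combined["_date"] = pd.to_datetime(combined["_date"])

# With one consistent type, identical rows compare equal and sorting is chronological.
deduped = combined.drop_duplicates().sort_values("_date").reset_index(drop=True)
print(deduped)
```

In this sketch the 2015-06-22 row appears once after deduplication, and the 2015-06-20 row sorts ahead of it.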

Final code:

# compare csv with new results
df3 = pd.concat([df, test_file])
df3 = df3.reset_index(drop=True)
# the groupby/filter step was unnecessary and has been removed
# convert the date strings to datetime before deduplicating and sorting
df3['_date'] = pd.to_datetime(df3['_date'])
df3 = df3.drop_duplicates()
# DataFrame.sort() was removed in later pandas releases; sort_values() replaces it
df3 = df3.sort_values('_date')
df3 = df3.reset_index(drop=True)
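An alternative worth considering, sketched here under assumptions: the dates can be parsed while the CSV is loaded, using the parse_dates parameter of pd.read_csv, so the column never arrives as strings. The inline CSV text and the fresh frame below are hypothetical and only mimic the layout shown in the question.

```python
import io

import pandas as pd

# Hypothetical CSV content with the same shape as the file in the question.
csv_text = """,Bush,Clinton,_date,_poll
0,40.0,48.0,2015-06-22 00:00:00,General Election: Bush vs. Clinton
1,,51.0,2015-06-22 00:00:00,General Election: Walker vs. Clinton
"""

# parse_dates converts the named column to datetime while reading.
test_file = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["_date"])

# Hypothetical new result that duplicates the first CSV row.
fresh = pd.DataFrame({
    "Bush": [40.0],
    "Clinton": [48.0],
    "_date": [pd.Timestamp("2015-06-22")],
    "_poll": ["General Election: Bush vs. Clinton"],
})

df3 = pd.concat([fresh, test_file]).reset_index(drop=True)
df3 = df3.drop_duplicates().sort_values("_date").reset_index(drop=True)
print(df3)
```

Parsing at load time keeps the type handling in one place, though an explicit pd.to_datetime call after concatenation, as in the answer, achieves the same result.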