No output from .duplicated in pandas?

Date: 2016-08-07 20:46:34

Tags: python pandas numpy

I want to find all rows that are duplicated across the city, round_latitude and round_longitude columns, i.e. if two rows share the same value in each of those columns, they should be returned.

I'm not sure what's going on here: I'm certain there are duplicates in the dataset, yet running In [38] raises no error and returns only the column names with no rows. What am I doing wrong, and how can I fix it?

In case it helps, I'm also working through some of the code from this guide (formatted as HTML).

# In[29]:

import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
        data.columns = ["date","city","coords","doi","latitude","longitude","round_latitude","round_longitude"]
        # Count downloads per (round_latitude, round_longitude, city) group
        data = data.groupby(["round_latitude","round_longitude","city"]).count()
        data = data.rename(columns={"date":"downloads"})
        return data["downloads"]


# In[30]:

downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)


# In[31]:

downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]


# In[32]:

downloads_by_coords.head()


# In[38]:

by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes

As requested, here are a few rows of the data:

2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0

1 answer:

Answer 0 (score: 1)

dl_by_loc(path) returns a Series with a MultiIndex:

round_latitude  round_longitude  city        
30.0            -95.0            Houston         1
40.0            -75.0            Philadelphia    3
Name: downloads, dtype: int64

If you look at the function definition, it groups the DataFrame by the round_latitude, round_longitude and city columns and counts the occurrences. Later you turn the result back into a DataFrame by calling reset_index(), so the downloads column now holds how many times each lat/lon/city combination appeared in the original DataFrame. Because this is a groupby result, those combinations can no longer be duplicated: they have already been aggregated. If you want to find out which combinations were duplicated in the original data, filter on the count instead:

by_coords[by_coords['downloads']>1]
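A minimal, self-contained sketch of that point, run against the four sample rows from the question (io.StringIO stands in for the CSV file; the column names come from the question, everything else is plain pandas):

import io
import pandas as pd

# The four sample rows from the question, with no header line.
raw = """2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0"""

cols = ["date","city","coords","doi","latitude","longitude","round_latitude","round_longitude"]
data = pd.read_csv(io.StringIO(raw), names=cols)

# Reproduce the pipeline from the question: group, count, reset_index.
downloads = data.groupby(["round_latitude","round_longitude","city"]).count()["date"].rename("downloads")
by_coords = downloads.reset_index()

# Every (lat, lon, city) combination appears exactly once after the groupby,
# so .duplicated() finds nothing ...
print(by_coords.duplicated(subset=["round_latitude","round_longitude","city"]).any())  # False

# ... but the aggregated count shows which combinations were duplicated
# in the raw data (Philadelphia, 3 downloads).
print(by_coords[by_coords["downloads"] > 1])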

Your approach would still work on the original DataFrame. Note, though, that dropping duplicates or grouping on float-typed data carries some risks. Pandas usually handles them, but to be safe, if you only need 1-decimal precision you can multiply by 10 and convert to integers.
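A hedged sketch of that last suggestion, assuming you are working on the raw (pre-groupby) DataFrame at 1-decimal precision; the lat_key and lon_key helper columns are hypothetical names introduced here for illustration:

import pandas as pd

# Toy stand-in for the raw DataFrame (before any groupby).
data = pd.DataFrame({
    "city": ["Philadelphia", "Philadelphia", "Houston"],
    "round_latitude": [40.0, 40.0, 30.0],
    "round_longitude": [-75.0, -75.0, -95.0],
})

# Multiply by 10 and convert to integers so that duplicate detection and
# grouping compare exact ints rather than floats.
data["lat_key"] = (data["round_latitude"] * 10).round().astype(int)
data["lon_key"] = (data["round_longitude"] * 10).round().astype(int)

# .duplicated() on the integer keys flags the second Philadelphia row.
print(data[data.duplicated(subset=["lat_key", "lon_key", "city"])])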