I want to find all rows that are duplicated across the city, round_latitude and round_longitude columns, so that any two rows sharing the same value in each of those columns are returned.
I'm not sure what's going on here: I'm certain the data set contains duplicates, yet running In [38] raises no error and returns only the column names with no rows. What am I doing wrong, and how can I fix it?
In case it helps, I'm also working from some of the code in this guide (HTML format).
# In[29]:
import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
    data.columns = ["date", "city", "coords", "doi", "latitude", "longitude",
                    "round_latitude", "round_longitude"]
    # Count how many rows share each (round_latitude, round_longitude, city) combination.
    data = data.groupby(["round_latitude", "round_longitude", "city"]).count()
    data = data.rename(columns={"date": "downloads"})
    return data["downloads"]
# In[30]:
downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)
# In[31]:
downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]
# In[32]:
downloads_by_coords.head()
# In[38]:
by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes
As requested, here are a few rows from the data:
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0
Answer (score: 1):
dl_by_loc(path)
returns a Series with a MultiIndex:
round_latitude round_longitude city
30.0 -95.0 Houston 1
40.0 -75.0 Philadelphia 3
Name: downloads, dtype: int64
If you look at the function definition, it groups the DataFrame by the round_latitude, round_longitude and city columns and counts the occurrences. Later, calling reset_index() turns the result back into a DataFrame, and the downloads column now holds how many times each (lat, lon, city) combination appeared in the original DataFrame. Because this is a groupby result, those combinations can never be duplicated: each one has already been aggregated into a single row. If you want the combinations that occurred more than once, filter on the count instead:
by_coords[by_coords['downloads']>1]
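On the sample rows above, for example, this would return just the Philadelphia group (downloads == 3) and drop Houston (downloads == 1).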
Your original approach still works on the raw DataFrame. Just be aware that dropping duplicates or grouping on float-typed columns carries some risks. Pandas usually handles them, but to be safe, if you only need one decimal place of precision you can multiply by 10 and convert to integers.
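A minimal sketch of both ideas (not the exact pipeline above): it assumes the CSV has no header row, as the sample lines suggest, and introduces hypothetical lat_key/lon_key columns for the integer comparison.

import pandas as pd

# Assumption: the file layout matches the sample rows above (no header line).
data = pd.read_csv(path + "USA_downloads.csv", header=None,
                   names=["date", "city", "coords", "doi", "latitude",
                          "longitude", "round_latitude", "round_longitude"])

# Mark every row whose (city, rounded lat, rounded lon) combination occurs more
# than once; keep=False flags all copies rather than only the later ones.
coord_dupes = data[data.duplicated(
    subset=["city", "round_latitude", "round_longitude"], keep=False)]

# Optional: compare on integer keys (1 decimal place of precision) instead of floats.
data["lat_key"] = (data["round_latitude"] * 10).round().astype(int)
data["lon_key"] = (data["round_longitude"] * 10).round().astype(int)
coord_dupes_int = data[data.duplicated(subset=["city", "lat_key", "lon_key"], keep=False)]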