No output from .duplicated in pandas?

Date: 2016-08-07 20:46:34

Tags: python pandas numpy

I want to find all rows that are duplicated across the city, round_latitude and round_longitude columns, i.e. if two rows share the same value in each of those columns, they should be returned.

I'm not sure what's going on here: I'm certain there are duplicates in the dataset, yet running In [38] raises no error and returns only the column names with no rows. What am I doing wrong, and how can I fix it?

In case it helps, I'm also working through some of the code from this guide (formatted as HTML).

# In[29]:

import pandas as pd

def dl_by_loc(path):
    endname = "USA_downloads.csv"
    with open(path + endname, "r") as f:
        data = pd.read_csv(f)
        data.columns = ["date","city","coords","doi","latitude","longitude","round_latitude","round_longitude"]
        # Count downloads per (round_latitude, round_longitude, city) group
        data = data.groupby(["round_latitude","round_longitude","city"]).count()
        data = data.rename(columns={"date":"downloads"})
        return data["downloads"]


# In[30]:

downloads_by_coords = dl_by_loc(path)
len(downloads_by_coords)


# In[31]:

downloads_by_coords = downloads_by_coords.reset_index()
downloads_by_coords.columns = ["round_latitude","round_longitude","city","downloads"]


# In[32]:

downloads_by_coords.head()


# In[38]:

by_coords = downloads_by_coords.reset_index()
coord_dupes = by_coords[by_coords.duplicated(subset=["round_latitude","round_longitude","city"])]
coord_dupes

As requested, here are a few rows of the data:

2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0

1 answer:

Answer 0 (score: 1)

dl_by_loc(path) returns a Series with a MultiIndex:

round_latitude  round_longitude  city        
30.0            -95.0            Houston         1
40.0            -75.0            Philadelphia    3
Name: downloads, dtype: int64

If you look at the function definition, it groups the DataFrame by the round_latitude, round_longitude and city columns and counts the occurrences. Later you turn the result back into a DataFrame by calling reset_index(), so the downloads column now holds how many times each lat/lon/city combination appeared in the original DataFrame. Because this is a groupby result, those combinations can no longer be duplicated: they have already been aggregated. If you want to find out which combinations were duplicated in the original data, filter on the count instead:

by_coords[by_coords['downloads']>1]
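A minimal, self-contained sketch of that point, run against the four sample rows from the question (io.StringIO stands in for the CSV file; the column names come from the question, everything else is plain pandas):

import io
import pandas as pd

# The four sample rows from the question, with no header line.
raw = """2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1042/BJ20091140,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1096/fj.05-5309fje,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:19,Philadelphia,"39.9525839,-75.1652215",10.1186/1478-811X-11-15,39.9525839,-75.1652215,40.0,-75.0
2016-02-16 00:32:21,Houston,"29.7604267,-95.3698028",10.1039/P19730002379,29.7604267,-95.36980279999999,30.0,-95.0"""

cols = ["date","city","coords","doi","latitude","longitude","round_latitude","round_longitude"]
data = pd.read_csv(io.StringIO(raw), names=cols)

# Reproduce the pipeline from the question: group, count, reset_index.
downloads = data.groupby(["round_latitude","round_longitude","city"]).count()["date"].rename("downloads")
by_coords = downloads.reset_index()

# Every (lat, lon, city) combination appears exactly once after the groupby,
# so .duplicated() finds nothing ...
print(by_coords.duplicated(subset=["round_latitude","round_longitude","city"]).any())  # False

# ... but the aggregated count shows which combinations were duplicated
# in the raw data (Philadelphia, 3 downloads).
print(by_coords[by_coords["downloads"] > 1])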

Your approach would still work on the original DataFrame. Note, though, that dropping duplicates or grouping on float-typed data carries some risks. Pandas usually handles them, but to be safe, if you only need 1-decimal precision you can multiply by 10 and convert to integers.
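A hedged sketch of that last suggestion, assuming you are working on the raw (pre-groupby) DataFrame at 1-decimal precision; the lat_key and lon_key helper columns are hypothetical names introduced here for illustration:

import pandas as pd

# Toy stand-in for the raw DataFrame (before any groupby).
data = pd.DataFrame({
    "city": ["Philadelphia", "Philadelphia", "Houston"],
    "round_latitude": [40.0, 40.0, 30.0],
    "round_longitude": [-75.0, -75.0, -95.0],
})

# Multiply by 10 and convert to integers so that duplicate detection and
# grouping compare exact ints rather than floats.
data["lat_key"] = (data["round_latitude"] * 10).round().astype(int)
data["lon_key"] = (data["round_longitude"] * 10).round().astype(int)

# .duplicated() on the integer keys flags the second Philadelphia row.
print(data[data.duplicated(subset=["lat_key", "lon_key", "city"])])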