Question

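I have a dataframe containing "tag information" for different companies on the iPad and Tablet platforms. Each "experiment" has an id, which can appear multiple times depending on how many tags the experiment has. An experiment can run on iPad or Tablet (type), but I want to remove all duplicate experiments (the same experiment appearing for both iPad and Tablet). An experiment is a duplicate if it comes from the same company and has exactly the same tags. For example, in the dataframe below, Netflix is duplicated because it has the same tags (Includes dropdown, Includes product list) for iPad and Tablet. So either the Tablet version or the iPad version should be removed.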

Input:

id  company   type       tag
1   Netflix   iPad       Includes dropdown
1   Netflix   iPad       Includes product list
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

Output:

id  company   type       tag
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

I am looking for a solution in Python with pandas. How can I do this?

I have already tried

df.drop_duplicates(subset=['tag'], keep='last')

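But I don't think this solution works, because there could be another experiment from a different company that has the same tags. That experiment would be removed even though it is not considered a duplicate.

Basically, I want to remove ids that have the same tags within the same company.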

Answer 1

I think you just need to add the company name to your subset argument. Let's build a dataframe like the one you describe:

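import pandas as pd

# Sample data: ids 3 and 4 share a tag but belong to different companies.
# The lists are named ids/types to avoid shadowing the builtins id and type.
ids = [1, 1, 2, 2, 3, 4]
company = ['Netflix']*4 + ['Apple'] + ['New']
types = ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet']
tag = ['Includes dropdown', 'Includes product list']*2 + ['Includes images']*2
data = {'id': ids, 'company': company, 'type': types, 'tag': tag}
df = pd.DataFrame(data)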

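Printing df gives this dataframe:

id  company   type       tag
1   Netflix   iPad       Includes dropdown
1   Netflix   iPad       Includes product list
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   New       Tablet     Includes images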

You can see that ids 3 and 4 have the same tag but different company names, as you mentioned. If we only use the code you tried:

df.drop_duplicates(subset=['tag'], keep='last')

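we get this:

id  company   type       tag
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
4   New       Tablet     Includes images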

In the result above, id 3 has been removed, which is what you want to avoid. However, if we simply add company to the subset:

df.drop_duplicates(subset=['company', 'tag'], keep='last')

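we get what you want:

id  company   type       tag
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   New       Tablet     Includes images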
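One caveat about this approach: drop_duplicates compares individual rows, so two experiments from the same company that share only some of their tags would each lose the shared rows. If a duplicate should mean that the complete set of tags matches, as the question states, a minimal sketch could compare tag sets per id instead. The experiment with id 5 is a hypothetical addition for illustration only, and the names tag_sets, experiments, kept_ids, row_level and result are my own.

```python
import pandas as pd

# Sample data from the answer, plus a hypothetical id 5 (assumption, not in
# the original post) that shares only one tag with the Netflix experiments.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 4, 5],
    "company": ["Netflix"] * 4 + ["Apple", "New", "Netflix"],
    "type": ["iPad", "iPad", "Tablet", "Tablet", "iPad", "Tablet", "iPad"],
    "tag": ["Includes dropdown", "Includes product list"] * 2
           + ["Includes images"] * 2
           + ["Includes dropdown"],
})

# Row-level deduplication: id 2 loses its "Includes dropdown" row to id 5.
row_level = df.drop_duplicates(subset=["company", "tag"], keep="last")

# Experiment-level deduplication: collapse each id to one row holding its
# company and the full set of its tags (frozenset is hashable, so it can be
# used as a deduplication key).
tag_sets = df.groupby("id")["tag"].apply(frozenset)
companies = df.groupby("id")["company"].first()
experiments = pd.DataFrame({"company": companies, "tags": tag_sets}).reset_index()

# Keep the last id for every (company, tag set) combination.
kept_ids = experiments.drop_duplicates(subset=["company", "tags"], keep="last")["id"]

# Filter the original rows down to the surviving experiments.
result = df[df["id"].isin(kept_ids)]
print(result)
```

With this data, only id 1 is dropped, because it is the only experiment whose company and full tag set both match another one (id 2). Use keep="first" if the earlier id should survive instead.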
