Question

I have the following DataFrame:

import pandas as pd

df = pd.DataFrame([[1990,7,1000],[1990,8,2500],[1990,9,2500],[1990,9,1500],[1991,1,250],[1991,2,350],[1991,3,350],[1991,7,450]], columns = ['year','month','data1'])

year    month    data1
1990      7      1000
1990      8      2500
1990      9      2500
1990      9      1500
1991      1      250
1991      2      350
1991      3      350
1991      7      450

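I want to filter the data so that it excludes the rows for month/year 07/1990, 08/1990 and 01/1991. For a single month/year combination I can do this:

df = df.loc[(df.year != 1990) | (df.month != 7)]

But this becomes inefficient when there are many month/year combinations. Is there a more efficient way to do this?

Thanks a lot.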

Answer 1

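You can do it like this:

# Turn each (year, month) row into a tuple, then keep rows whose tuple is not in the list
mask = ~df[['year', 'month']].apply(tuple, axis=1).isin([(1990, 7), (1990, 8), (1991, 1)])
print(df[mask])

Output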

   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450
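
A possible variant of the same idea, as a sketch only: instead of building tuples row by row with apply, build a MultiIndex from the two columns and call isin on it. The name pairs_to_exclude is made up for this illustration and is not from the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# Hypothetical name: the (year, month) pairs we want to filter out.
pairs_to_exclude = [(1990, 7), (1990, 8), (1991, 1)]

# Pair up the two columns as a MultiIndex (no Python-level loop over rows),
# then test each pair for membership. isin returns a boolean NumPy array.
pair_index = pd.MultiIndex.from_frame(df[["year", "month"]])
mask = ~pair_index.isin(pairs_to_exclude)

out = df[mask]
print(out)
```

The original row labels are preserved, because the MultiIndex is only used to build the mask and is never assigned to df.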

Answer 2

Even faster (roughly 3x faster than @DaniMesejo's elegant apply-tuple version). But it relies on knowing that month is bounded by (well below) 100, so it is less general:

mask = ~(df.year*100 + df.month).isin({199007, 199008, 199101})
df[mask]

# out:
   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450

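Why is this 3x faster than the tuple solution? (Speed tips):

All operations are vectorized; there is no apply.
No string operations; everything is integers.
.isin() is called with a set as its argument (not a list).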
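
If you want to check the speed claim on your own machine, here is a rough timing sketch. The repeat factor of 2000 and the helper names mask_tuple and mask_int are arbitrary choices for this illustration; actual timings will depend on your pandas version and hardware, so no ratio is asserted here.

```python
import timeit

import pandas as pd

# Sample frame from the question, repeated to make timings less noisy.
base = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)
big = pd.concat([base] * 2000, ignore_index=True)


def mask_tuple(frame):
    # Tuple approach: one Python tuple per row, then a membership test.
    pairs = frame[["year", "month"]].apply(tuple, axis=1)
    return ~pairs.isin([(1990, 7), (1990, 8), (1991, 1)])


def mask_int(frame):
    # Integer approach: encode (year, month) as year*100 + month, vectorized.
    return ~(frame.year * 100 + frame.month).isin({199007, 199008, 199101})


# Both masks should select exactly the same rows.
same_rows = mask_tuple(big).equals(mask_int(big))

t_tuple = timeit.timeit(lambda: mask_tuple(big), number=5)
t_int = timeit.timeit(lambda: mask_int(big), number=5)
print(f"same rows: {same_rows}")
print(f"apply(tuple): {t_tuple:.4f}s   year*100 + month: {t_int:.4f}s")
```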

Answer 3

Let's try merge. A small improvement:

out = df.drop(df.reset_index().merge(pd.DataFrame({'year':[1990,1990,1991],'month':[7,8,1]}))['index'])
   year  month  data1
2  1990      9   2500
3  1990      9   1500
5  1991      2    350
6  1991      3    350
7  1991      7    450

Based on my tests, this should be faster than the apply-tuple approach.
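
A related way to write the merge idea, offered as a sketch rather than as the approach this answer used: do a left merge with indicator=True and keep only the rows that found no match (an "anti-join"). The frame name exclude is made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# Hypothetical helper frame holding the (year, month) pairs to exclude.
exclude = pd.DataFrame({"year": [1990, 1990, 1991], "month": [7, 8, 1]})

# Left-merge on the shared columns. indicator=True adds a "_merge" column
# whose value is "left_only" for rows that have no match in `exclude`.
merged = df.merge(exclude, on=["year", "month"], how="left", indicator=True)
out = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(out)
```

This assumes the pairs in exclude are unique; duplicated pairs there would duplicate matching rows in the merge before they are filtered out.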

Answer 4

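[Edit] Just iterate over your df and collect the index label of every row you want to drop:

dates_to_avoid = [(1990, 7), (1990, 8), (1991, 1)]
# itertuples() exposes the row label as row.Index (capital I)
index_to_delete = [row.Index for row in df.itertuples() if (row.year, row.month) in dates_to_avoid]

Then:

df = df.loc[~df.index.isin(index_to_delete)]

Here the ~ operator negates the boolean mask, so every row whose index is in the list is excluded.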

Answer 5

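You can add a yyyymm column and then use it to drop the data you don't want.

# .str.zfill(2) pads single-digit months with a leading zero, e.g. 7 -> '07'
df['yyyymm'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
df = df.loc[(df.yyyymm != '199007') & (df.yyyymm != '199008') & (df.yyyymm != '199101')]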
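
With many combinations, chaining != comparisons gets long. A minimal sketch of the same string-key idea using isin instead; the names yyyymm (as a standalone Series) and keys_to_drop are made up for this example, and the key is kept out of df so no extra column is added.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# Build a zero-padded "yyyymm" string key; .str.zfill works element-wise.
yyyymm = df["year"].astype(str) + df["month"].astype(str).str.zfill(2)

# Hypothetical list of keys to exclude; extend it as needed.
keys_to_drop = ["199007", "199008", "199101"]

out = df.loc[~yyyymm.isin(keys_to_drop)]
print(out)
```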

Answer 6

Set year and month as the index, filter the index using isin inside a callable, and reset the index:

(
    df.set_index(["year", "month"])
    .loc[lambda x: ~x.index.isin([(1990, 7), (1990, 8), (1991, 1)])]
    .reset_index()
)

   year  month  data1
0  1990      9   2500
1  1990      9   1500
2  1991      2    350
3  1991      3    350
4  1991      7    450

Alternatively, you can use the drop function to remove the rows you are not interested in, and reset the index.
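
The code for the drop variant did not survive on this page, so the following is only a guess at what it could look like, assuming a reasonably recent pandas (dropping tuples from a non-unique MultiIndex has not always worked in older releases). The name pairs_to_drop is made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# Hypothetical name: the (year, month) index entries to remove.
pairs_to_drop = [(1990, 7), (1990, 8), (1991, 1)]

out = (
    df.set_index(["year", "month"])  # (year, month) becomes a MultiIndex
    .drop(pairs_to_drop)             # remove rows labelled with those pairs
    .reset_index()                   # turn year and month back into columns
)
print(out)
```

Note that drop raises a KeyError by default if one of the pairs is not present in the index; passing errors="ignore" avoids that.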

Answer 7

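You can use the apply function:

def filter_row(row):
  # Keep the row unless it is month 7 of 1990; add other conditions here as needed
  if (row.year != 1990) or (row.month != 7):
    return True
  return False

# axis=1 calls filter_row once per row and yields a boolean mask
mask = df.apply(filter_row, axis=1)
df[mask]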
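
The function above only handles a single month/year combination. One way to extend it to several, as a sketch: check the row's (year, month) pair against a set. The names pairs_to_exclude and not_excluded are made up for this example. Since apply with axis=1 runs Python code for every row, expect this to be slow on large frames.

```python
import pandas as pd

df = pd.DataFrame(
    [[1990, 7, 1000], [1990, 8, 2500], [1990, 9, 2500], [1990, 9, 1500],
     [1991, 1, 250], [1991, 2, 350], [1991, 3, 350], [1991, 7, 450]],
    columns=["year", "month", "data1"],
)

# Hypothetical set of (year, month) pairs to exclude.
pairs_to_exclude = {(1990, 7), (1990, 8), (1991, 1)}


def not_excluded(row):
    # True when the row's (year, month) pair is not in the exclusion set.
    return (int(row["year"]), int(row["month"])) not in pairs_to_exclude


mask = df.apply(not_excluded, axis=1)
out = df[mask]
print(out)
```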

Pandas DataFrame: filtering on multiple conditions
