Question

If I have a pandas DataFrame like the following:

      Sequence     Rating
 0    HYHIVQKF     1
 1    YGEIFEKF     2
 2    TYGGSWKF     3
 3    YLESFYKF     4
 4    YYNTAVKL     5
 5    WPDVIHSF     6

This is the code I use to return the rows that match the following pattern: \b.[YF]\w+[LFI]\b

pat = r'\b.[YF]\w+[LFI]\b'
new_df.Sequence.str.contains(pat)

new_df[new_df.Sequence.str.contains(pat)]

The code above returns the rows that match the pattern, but what can I use to return the rows that do not match?

Expected output:

     Sequence  Rating
1    YGEIFEKF   2
3    YLESFYKF   4
5    WPDVIHSF   6

Answer 1

You can use ~ to negate the boolean mask (element-wise NOT):

pat = r'\b.[YF]\w+[LFI]\b'
new_df[~new_df.Sequence.str.contains(pat)]

#   Sequence    Rating
#1  YGEIFEKF    2
#3  YLESFYKF    4
#5  WPDVIHSF    6
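
For anyone who wants to reproduce this, here is a minimal self-contained sketch. It rebuilds the DataFrame from the question by hand; the variable names mask, matched and unmatched are only illustrative and are not part of the original code.

```python
import pandas as pd

# Rebuild the example DataFrame from the question.
new_df = pd.DataFrame({
    "Sequence": ["HYHIVQKF", "YGEIFEKF", "TYGGSWKF",
                 "YLESFYKF", "YYNTAVKL", "WPDVIHSF"],
    "Rating": [1, 2, 3, 4, 5, 6],
})

pat = r"\b.[YF]\w+[LFI]\b"

# Boolean mask: True where the pattern is found in the string.
mask = new_df["Sequence"].str.contains(pat)

matched = new_df[mask]     # rows that match the pattern
unmatched = new_df[~mask]  # rows that do not match (mask inverted with ~)

print(unmatched)
```

Storing the mask in a variable means the regex is evaluated only once, and both the matching and non-matching subsets can be taken from it.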

Answer 2

You can negate the existing boolean Series:

df[~df.Sequence.str.contains(pat)]

This gives you the desired output:

   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6

Brief explanation:

df.Sequence.str.contains(pat)

returns a boolean Series:

0     True
1    False
2     True
3    False
4     True
5    False
Name: Sequence, dtype: bool

Negating it with ~ produces the inverse:

~df.Sequence.str.contains(pat)

0    False
1     True
2    False
3     True
4    False
5     True
Name: Sequence, dtype: bool

This is another boolean Series, which you can pass to the original DataFrame as a mask.
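
One caveat worth sketching: if the column contains missing values, the mask from str.contains can itself contain missing entries, and negating it with ~ may raise an error or behave unexpectedly depending on the pandas version. The na parameter of str.contains sets the value used for missing entries. The DataFrame below, df_with_gap, is a made-up example and is not from the question.

```python
import pandas as pd

# Hypothetical data with a missing Sequence value (not from the question).
df_with_gap = pd.DataFrame({
    "Sequence": ["HYHIVQKF", None, "YGEIFEKF"],
    "Rating": [1, 2, 3],
})

pat = r"\b.[YF]\w+[LFI]\b"

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = df_with_gap["Sequence"].str.contains(pat, na=False)

# After negation, rows with a missing Sequence count as non-matching.
unmatched = df_with_gap[~mask]
print(unmatched)
```

With na=False the missing row ends up among the non-matching rows; passing na=True instead would keep it out of the negated result. Which one is right depends on how missing sequences should be treated.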

Answer 3

Psidom's answer is more elegant, but another way to solve this is to wrap the regex pattern in a negative lookahead assertion and then use match() instead of contains():

pat = r'\b.[YF]\w+[LFI]\b'
not_pat = r'(?!{})'.format(pat)

>>> new_df[new_df.Sequence.str.match(pat)]
   Sequence  Rating
0  HYHIVQKF       1
2  TYGGSWKF       3
4  YYNTAVKL       5

>>> new_df[new_df.Sequence.str.match(not_pat)]
   Sequence  Rating
1  YGEIFEKF       2
3  YLESFYKF       4
5  WPDVIHSF       6

Return rows that do not match a regex pattern
