Question

I have the following pandas DataFrame df (actually just a few rows of a larger one):

                           count
gene                            
WBGene00236788                56
WBGene00236807                 3
WBGene00249816                12
WBGene00249825                20
WBGene00255543                 6
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

I can use filter's regex option to get only the rows that start with two underscores:

df.filter(regex="^__", axis=0)

This returns the following:

                           count
gene                            
__no_feature            11697881
__ambiguous                 1353
__too_low_aQual                0
__not_aligned                  0
__alignment_not_unique         0

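What I actually want is the complement: only the rows that do not start with two underscores.

I can do this with another regular expression: df.filter(regex="^[^_][^_]", axis=0).

Is there a simpler way to specify that I want the inverse of the initial regular expression?

Is this kind of regex-based filtering efficient?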

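Edit: testing some of the proposed solutions

df.filter(regex="(?!^__)", axis=0) and df.filter(regex="^\w+", axis=0) both return all the rows.

According to the re module documentation, the \w special character actually includes the underscore, which explains the behaviour of the second expression.

I guess the first one does not work because (?!...) applies to what follows the pattern. Here, "^" should be placed outside the lookahead, as in the solution proposed below:

df.filter(regex="^(?!__).*?$", axis=0) works.

So does df.filter(regex="^(?!__)", axis=0).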
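
The behaviour of the three patterns can be reproduced with the re module alone. The sketch below assumes that filter tests each label with re.search, so a match anywhere in the label is enough to keep the row; the helper name kept and the short label list are made up for illustration.

```python
import re

# A few index labels taken from the frame in the question.
labels = ["WBGene00236788", "__no_feature", "__ambiguous"]


def kept(pattern, names):
    """Return the names in which re.search finds the pattern anywhere."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


# "(?!^__)" fails at position 0 of "__no_feature", but the search then moves
# on to position 1, where "^" cannot match, so the negative lookahead succeeds.
all_kept = kept(r"(?!^__)", labels)

# "\w" includes the underscore, so every label matches.
word_kept = kept(r"^\w+", labels)

# With "^" outside the lookahead, the only candidate position is the start.
anchored = kept(r"^(?!__)", labels)

print(all_kept)
print(word_kept)
print(anchored)
```

Only the last pattern drops the labels that begin with two underscores.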

Answer 1

To match all rows that do not have two leading underscores:

^(?!__)

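^ matches the start of the line.
(?!__) ensures that the line (whatever follows the position matched by ^) does not start with two underscores.

Edit: removed .*?$, since it is not needed to filter the rows.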
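
As a minimal sketch of this pattern in use, the frame below is a made-up subset of the rows shown in the question, not the asker's full data.

```python
import pandas as pd

# Small stand-in for the frame in the question (a subset of its rows).
df = pd.DataFrame(
    {"count": [56, 3, 11697881, 1353, 0]},
    index=pd.Index(
        ["WBGene00236788", "WBGene00236807", "__no_feature", "__ambiguous", "__not_aligned"],
        name="gene",
    ),
)

# Keep only the rows whose label does not start with two underscores.
genes_only = df.filter(regex=r"^(?!__)", axis=0)
print(genes_only)
```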

Answer 2

I had the same problem, but I wanted to filter columns. So I use axis=1, but the concept should be similar.

df.drop(df.filter(regex='my_expression').columns,axis=1)
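
The same idea can be applied to rows, as in the question. The sketch below uses a made-up subset of the question's data and also shows a boolean mask built with the index's str.startswith accessor, which is an alternative I am adding here and not something from the original answer.

```python
import pandas as pd

# Small stand-in for the frame in the question (a subset of its rows).
df = pd.DataFrame(
    {"count": [56, 3, 11697881, 1353]},
    index=pd.Index(
        ["WBGene00236788", "WBGene00236807", "__no_feature", "__ambiguous"],
        name="gene",
    ),
)

# Select the unwanted rows with the original regex, then drop them by label.
unwanted = df.filter(regex="^__", axis=0).index
by_drop = df.drop(unwanted, axis=0)

# Alternative without a regex: negate a boolean mask over the index labels.
by_mask = df[~df.index.str.startswith("__")]

print(by_drop)
```

Note that drop removes rows by label, so this approach assumes the index labels are unique.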

Answer 3

There are two possibilities here:

(?!^__) # a negative lookahead
        # making sure that there are no underscores right at the beginning of the line

Or:

^\w+  # match word characters (a-z, A-Z, 0-9 and also the underscore) at least once

How to invert a regular expression in the pandas filter function
