Combining lookbehind and negative lookbehind assertions in a Python regular expression

Time: 2018-03-10 22:19:01

Tags: python regex pandas regex-lookarounds negative-lookbehind

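I have a Pandas DataFrame with a column of string data made up of two distinct parts separated by a forward slash. I want to extract a text pattern from the "right-hand side" of the string, but not if a particular string pattern is present. The following simple example illustrates the problem.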

import numpy as np
import pandas as pd
import re

myDF = pd.DataFrame({'pet':['rabbit','mammal/rabbit','mammal/small fluffy rabbit','mammal/lop-eared rabbit','mammal/many rabbits','mammal/jack rabbit']})

So the DataFrame looks like:

                          pet
0                      rabbit
1               mammal/rabbit
2  mammal/small fluffy rabbit
3     mammal/lop-eared rabbit
4         mammal/many rabbits
5          mammal/jack rabbit

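I want to extract the rabbit-related terms, but only if they appear to the right of the / separator and rabbit is not preceded by jack (with or without an intervening space).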

The regex I came up with is:

rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'

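...which I hoped would require a / before any match, but reject the match if jack comes before it. However, it does not work the way I hoped. I have tried many variations without any luck.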

rxStr = '(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))'

rx = re.compile(rxStr,flags=re.I|re.X)

rabbitDF = myDF['pet'].str.extract(rx,expand=True)

myDF = myDF.join(rabbitDF)

print(myDF)

                          pet                bunny
0                      rabbit                  NaN
1               mammal/rabbit               rabbit
2  mammal/small fluffy rabbit  small fluffy rabbit
3     mammal/lop-eared rabbit     lop-eared rabbit
4         mammal/many rabbits         many rabbits
5          mammal/jack rabbit          jack rabbit

In row 0, the regex finds no match because there is no / character. In row 5, however, jack rabbit matches even though rabbit is preceded by jack.

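How can I write a regex that picks out the rabbit terms, but only when they follow a / and are not preceded by jack? Any explanation of why the regex given above fails would also be much appreciated.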

2 Answers:

Answer 0: (score: 3)

Use a lookahead rather than a lookbehind:

myDF.pet.str.extract('(?P<bunny>(?<=/)(?!jack).*rabbit)', expand=True)

                 bunny
0                  NaN
1               rabbit
2  small fluffy rabbit
3     lop-eared rabbit
4          many rabbit
5                  NaN

(               # capture group
    (?<=/)      # lookbehind - forwardslash
    (?!jack)    # negative lookahead - "jack" 
    .*          # match anything
    rabbit      # match "rabbit"
)

Here, the negative lookahead means that the forward slash must not be followed by "jack".
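As for why the question's pattern fails: `(?<=/)` and `(?<!jack)` are both zero-width, so they are tested at the same position, immediately after the slash. The text before that position is `mammal/`, which does not end in `jack`, so the negative lookbehind succeeds and `.*` then consumes `jack rabbit`. The sketch below compares the two patterns using only the standard `re` module. The variable names are illustrative, and `s?` has been added to the lookahead version here so that plurals are captured whole (the pattern above stops at `rabbit`).

```python
import re

# The question's pattern: both lookbehinds are checked at the same position.
original = re.compile(r'(?P<bunny>(?<=/)(?<!jack)(?:.*rabbits?))', flags=re.I)

# Lookahead version: from just after the slash, look forward for "jack".
# The trailing s? is an addition for illustration, not part of the answer's pattern.
lookahead = re.compile(r'(?P<bunny>(?<=/)(?!jack).*rabbits?)', flags=re.I)

text = 'mammal/jack rabbit'

# Right after '/', the preceding characters are 'mal/', not 'jack',
# so (?<!jack) passes and the greedy .* swallows 'jack rabbit'.
m = original.search(text)
original_match = m.group('bunny') if m else None

# Right after '/', the next characters are 'jack', so (?!jack) rejects the match.
m = lookahead.search(text)
lookahead_match = m.group('bunny') if m else None

print(original_match)   # jack rabbit
print(lookahead_match)  # None
```

The same pattern strings can be passed to `str.extract` as in the answer above.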

Answer 1: (score: 3)

In [52]:  myDF['pet'].str.extract(r'/(?P<bunny>(?!jack).*rabbits?.*)',expand=True)
Out[52]:
                 bunny
0                  NaN
1               rabbit
2  small fluffy rabbit
3     lop-eared rabbit
4         many rabbits
5                  NaN

RegEx explained ...
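Both answers reject `jack` only when it comes immediately after the slash. If `jack` might appear later in the right-hand part (for example `mammal/big jack rabbit`) or be fused with `rabbit` (`jackrabbit`), one possible variant is to scan ahead for `jack\s*rabbit` before matching. This is a sketch under that assumption, not something either answer proposes. The pattern, the extra sample rows, and the helper `extract_bunny` are all hypothetical.

```python
import re

# Hypothetical variant: after the slash, refuse to match if "jack", optional
# whitespace, and "rabbit" occur anywhere in the rest of the string.
pattern = re.compile(r'/(?P<bunny>(?!.*jack\s*rabbit).*rabbits?)', flags=re.I)

pets = [
    'rabbit',
    'mammal/rabbit',
    'mammal/small fluffy rabbit',
    'mammal/lop-eared rabbit',
    'mammal/many rabbits',
    'mammal/jack rabbit',
    'mammal/jackrabbit',        # extra row, not in the question's data
    'mammal/big jack rabbit',   # extra row, not in the question's data
]


def extract_bunny(text):
    """Return the captured rabbit term, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group('bunny') if m else None


results = {pet: extract_bunny(pet) for pet in pets}
for pet, bunny in results.items():
    print(f'{pet!r:32} -> {bunny!r}')
```

With pandas, the same pattern string could be passed to `myDF['pet'].str.extract(..., expand=True)`, with unmatched rows coming back as NaN rather than None.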