How to specify repetition in a regex in Python

Asked: 2017-08-19 06:23:23

Tags: python regex repeat

I want to process this string:

rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/.

From that sentence I want to extract the words di/IN jogja/NNP buat/VBT malioboro/NNP. This is my code so far:

import re

def entityExtractPreposition(text):
    text = re.findall(r'([^\s/]*/IN\b[^/]*(?:/(?!IN\b)[^/]*)*/NNP\b)', text)
    return text

text = "rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP ?/."
prepo = entityExtractPreposition(text)
print(prepo)

The result is:

di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP

My expected result is:

di/IN jogja/NNP buat/VBT malioboro/NNP

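I have read in some references that there are rules for limiting repetition (in my case, of /NNP), such as *, + and ?. What is the best way to specify or limit the number of repetitions in a regular expression?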

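2 Answers:

Answer 0: (score: 1)

You need two passes. First find the chunk that runs from /IN to /NNP, then search inside that chunk and keep only what comes up to the second (or n-th) /NNP, for example: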

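import re

def extract(text, n=2):
    try:
        # Pass 1: greedy chunk from the first /IN token to the last /NNP token.
        match = re.search(r'\w+/IN.*\w+/NNP', text).group()
        # Pass 2: keep the chunk up to the end of the n-th /NNP token.
        last_match = list(re.finditer(r'\w+/NNP', match))[:n][-1]
        return match[:last_match.end()]
    except AttributeError:
        # re.search returned None: the text has no /IN ... /NNP chunk.
        return ''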

Example usage and output:

In [36]: extract(text, 1)
Out[36]: 'di/IN jogja/NNP'

In [37]: extract(text, 2)
Out[37]: 'di/IN jogja/NNP buat/VBT malioboro/NNP'

In [38]: extract(text, 3)
Out[38]: 'di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC deket/JJ malioboro/NNP'

In [39]: extract('nothing to see here')
Out[39]: ''
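To see why the two passes work, it can help to look at the intermediate values. The sketch below is only an illustration built on the string from the question; the names `chunk` and `nnp_tokens` are made up for this example and are not part of the answer's code.

```python
import re

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")

# Pass 1: the greedy .* stretches the chunk to the LAST /NNP token.
chunk = re.search(r'\w+/IN.*\w+/NNP', text).group()

# Pass 2: list every /NNP token in the chunk with the offset where it ends.
nnp_tokens = [(m.group(), m.end()) for m in re.finditer(r'\w+/NNP', chunk)]

# Cutting the chunk at the end offset of the second /NNP token
# gives the substring the question asks for.
second_end = nnp_tokens[1][1]
print(chunk[:second_end])
```

Slicing `nnp_tokens` with `[:n][-1]`, as `extract` does, also means that an `n` larger than the number of /NNP tokens simply falls back to the last one.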

Answer 1: (score: 0)

  

From the first /IN up to and including the second /NNP

A pattern that implements this rule:

^.*?\b(\w+\/IN(?:.*?\w+\/NNP\b){2})

^.*?      # Starting from the beginning, thus match only first
\b        # A word boundary
(         # Captured group
\w+\/IN   # One or more word chars, then a slash, then 'IN'
(?:       # A non-captured group
.*?\w+    # Anything, lazily matched, followed by one or more word chars
\/NNP\b   # A slash, then 'NNP', then a word boundary
){2}      # Exactly twice
)         # End of captured group
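As a rough sketch of how this pattern might be used from Python: the code below compiles it with `re.VERBOSE` so the comments can stay inline, and reads the result from capture group 1. The escaped slashes are written as plain `/`, which is equivalent in Python's `re`. The names `PATTERN` and `first_in_to_second_nnp` are hypothetical and chosen only for this example.

```python
import re

# The answer's pattern in verbose form; whitespace and # comments are ignored.
PATTERN = re.compile(r"""
    ^.*?          # lazily skip ahead from the start of the string
    \b            # a word boundary
    (             # captured group
      \w+/IN      # word characters, a slash, then 'IN'
      (?:         # non-capturing group
        .*?\w+    # anything, lazily matched, then word characters
        /NNP\b    # a slash, 'NNP', then a word boundary
      ){2}        # exactly twice
    )
""", re.VERBOSE)

def first_in_to_second_nnp(text):
    """Return the span from the first /IN token through the second /NNP after it."""
    match = PATTERN.search(text)
    return match.group(1) if match else ''

text = ("rl/NNP ada/VBI yg/SC tau/VBT penginapan/NN under/NN 800k/CDP "
        "di/IN jogja/NNP buat/VBT malioboro/NNP +-10/NN org/NN yg/SC "
        "deket/JJ malioboro/NNP ?/.")
print(first_in_to_second_nnp(text))
```

Changing the `{2}` quantifier to another count selects a different number of /NNP tokens; if the text has fewer than that many after the first /IN, the search fails and the function returns an empty string.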
