Question

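How can I match the longest "and chain" in some text?

For example, consider

"The forum had jam and berry and wine along with bread and butter and cheese and milk, even chocolate and pista!"

How can I match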

'jam and berry and wine'

and

'bread and butter and cheese and milk'

without knowing in advance how many items are joined by "and"?

This is what I tried.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

pattern = [{'IS_ASCII': True}, {'LOWER': 'and'}, {'IS_ASCII': True}]
matcher.add("AND_PAT", [pattern])  # spaCy v3 signature; v2 used matcher.add("AND_PAT", None, pattern)

doc = nlp("The forum had jam and berry and wine along with bread and butter and cheese and milk, even chocolate and pista!")

for match_id, start, end in matcher(doc):
    print(doc[start: end].text)

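But this does not do the kind of "lazy" matching I need.

I have looked at the documentation, which mentions the OP key for building rules, but that seems useful only when the same token repeats consecutively.

Also, the match should be somewhat greedy: it should not report a result as soon as an acceptable pattern is found. In the example above, the desired result is not (as my program gives)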

jam and berry
berry and wine

but rather

jam and berry and wine

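This problem can be solved with regular expressions, but I am hoping for a solution that uses spaCy's rule-based matching, preferably without even using the REGEX operator mentioned here.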
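For reference, the overlapping three-token matches shown above can be collapsed into the desired chains by merging any matches that share a token. The following is only a minimal sketch under assumptions: `merge_spans` is a hypothetical helper (not part of spaCy), and the `(start, end)` offsets are written by hand to mimic what `matcher(doc)` would yield for a shorter sentence.

```python
def merge_spans(spans):
    """Merge (start, end) token spans that overlap; `end` is exclusive."""
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Shares at least one token with the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hand-written offsets for the tokens
# "jam and berry and wine along with bread and butter":
# "jam and berry" -> (0, 3), "berry and wine" -> (2, 5), "bread and butter" -> (7, 10)
spans = [(0, 3), (2, 5), (7, 10)]
chains = merge_spans(spans)
print(chains)
```

With a real `doc`, each merged pair could then be turned back into text with `doc[start:end].text`.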

Answer 1

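Try this:

# For every "and" token, collect its index and the indices of its two neighbours
l = [{t.nbor(-1).i, t.i, t.nbor().i} for t in doc if t.text == 'and']
bag = set().union(*l)  # the * operator unpacks the list of sets
# Keep tokens that belong to a chain; replace every other token with a newline
st = " ".join([t.text if t.i in bag else '\n' for t in doc])
# Split on the newlines and drop the empty pieces
result = [part.strip() for part in st.split('\n') if part.strip()]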


# result = ['jam and berry and wine',
# 'bread and butter and cheese and milk',
# 'chocolate and pista']

Note that this assumes neither the first nor the last token is an "and" token.
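If that assumption might not hold, the same idea can be written with an explicit bounds check. This is a rough sketch rather than a drop-in replacement: `and_chains` is a hypothetical helper that works on a plain list of token strings (for a spaCy `doc`, something like `[t.text for t in doc]`), and it compares case-insensitively, which the snippet above does not.

```python
def and_chains(tokens):
    """Return maximal runs of the form 'X and Y and Z ...' from a token list."""
    keep = set()
    for i, tok in enumerate(tokens):
        # Only count an "and" that has a neighbour on both sides.
        if tok.lower() == "and" and 0 < i < len(tokens) - 1:
            keep.update((i - 1, i, i + 1))

    chains, current = [], []
    for i, tok in enumerate(tokens):
        if i in keep:
            current.append(tok)
        elif current:
            # A token outside every chain closes the current run.
            chains.append(" ".join(current))
            current = []
    if current:
        chains.append(" ".join(current))
    return chains


# Leading and trailing "and" tokens are skipped instead of raising an error.
tokens = "and jam and berry and wine along with bread and butter and".split()
chains = and_chains(tokens)
print(chains)
```

As with the snippet above, two chains that sit directly next to each other with no token between them are reported as a single chain.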
