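I wrote a regular expression to tag certain phrase patterns.
The pattern correctly tags a sentence such as:
a = 'The pizza was good but pasta was bad'
and gives the desired output of 2 phrases.
However, if my sentence is something like:
a = 'The pizza was awesome and brilliant'
it matches only the phrase:
'pizza was awesome'
instead of the desired:
'pizza was awesome and brilliant'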
How can I extend the regex pattern so that it also covers my second example?
Answer 0 (score: 15)
First, let's look at the POS tags that NLTK gives:
>>> from nltk import pos_tag
>>> sent = 'The pizza was awesome and brilliant'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
>>> sent = 'The pizza was good but pasta was bad'.split()
>>> pos_tag(sent)
[('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
(Note: the above is the output of pos_tag in NLTK v3.1; older versions may give different tags.)
What you want to capture is essentially:
NN VBD JJ CC JJ
NN VBD JJ
So let's capture them with these patterns:
>>> from nltk import RegexpParser
>>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
>>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
>>> patterns = """
... P: {<NN><VBD><JJ><CC><JJ>}
... {<NN><VBD><JJ>}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
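Well, that is "cheating" by hardcoding!!!
Let's go back to the POS patterns:
NN VBD JJ
NN VBD JJ CC JJ
These can be simplified to:
NN VBD JJ (CC JJ)?
So you can use the optional operator in the regex, e.g.: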
>>> patterns = """
... P: {<NN><VBD><JJ>(<CC><JJ>)?}
... """
>>> PChunker = RegexpParser(patterns)
>>> PChunker.parse(pos_tag(sent1))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
>>> PChunker.parse(pos_tag(sent2))
Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
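To see why the optional group covers both sentences, here is a small sketch of the same idea in plain `re`, applied to the tag sequences only. It is an illustration, not NLTK's actual implementation; the `encode` helper and the `<TAG>` string encoding are assumptions made up for this example.

```python
import re

# Hypothetical helper: turn a list of POS tags into one string like "<DT><NN><VBD>".
def encode(tags):
    return "".join(f"<{t}>" for t in tags)

# Same shape as the chunk rule above, written as an ordinary regex.
# (?:...)? makes the trailing "<CC><JJ>" part optional.
PHRASE = re.compile(r"<NN><VBD><JJ>(?:<CC><JJ>)?")

# Tag sequences taken from the pos_tag output shown earlier.
tags1 = ["DT", "NN", "VBD", "JJ", "CC", "JJ"]               # The pizza was awesome and brilliant
tags2 = ["DT", "NN", "VBD", "JJ", "CC", "NN", "VBD", "JJ"]  # The pizza was good but pasta was bad

matches1 = PHRASE.findall(encode(tags1))
matches2 = PHRASE.findall(encode(tags2))
print(matches1)  # one match that includes the optional <CC><JJ> tail
print(matches2)  # two matches, each without the tail
```

In the second sentence the `CC` is followed by `NN`, not `JJ`, so the optional part simply does not match and the phrase stops after the first adjective.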
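Most likely you are using an older tagger, which is why your patterns look different, but I think the example above shows how to capture the phrases you need.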
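The steps are:
1. Check the POS tags of your sentences with pos_tag
2. Generalize the tag patterns and simplify them
3. Put them into RegexpParser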
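The steps above can be sketched end to end roughly as follows. This is a minimal sketch, not a definitive implementation: the tagged tokens are hardcoded from the NLTK v3.1 output shown earlier (so it does not need the tagger data installed), and `extract_phrases` is a hypothetical helper name, not part of NLTK.

```python
from nltk import RegexpParser

# Step 1: tagged tokens, copied from the pos_tag output above (assumed NLTK v3.1 tags).
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

# Steps 2 and 3: the simplified pattern, handed to RegexpParser.
chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")

def extract_phrases(tagged):
    """Return each chunk labelled 'P' as a plain string (hypothetical helper)."""
    tree = chunker.parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "P")]

phrases1 = extract_phrases(tagged1)
phrases2 = extract_phrases(tagged2)
print(phrases1)
print(phrases2)
```

With real input you would replace the hardcoded lists with `pos_tag(sentence.split())`, keeping in mind that the tags may differ between NLTK versions.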