Question

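I am using textacy's `pos_regex_matches` method to find certain chunks of text in sentences.

For example, given the text `Huey, Dewey, and Louie are triplet cartoon characters.`, I would like to detect that `Huey, Dewey, and Louie` is an enumeration.

To do so, I use the following code (on textacy 0.3.4, the version available at the time of writing):

import textacy

sentence = 'Huey, Dewey, and Louie are triplet cartoon characters.'
# One or more proper nouns, optionally followed by separator(s) and more proper nouns
pattern = r'<PROPN>+ (<PUNCT|CCONJ> <PUNCT|CCONJ>? <PROPN>+)*'
doc = textacy.Doc(sentence, lang='en')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:
    print(match.text)

which prints:

Huey, Dewey, and Louie

However, if I have something like the following:

sentence = 'Donald Duck - Disney'

then the `-` (dash) is recognised as `<PUNCT>` and the whole sentence is recognised as a list, which it is not.

Is there a way to specify that only `,` and `;` are valid `<PUNCT>` tokens in a list?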

I have looked for a reference on this regex language for matching PoS tags, with no luck. Can anybody help? Thanks in advance!

Answer 1

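In short, it is not possible: see this official page.

However, the merge request contains the code of the modified version described on that page, so the functionality can be recreated, although it performs worse than spaCy's `Matcher` (see code and example, though I have no idea how to reimplement my problem using a `Matcher`).

If you want to go down this path anyway, you have to change this line:

words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))

to the following:

words.extend(keyword_map[w])

Otherwise every symbol (such as `,` and `;` in my case) will be stripped.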
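To see why the original line loses the separators, here is a minimal sketch of the two variants side by side. The contents of `keyword_map` and the helper `collect_words` are hypothetical stand-ins made up for illustration; the real structure in the merge request's code may differ. Only the two `words.extend(...)` lines are taken from the answer.

```python
import re

# Hypothetical stand-in for the keyword_map in the merge request's code;
# the real mapping may be shaped differently.
keyword_map = {"LISTSEP": [",", ";"], "CONJ": ["and", "or"]}


def collect_words(keys, strip_symbols):
    """Collect the keywords for the given keys, using either variant of the line."""
    words = []
    for w in keys:
        if strip_symbols:
            # Original line: removes every non-word character from each keyword
            words.extend(map(lambda x: re.sub(r'\W', '', x), keyword_map[w]))
        else:
            # Modified line: keeps the keywords untouched
            words.extend(keyword_map[w])
    return words


stripped = collect_words(["LISTSEP", "CONJ"], strip_symbols=True)
kept = collect_words(["LISTSEP", "CONJ"], strip_symbols=False)
print(stripped)  # ['', '', 'and', 'or']
print(kept)      # [',', ';', 'and', 'or']
```

Because `\W` matches any character that is not a letter, digit, or underscore, `,` and `;` are each reduced to an empty string, so they can no longer be matched as text.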

Match PoS tag with specific text with `textacy.extract.pos_regex_matches(...)`
