Question

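This sentence is from Simple English Wikipedia:

There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%).

In spaCy 2.0 and 2.1, the parenthesized percentages are not handled well. What is the best way to deal with this kind of problem?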

Here is the visualization:

[Image: visualization]

Answer 1

Use a regular expression together with spaCy's merge / retokenize methods to merge each parenthesized group into a single token.

>>> import spacy
>>> import re
>>> my_str = "There are three things in air, Nitrogen (79%), oxygen (20%), and other types of gases (1%)."
>>> nlp = spacy.load('en')
>>> parsed = nlp(my_str)
>>> [(x.text,x.pos_) for x in parsed]
[('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(', 'PUNCT'), ('79', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(', 'PUNCT'), ('20', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(', 'PUNCT'), ('1', 'NUM'), ('%', 'NOUN'), (')', 'PUNCT'), ('.', 'PUNCT')]

>>> indexes = [m.span() for m in re.finditer(r'\([\w%]{0,5}\)', my_str, flags=re.IGNORECASE)]
>>> indexes
[(40, 45), (54, 59), (86, 90)]
>>> for start,end in indexes:
...     parsed.merge(start_idx=start,end_idx=end)
...
(79%)
(20%)
(1%)
>>> [(x.text,x.pos_) for x in parsed]
[('There', 'ADV'), ('are', 'VERB'), ('three', 'NUM'), ('things', 'NOUN'), ('in', 'ADP'), ('air', 'NOUN'), (',', 'PUNCT'), ('Nitrogen', 'PROPN'), ('(79%)', 'PUNCT'), (',', 'PUNCT'), ('oxygen', 'NOUN'), ('(20%)', 'PUNCT'), (',', 'PUNCT'), ('and', 'CCONJ'), ('other', 'ADJ'), ('types', 'NOUN'), ('of', 'ADP'), ('gases', 'NOUN'), ('(1%)', 'PUNCT'), ('.', 'PUNCT')]
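
Note that `Doc.merge` was deprecated in spaCy 2.1 in favor of the `Doc.retokenize` context manager. The same idea can be sketched with the newer API as below. This is a minimal sketch, not the answer author's code: it assumes a blank English pipeline (tokenizer only) so that no model download is needed, and it reuses the regular expression from above.

```python
import re
import spacy

# Assumption: a blank English pipeline (tokenizer only). Swap in
# spacy.load(...) with your own model if you need tags and parses.
nlp = spacy.blank("en")

text = ("There are three things in air, Nitrogen (79%), oxygen (20%), "
        "and other types of gases (1%).")
doc = nlp(text)

# Character offsets of short parenthesized groups such as "(79%)".
pattern = re.compile(r"\([\w%]{0,5}\)")
char_spans = [m.span() for m in pattern.finditer(text)]

# Merges are queued inside the context manager and applied on exit.
with doc.retokenize() as retokenizer:
    for start, end in char_spans:
        span = doc.char_span(start, end)
        # char_span returns None if the offsets do not align with token boundaries.
        if span is not None:
            retokenizer.merge(span)

tokens = [token.text for token in doc]
print(tokens)
```

With a trained pipeline, `retokenizer.merge` also accepts an `attrs` argument if the merged token should get a specific tag rather than an inherited one.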

Answer 2

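I originally wrote an answer on the issue tracker, but Stack Overflow is definitely a better place for this kind of question.

I just tested your example with the latest version, and the tokenization looks like this: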

['There', 'are', 'three', 'things', 'in', 'air', ',', 'Nitrogen', '(', '79', '%', ')', ',', 
'oxygen', '(', '20', '%', ')', ',', 'and', 'other', 'types', 'of', 'gases', '(', '1', '%', ')', '.']

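Here is the parse tree, which looks decent to me. (If you want to try this yourself, note that I set options={'collapse_punct': False, 'compact': True} to show all punctuation separately and to make the large tree easier to read.)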

[Image: displaCy visualization of the parse tree]

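That said, you will probably still find plenty of edge cases: examples where the out-of-the-box tokenization rules do not generalize to every combination of punctuation and parentheses, or where the pre-trained parser or tagger makes a wrong prediction. So if you are dealing with longer insertions in parentheses and the parser struggles with them, you may want to fine-tune it with more examples of that kind.

Looking at single sentences in isolation is not very helpful, because it does not give you a good idea of the overall accuracy on your data or of what to focus on. Even if you train a fancy state-of-the-art model that reaches 90% accuracy on your data, that still means one in every 10 predictions it makes is wrong.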
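
To make that point concrete, here is a small sketch of measuring accuracy over a whole evaluation set rather than eyeballing one sentence. Everything in it is hypothetical: the `token_accuracy` helper and the gold/predicted tag lists are made up for illustration and do not come from spaCy or from real model output.

```python
from typing import List, Tuple


def token_accuracy(pairs: List[Tuple[List[str], List[str]]]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    correct = 0
    total = 0
    for gold, predicted in pairs:
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sequences must align")
        correct += sum(g == p for g, p in zip(gold, predicted))
        total += len(gold)
    return correct / total if total else 0.0


# Hypothetical evaluation data: (gold tags, predicted tags) per sentence.
eval_pairs = [
    (["PROPN", "PUNCT", "NUM", "NOUN", "PUNCT"],
     ["PROPN", "PUNCT", "NUM", "NOUN", "PUNCT"]),
    (["NOUN", "PUNCT", "NUM", "NOUN", "PUNCT"],
     ["NOUN", "PUNCT", "NUM", "SYM", "PUNCT"]),
]

accuracy = token_accuracy(eval_pairs)
errors_per_thousand = round((1 - accuracy) * 1000)
print(f"accuracy: {accuracy:.0%}")
print(f"expected errors per 1000 predictions: {errors_per_thousand}")
```

Run over a representative sample of your own data, a number like this tells you far more about where the model needs work than any single parse does.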

Parsing parentheses with the English model
