Question

When I try the Toktok word tokenizer from NLTK in Python 3:

string='&& Test & and L&R '
from nltk.tokenize.toktok import ToktokTokenizer
ToktokTokenizer().tokenize(string)

I get the following output:

['&&amp;', 'Test', '&amp;', 'and', 'L&R']

It looks like & is being escaped in a strange way. I am using NLTK 3.3 and Python 3.6.4.

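Does anyone know why this happens and how to fix it properly? I know I can post-process the tokens with

[tok.replace("&amp;","&") for tok in tokenized_sentence]

but that seems like a dirty hack. I would like to know whether there is a way to avoid producing this in the first place.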

Answer 1

As @snakecharmerb pointed out, the source says this about &:

# Replace problematic character with numeric character reference.
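To see where the '&amp;' comes from, here is a minimal standard-library sketch of the mechanism. It is not NLTK's code: RULES and apply_rules are made-up names, and the single rule is an assumption modelled on the '& ' to '&amp; ' substitution that the workaround in this answer targets. The idea is only that an ordered list of regex substitutions is applied before splitting on whitespace.

```python
import re

# Simplified stand-in for the tokenizer's rule list (an assumption, not NLTK's
# actual list): ordered (compiled pattern, replacement) pairs.
RULES = [(re.compile("& "), "&amp; ")]


def apply_rules(text, rules):
    """Apply each substitution in order, then strip and split on whitespace."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.strip().split()


string = '&& Test & and L&R '

# With the ampersand rule, every '&' followed by a space is escaped.
escaped = apply_rules(string, RULES)
# Without it, the ampersands pass through untouched.
unescaped = apply_rules(string, [])

print(escaped)
print(unescaped)
```

Note that 'L&R' is left alone in both cases because the pattern only matches an ampersand followed by a space.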

One way to work around this is to override the relevant fields on a ToktokTokenizer instance, for example:

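import re

from nltk.tokenize.toktok import ToktokTokenizer

string = '&& Test & and L&R '

tokenizer = ToktokTokenizer()
# Replace the ampersand rule with a no-op: '& ' is substituted by itself.
tokenizer.AMPERCENT = re.compile('& '), '& '
# Rebuild the rule list on this instance only, swapping out the rule whose
# replacement is '&amp; '; every other rule is kept unchanged and in order.
tokenizer.TOKTOK_REGEXES = [(regex, sub) if sub != '&amp; ' else (re.compile('& '), '& ') for (regex, sub) in
                            ToktokTokenizer.TOKTOK_REGEXES]

result = tokenizer.tokenize(string)
print(result)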

Output

['&&', 'Test', '&', 'and', 'L&R']
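If patching the rule list feels too invasive, a post-processing pass is another option. The sketch below assumes the tokens have already been produced (the list is hard-coded from the output in the question, so NLTK is not needed to run it) and uses the standard library's html.unescape instead of a hand-written replace. The helper name unescape_tokens is made up for illustration. One caveat: it would also decode entity-like text that was literally present in the input, so it is not a strict inverse of the tokenizer's escaping.

```python
import html


def unescape_tokens(tokens):
    """Decode HTML character references (e.g. '&amp;') in each token."""
    return [html.unescape(tok) for tok in tokens]


# Token list copied from the question's output.
tokens = ['&&amp;', 'Test', '&amp;', 'and', 'L&R']
cleaned = unescape_tokens(tokens)
print(cleaned)
```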
