Question

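I am working on a simple grammar-based parser. For this I first need to tokenize the input. Many cities appear in my texts (e.g., New York, San Francisco, etc.). When I just use the standard NLTK word_tokenize, all of these city names are split into separate tokens.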

from nltk import word_tokenize
word_tokenize('What are we going to do in San Francisco?')

Current output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San', 'Francisco', '?']

Desired output:

['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

How can I tokenize such sentences without splitting named entities?

Answer 1

Identify the named entities, then walk the result and join the tokens of each chunk back together:

>>> from nltk import ne_chunk, pos_tag, word_tokenize
>>> toks = word_tokenize('What are we going to do in San Francisco?')
>>> chunks = ne_chunk(pos_tag(toks))
>>> [ w[0] if isinstance(w, tuple) else " ".join(t[0] for t in w) for w in chunks ]
['What', 'are', 'we', 'going', 'to', 'do', 'in', 'San Francisco', '?']

Each element of chunks is either a (word, pos) tuple or a Tree() containing the parts of a chunk.
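The list comprehension above can also be written as a small helper, which may be easier to read and reuse. This is only a minimal sketch: join_chunks and its sep parameter are names made up for illustration, and plain nested lists stand in for the Tree objects so the example runs without NLTK's tagger and chunker models. It assumes the chunks are flat, i.e. each chunk contains only (word, pos) tuples.

```python
def join_chunks(chunks, sep=" "):
    """Merge each multi-token chunk into one token; pass single tokens through.

    Hypothetical helper, not part of NLTK.
    """
    tokens = []
    for node in chunks:
        if isinstance(node, tuple):
            # A plain (word, pos) pair outside any named entity.
            tokens.append(node[0])
        else:
            # A chunk: join the words of its (word, pos) members.
            tokens.append(sep.join(word for word, _pos in node))
    return tokens


# Nested lists stand in for the Tree objects that ne_chunk returns.
sample = [
    ("do", "VB"),
    ("in", "IN"),
    [("San", "NNP"), ("Francisco", "NNP")],
    ("?", "."),
]
print(join_chunks(sample))  # ['do', 'in', 'San Francisco', '?']
```

With real NLTK output, join_chunks(ne_chunk(pos_tag(toks))) should behave like the one-liner, since iterating over a Tree yields its children.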
