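How can I prevent spaCy's tokenizer from splitting a specific substring when tokenizing a string?
More specifically, I have this sentence:
Once unregistered, the folder went away from the shell.
which spaCy 1.6.0 tokenizes as [Once / unregistered / , / the / folder / went / away / from / the / she / ll / .]. I do not want the substring shell to be cut into two separate tokens, she and ll.
Here is the code I use: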
# To install spacy:
# sudo pip install spacy
# sudo python -m spacy.en.download parser # will take 0.5 GB
import spacy
nlp = spacy.load('en')
# https://spacy.io/docs/usage/processing-text
document = nlp(u'Once unregistered, the folder went away from the shell.')
for token in document:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'.
          format(token.idx, token.text, token.i, token.pos_))
Output:
token.i: 0 token.idx: 0 token.pos: ADV token.text: Once
token.i: 1 token.idx: 5 token.pos: ADJ token.text: unregistered
token.i: 2 token.idx: 17 token.pos: PUNCT token.text: ,
token.i: 3 token.idx: 19 token.pos: DET token.text: the
token.i: 4 token.idx: 23 token.pos: NOUN token.text: folder
token.i: 5 token.idx: 30 token.pos: VERB token.text: went
token.i: 6 token.idx: 35 token.pos: ADV token.text: away
token.i: 7 token.idx: 40 token.pos: ADP token.text: from
token.i: 8 token.idx: 45 token.pos: DET token.text: the
token.i: 9 token.idx: 49 token.pos: PRON token.text: she
token.i: 10 token.idx: 52 token.pos: VERB token.text: ll
token.i: 11 token.idx: 54 token.pos: PUNCT token.text: .
Answer 0 (score: 2)
spaCy lets you add exceptions to the tokenizer.
An exception that prevents the tokenizer from splitting the string shell can be added with nlp.tokenizer.add_special_case, as follows:
import spacy
from spacy.symbols import ORTH, LEMMA, POS
nlp = spacy.load('en')
nlp.tokenizer.add_special_case(u'shell',
    [
        {
            ORTH: u'shell',
            LEMMA: u'shell',
            POS: u'NOUN'}
    ])
# https://spacy.io/docs/usage/processing-text
document = nlp(u'Once unregistered, the folder went away from the shell.')
for token in document:
    print('token.i: {2}\ttoken.idx: {0}\ttoken.pos: {3:10}token.text: {1}'.
          format(token.idx, token.text, token.i, token.pos_))
Output:
token.i: 0 token.idx: 0 token.pos: ADV token.text: Once
token.i: 1 token.idx: 5 token.pos: ADJ token.text: unregistered
token.i: 2 token.idx: 17 token.pos: PUNCT token.text: ,
token.i: 3 token.idx: 19 token.pos: DET token.text: the
token.i: 4 token.idx: 23 token.pos: NOUN token.text: folder
token.i: 5 token.idx: 30 token.pos: VERB token.text: went
token.i: 6 token.idx: 35 token.pos: ADV token.text: away
token.i: 7 token.idx: 40 token.pos: ADP token.text: from
token.i: 8 token.idx: 45 token.pos: DET token.text: the
token.i: 9 token.idx: 49 token.pos: NOUN token.text: shell
token.i: 10 token.idx: 54 token.pos: PUNCT token.text: .
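For intuition on why registering a special case changes the result, here is a minimal pure-Python sketch of an exception-table lookup. It is not spaCy's implementation: the EXCEPTIONS table, tokenize, and add_special_case below are hypothetical names made up for illustration, and the assumption that shell is split because it is listed as an apostrophe-less variant of she'll is my guess at the cause, not something confirmed above.

```python
import re

# Hypothetical exception table: whole chunk -> the pieces it is split into.
# The "shell" entry mimics an apostrophe-less variant of "she'll" (assumption).
EXCEPTIONS = {
    "she'll": ["she", "'ll"],
    "shell": ["she", "ll"],
}


def add_special_case(exceptions, chunk, pieces):
    """Register (or overwrite) the pieces that `chunk` is tokenized into."""
    if "".join(pieces) != chunk:
        raise ValueError("pieces must concatenate back to the original chunk")
    exceptions[chunk] = list(pieces)


def tokenize(text, exceptions):
    """Split on whitespace, peel trailing punctuation, then apply exceptions."""
    tokens = []
    for chunk in text.split():
        # Separate the word from any trailing punctuation characters.
        match = re.match(r"^(.*?)([.,!?]*)$", chunk)
        word, trailing = match.group(1), match.group(2)
        if word:
            # An exception entry wins; otherwise the word stays whole.
            tokens.extend(exceptions.get(word, [word]))
        # Each trailing punctuation character becomes its own token.
        tokens.extend(trailing)
    return tokens


text = "Once unregistered, the folder went away from the shell."

# With the default table, "shell" is cut into "she" and "ll".
before = tokenize(text, EXCEPTIONS)

# Overwrite the entry so that "shell" maps to itself as a single token.
custom = dict(EXCEPTIONS)
add_special_case(custom, "shell", ["shell"])
after = tokenize(text, custom)

print(before)
print(after)
```

The point of the sketch is only that the special case replaces the existing entry for the exact string shell, so other entries such as she'll keep their usual split.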