Tokenization with spaCy: how to get the left and right tokens

Time: 2019-05-23 12:40:42

Tags: python nlp token spacy

I am using spaCy to tokenize text and have gotten stuck:

import spacy
nlp = spacy.load("en_core_web_sm")
mytext = "This is some sentence that spacy will not appreciate"
doc = nlp(mytext)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

This returns output that, as far as I can tell, shows the tokenization succeeded:

This this DET DT nsubj Xxxx True False 
is be VERB VBZ ROOT xx True True 
some some DET DT det xxxx True True 
sentence sentence NOUN NN attr xxxx True False 
that that ADP IN mark xxxx True True 
spacy spacy NOUN NN nsubj xxxx True False 
will will VERB MD aux xxxx True True 
not not ADV RB neg xxx True True 
appreciate appreciate VERB VB ccomp xxxx True False

On the other hand,

[token.text for token in doc[2].lefts]

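returns an empty list. Is there a bug in lefts/rights?

I'm a beginner in natural language processing, so I hope I'm not falling into a conceptual trap. I'm using spaCy v2.0.4.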

1 answer:

Answer 0 (score: 1)

The dependencies in this sentence look like this:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"This is some sentence that spacy will not appreciate")
for token in doc:
    # Print each token with its dependency label, its head, and its direct children
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          list(token.children))

This nsubj is VERB []
is ROOT is VERB [This, sentence]
some det sentence NOUN []
sentence attr is VERB [some, appreciate]
that mark appreciate VERB []
spacy nsubj appreciate VERB []
will aux appreciate VERB []
not neg appreciate VERB []
appreciate relcl sentence NOUN [that, spacy, will, not]

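So we see that doc[2] ("some") has no children, which is why its lefts and rights are empty. "is" (doc[1]), however, does have children. If we instead run...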

print([token.text for token in doc[1].lefts])  
print([token.text for token in doc[1].rights])  

we get...

['This']
['sentence']

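The attributes you are using navigate the dependency tree, not the document, which is why you get empty results for some words.

If you just want the previous and next tokens, you can do something like this...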
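
To make the distinction concrete, here is a minimal sketch in plain Python that does not use spaCy at all. It assumes a simplified model of the parse: a list of words plus, for each word, the index of its head, copied by hand from the dependency listing above. The names `words`, `heads`, `lefts`, `rights` and `previous_word` are hypothetical helpers for illustration, not spaCy's implementation.

```python
# Hypothetical stand-in for a parsed sentence: each word stores the index of its head.
# The head indices are copied from the dependency listing above; the ROOT points to itself.
words = ["This", "is", "some", "sentence", "that", "spacy", "will", "not", "appreciate"]
heads = [1, 1, 3, 1, 8, 8, 8, 8, 3]


def lefts(i):
    """Words to the left of position i whose head is word i (its left dependents)."""
    return [words[j] for j in range(i) if heads[j] == i]


def rights(i):
    """Words to the right of position i whose head is word i (its right dependents)."""
    return [words[j] for j in range(i + 1, len(words)) if heads[j] == i]


def previous_word(i):
    """Linear neighbour in the text, independent of the parse; None at the start."""
    return words[i - 1] if i > 0 else None


print(lefts(2), rights(2))   # "some" has no dependents in this parse, so both are empty
print(lefts(1), rights(1))   # "is" governs "This" on the left and "sentence" on the right
print(previous_word(2))      # the word that precedes "some" in the text
```

In this model, an empty result from `lefts` only says that no word to the left depends on the given word; it says nothing about which word precedes it in the text.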

last = len(doc) - 1
for ix, token in enumerate(doc):
    # Use an empty string where there is no neighbour (start or end of the document);
    # this also avoids an IndexError on a single-token document.
    prev_token = doc[ix - 1] if ix > 0 else ''
    next_token = doc[ix + 1] if ix < last else ''
    print('Previous: %s, Current: %s, Next: %s' % (prev_token, token, next_token))

Or, to get everything before and after each token...

for ix, token in enumerate(doc):
    print('Previous: %s' % doc[:ix])        # all tokens before the current one
    print('Current: %s' % doc[ix])
    print('Following: %s' % doc[ix + 1:])   # start at ix + 1 so the current token is excluded
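
The previous/next lookup can also be wrapped in a small reusable helper. The sketch below is one possible way to do it; `with_neighbours` is a hypothetical name, and it is shown on a plain list of strings so that it runs without spaCy. Since it only calls `list()` on its argument, passing a spaCy `Doc` should work as well, on the assumption that iterating over a `Doc` yields its tokens in order.

```python
def with_neighbours(tokens):
    """Return (previous, current, next) triples; None marks a missing neighbour."""
    items = list(tokens)
    # Pad both ends so the first and last items still get a full triple.
    padded = [None] + items + [None]
    # Zip three views of the padded list, each shifted by one position.
    return list(zip(padded, padded[1:], padded[2:]))


for prev_tok, tok, next_tok in with_neighbours(["This", "is", "some", "sentence"]):
    print('Previous: %s, Current: %s, Next: %s' % (prev_tok, tok, next_tok))
```

Using None instead of an empty string for a missing neighbour keeps "no token here" distinguishable from a token whose text happens to be empty.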