Is it possible to use spaCy with already tokenized input?

Date: 2018-12-03 13:17:34

Tags: python nlp spacy

I have a sentence that has already been tokenized into words, and I want to get the part-of-speech tag for each word. Checking the spaCy documentation, I see that the pipeline starts from the raw sentence. I want to avoid that, because spaCy might end up with a different tokenization. So is it possible to use spaCy with a list of words (rather than a string)?

Here is an example of what I mean:

# I know that it does the following successfully:
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text = 'Hello, world.'
doc = nlp(raw_text)
for token in doc:
    print(token.pos_)

But I want to do something similar to the following:

import spacy
nlp = spacy.load('en_core_web_sm')
tokenized_text = ['Hello', ',', 'world', '.']
doc = nlp(tokenized_text)
for token in doc:
    print(token.pos_)

I know it doesn't work, but is it possible to do something similar to that?

2 Answers:

Answer 0 (score: 4):

You can do this by replacing spaCy's default tokenizer with your own:

nlp.tokenizer = custom_tokenizer

Here custom_tokenizer is a function that takes the raw text as input and returns a Doc object.

You didn't specify how you obtain the list of tokens. If you already have a function that takes raw text and returns a list of tokens, just make one small change to it:

from spacy.tokens import Doc

def custom_tokenizer(text):
    tokens = []

    # your existing code to fill the list with tokens

    # replace this line:
    return tokens

    # with this:
    return Doc(nlp.vocab, words=tokens)

See the documentation on Doc.
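One detail worth noting (my addition, not part of the original answer): the Doc constructor also accepts a spaces argument, a list of booleans marking whether each token is followed by whitespace. Without it, spaCy assumes a trailing space after every token, so doc.text may not reproduce the original string exactly. A minimal sketch:

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

# spaces[i] says whether token i is followed by whitespace; omitting it
# makes spaCy assume a space after every token.
doc = Doc(nlp.vocab, words=['Hello', ',', 'world', '.'],
          spaces=[False, True, False, False])
print(doc.text)  # Hello, world.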

If for some reason you can't do that (maybe you don't have access to the tokenization function), you can use a dictionary:

tokens_dict = {'Hello, world.': ['Hello', ',', 'world', '.']}

def custom_tokenizer(text):
    if text in tokens_dict:
        return Doc(nlp.vocab, tokens_dict[text])
    else:
        raise ValueError('No tokenization available for input.')
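Note that this lookup requires the incoming text to match a dictionary key exactly, including punctuation and whitespace, so it only makes sense when the set of input strings is known in advance.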

Either way, you can then use the pipeline just as in the first example:

doc = nlp('Hello, world.')
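Putting it together, here is a minimal end-to-end sketch of the first approach (the whitespace-split tokenizer is a hypothetical stand-in for whatever actually produces your token list):

import spacy
from spacy.tokens import Doc

nlp = spacy.load('en_core_web_sm')

def custom_tokenizer(text):
    # Hypothetical stand-in: naive whitespace split; replace this with
    # your real tokenization function.
    tokens = text.split(' ')
    return Doc(nlp.vocab, words=tokens)

nlp.tokenizer = custom_tokenizer
doc = nlp('Hello , world .')
for token in doc:
    print(token.text, token.pos_)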

Answer 1 (score: 0):

If the tokenized text is not constant, another option is to skip the tokenizer entirely:

from spacy.tokens import Doc

# Build the Doc directly from the pre-tokenized words (bypassing the
# tokenizer), then run each pipeline component on it in order.
spacy_doc = Doc(nlp.vocab, words=tokenized_text)
for name, proc in nlp.pipeline:
    proc(spacy_doc)
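After the loop runs, spacy_doc carries the annotations from the tagger and the other pipeline components, so you can read off the tags just like in the first example (a quick check, not in the original answer):

for token in spacy_doc:
    print(token.text, token.pos_)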