This is a very specific question about a very specific case. I hope Stack Overflow is still the right place to ask it.
The results of using nlp.pipe(docs) and nlp(doc) seem to differ. When I use the former instead of the latter, in some cases I do not get vectors. Here is what I noticed:
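- The classic model (en_core_web_sm), used with the tagger and parser disabled, gives me word vectors with nlp(doc) but not with nlp.pipe(docs).
- The en_core_web_lg model produces vectors in both cases.
Can someone explain this to me? In particular, why does the tagger seem to have an effect on this?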
Code:
import spacy
documents = [
"spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.",
"If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?",
"spaCy is designed specifically for production use and helps you build applications that process and \"understand\" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning."
]
nlp = spacy.load('en', disable=['tagger', 'parser'])
print('classic model, disabled tagger and parser, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
print('classic model, disabled tagger and parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
nlp = spacy.load('en', disable=['parser'])
print('classic model, disabled only parser, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
print('classic model, disabled only parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
nlp = spacy.load('en', disable=['tagger'])
print('classic model, disabled only tagger, without pipe')
for i, doc in enumerate(documents):
    spacy_doc = nlp(doc)
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
print('classic model, disabled only tagger, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
nlp = spacy.load('en_core_web_lg', disable=['tagger', 'parser'])
print('model en_core_web_lg, disabled tagger and parser, with pipe')
for i, spacy_doc in enumerate(nlp.pipe(documents)):
    print(f'document {i} has_vector: {spacy_doc.has_vector}, is_parsed: {spacy_doc.is_parsed}, token_0_pos: {spacy_doc[0].pos_}')
Output:
classic model, disabled tagger and parser, without pipe
document 0 has_vector: True, is_parsed: False, token_0_pos:
document 1 has_vector: True, is_parsed: False, token_0_pos:
document 2 has_vector: True, is_parsed: False, token_0_pos:
classic model, disabled tagger and parser, with pipe
document 0 has_vector: False, is_parsed: False, token_0_pos:
document 1 has_vector: False, is_parsed: False, token_0_pos:
document 2 has_vector: False, is_parsed: False, token_0_pos:
classic model, disabled only parser, without pipe
document 0 has_vector: True, is_parsed: False, token_0_pos: ADJ
document 1 has_vector: True, is_parsed: False, token_0_pos: ADP
document 2 has_vector: True, is_parsed: False, token_0_pos: ADJ
classic model, disabled only parser, with pipe
document 0 has_vector: True, is_parsed: False, token_0_pos: ADJ
document 1 has_vector: True, is_parsed: False, token_0_pos: ADP
document 2 has_vector: True, is_parsed: False, token_0_pos: ADJ
classic model, disabled only tagger, without pipe
document 0 has_vector: True, is_parsed: True, token_0_pos:
document 1 has_vector: True, is_parsed: True, token_0_pos:
document 2 has_vector: True, is_parsed: True, token_0_pos:
classic model, disabled only tagger, with pipe
document 0 has_vector: False, is_parsed: True, token_0_pos:
document 1 has_vector: False, is_parsed: True, token_0_pos:
document 2 has_vector: False, is_parsed: True, token_0_pos:
model en_core_web_lg, disabled tagger and parser, with pipe
document 0 has_vector: True, is_parsed: False, token_0_pos:
document 1 has_vector: True, is_parsed: False, token_0_pos:
document 2 has_vector: True, is_parsed: False, token_0_pos:
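For anyone who wants to reproduce the comparison more compactly, here is a minimal sketch of a side-by-side check. The helper name compare_call_and_pipe is made up for illustration; it only assumes that the object passed in is callable on a single text, has a pipe method that accepts an iterable of texts, and that the returned docs expose has_vector (as spaCy's Language and Doc do).

```python
def compare_call_and_pipe(nlp, texts):
    """Compare has_vector between nlp(text) and nlp.pipe(texts).

    Hypothetical helper: `nlp` is assumed to behave like a loaded spaCy
    pipeline (callable, with a .pipe method), and each returned doc is
    assumed to expose a boolean .has_vector attribute.

    Returns a list of (index, has_vector_single, has_vector_piped) tuples.
    """
    texts = list(texts)  # materialise so both passes see the same inputs
    # One document at a time, through nlp(text).
    single = [nlp(text).has_vector for text in texts]
    # The same documents, streamed through nlp.pipe(texts).
    piped = [doc.has_vector for doc in nlp.pipe(texts)]
    return [(i, s, p) for i, (s, p) in enumerate(zip(single, piped))]
```

With the setup from the question this could be called as compare_call_and_pipe(spacy.load('en', disable=['tagger', 'parser']), documents); any row whose second and third values differ is a document where the two code paths disagree.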