Question

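I would like to know how to use the pretrained transformer model en_trf_bertbaseuncased_lg from spaCy for downstream NLP tasks (NER, POS, etc.). The documentation states that the model only ships with the following pipeline components (https://spacy.io/models/en#en_trf_bertbaseuncased_lg):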

sentencizer
trf_wordpiecer
trf_tok2vec

Can anyone explain what these components do and which tasks they can be used for? Or does anyone know a good resource to read about them?

>>> import spacy
>>> nlp = spacy.load("en_trf_bertbaseuncased_lg")
>>> nlp.pipe_names
['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']

Answer 1

spaCy's Language Processing Pipelines documentation should point you to the information you need.

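Sentencizer: a simple pipeline component that allows custom sentence boundary detection logic that does not require the dependency parse. By default, sentence segmentation is performed by the DependencyParser, so the Sentencizer lets you implement a simpler, rule-based strategy that does not require a statistical model to be loaded. The component is also available via the string name "sentencizer". After initialization, it is typically added to the processing pipeline using nlp.add_pipe.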
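To make the "rule-based strategy" idea concrete, here is a minimal sketch in plain Python. It is not spaCy's implementation: the function name split_sentences and the default punctuation set are assumptions chosen for illustration. It only shows the principle of closing a sentence after each punctuation token, with no statistical model involved.

```python
def split_sentences(tokens, punct_chars=(".", "!", "?")):
    """Group tokens into sentences, closing a sentence after each punctuation token.

    Illustrative only: a hypothetical helper, not spaCy's Sentencizer.
    """
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in punct_chars:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that have no closing punctuation
        sentences.append(current)
    return sentences


tokens = ["Apple", "shares", "rose", "on", "the", "news", ".",
          "Apple", "pie", "is", "delicious", "."]
sentences = split_sentences(tokens)
for sentence in sentences:
    print(" ".join(sentence))
```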

Answer 2

From the documentation I found that you can use the spaCy BERT model en_trf_bertbaseuncased_lg to get word embeddings for the tokens of a sentence.

https://spacy.io/universe/project/spacy-transformers

Install the spaCy BERT model and the spacy-transformers module:

pip install spacy-transformers
python -m spacy download en_trf_bertbaseuncased_lg

See the following example, slightly modified from the official documentation:

import spacy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
doc = nlp("Apple shares rose on the news. Apple pie is delicious.")

# doc[0] is the first token, 'Apple'
print(doc[0])  # Apple

# doc[7] is the 8th token, also 'Apple'
print(doc[7])  # Apple

# their embeddings are not the same, because the embeddings are context-sensitive (check with cosine similarity)
print(doc[0].similarity(doc[7]))  # 0.43365735

# get the shape of the last hidden layer
print(doc._.trf_last_hidden_state.shape)  # (16, 768)

# get the word embeddings for all tokens
print(doc._.trf_last_hidden_state)

[[0.34356186 -0.23586863 -0.06684081 ... -0.17081839 0.60623395 0.15930347]
 [0.65008235 0.01991967 -0.11502014 ... -0.5832436 -0.02782568 -0.4122076]
 [1.499611 0.02532474 0.23262465 ... -0.3212682 0.27605072 0.18531999]
 ...
 [0.24741858 0.00736329 -0.28011537 ... -0.1693347 -0.20065884 -0.6950048]
 [0.2859776 0.00303659 -0.37783793 ... 0.37828052 0.041728 -0.5648101]
 [0.7320844 0.11067941 -0.04100507 ... 0.25042596 -0.21909392 -0.31274694]]

The word embeddings can then be used in any further downstream NLP / machine learning task.
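As a small illustration of what "downstream" use can look like, the sketch below computes cosine similarity between two embedding vectors by hand. The toy vectors are made up and only stand in for real embeddings (for example, rows taken from the model output); the helper name cosine_similarity is my own, not part of spaCy.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for two 768-dimensional embeddings
# (assumed values, not real model output).
vec_a = [1.0, 2.0, 0.5]
vec_b = [0.5, 1.0, 2.0]

score = cosine_similarity(vec_a, vec_b)
print(round(score, 4))
```

A score close to 1 means the two vectors point in nearly the same direction; context-sensitive embeddings of the same word in different sentences will generally score below 1.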

Answer 3

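trf_wordpiecer component

accessible via doc._.trf_alignment
performs the model's wordpiece pre-processing

Quote from the docs:

Wordpieces are convenient for training neural networks, but they don't produce segmentations that match up to any linguistic notion of a "word". Most rare words will map to multiple wordpiece tokens, and occasionally the alignment will be many-to-many.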
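To see why rare words end up as several wordpiece tokens, here is a rough sketch of greedy longest-match-first wordpiece splitting, the general scheme BERT-style tokenizers use. The function name and the tiny vocabulary are assumptions for illustration; this is not the trf_wordpiecer code, and the real vocabulary has tens of thousands of entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split a word into wordpieces by greedy longest-match-first lookup.

    Pieces that continue a word carry the '##' prefix. If some part of the
    word cannot be matched, the whole word becomes the unknown token.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary.
toy_vocab = {"apple", "un", "##believ", "##able", "##s"}

print(wordpiece_tokenize("apple", toy_vocab))
print(wordpiece_tokenize("apples", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
```

A frequent word like "apple" stays whole, while a rarer word is broken into several pieces, which is why an alignment between wordpieces and spaCy tokens is needed.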

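trf_tok2vec component

accessible via doc._.trf_last_hidden_state
stores the raw output of the transformer: a tensor with one row per wordpiece token
however, you probably want the token-aligned features in doc.tensor instead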
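The difference between "one row per wordpiece" and "token-aligned" features can be sketched as follows. I am assuming the alignment is a list with one entry per spaCy token, each entry listing the wordpiece row indices for that token, and I am using a plain mean to combine rows. Both the data and the pooling choice are illustrative assumptions, not necessarily what spacy-transformers does internally.

```python
def pool_wordpieces(wordpiece_rows, alignment):
    """Build one vector per token by averaging its aligned wordpiece rows.

    wordpiece_rows: list of equal-length vectors, one per wordpiece.
    alignment: list (one entry per token) of lists of wordpiece row indices.
    Tokens with no aligned wordpiece get a zero vector.
    """
    width = len(wordpiece_rows[0])
    token_rows = []
    for piece_indices in alignment:
        rows = [wordpiece_rows[i] for i in piece_indices]
        if not rows:
            token_rows.append([0.0] * width)
            continue
        token_rows.append([sum(column) / len(rows) for column in zip(*rows)])
    return token_rows


# Toy data: three wordpiece rows for two tokens; the second token was
# split into two wordpieces (assumed values, not real model output).
wordpiece_rows = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
alignment = [[0], [1, 2]]

token_rows = pool_wordpieces(wordpiece_rows, alignment)
print(token_rows)
```

With real data you get as many output rows as there are spaCy tokens, regardless of how many wordpieces the model saw.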

See also this blog article introducing spaCy's transformer integration.

How to use a pretrained transformer model ("en_trf_bertbaseuncased_lg") in spaCy?