Question

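I need to understand the difference between spaCy's en and en_core_web_sm models.

I am trying to perform NER with spaCy (for organization names). Please find the script I am using below: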

import spacy

# Load the small English model by its full package name
nlp = spacy.load("en_core_web_sm")
text = "But Google is starting from behind. The company made a late push \
    into hardware, and Apple’s Siri, available on iPhones, and Amazon’s \
    Alexa software, which runs on its Echo and Dot devices, have clear \
    leads in consumer adoption."
doc = nlp(text)
# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

The above produces no output. But when I use the "en" model:

import spacy

# Load the model through the "en" shortcut name
nlp = spacy.load("en")
text = "But Google is starting from behind. The company made a late push \
    into hardware, and Apple’s Siri, available on iPhones, and Amazon’s \
    Alexa software, which runs on its Echo and Dot devices, have clear \
    leads in consumer adoption."
doc = nlp(text)
# Print each entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

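it gives me the desired output:

Google 4 10 ORG
Apple’s Siri 92 104 ORG
iPhones 119 126 ORG
Amazon 132 138 ORG
Echo and Dot 182 194 ORG

What is going on here? Please help.

Can I get the same output from the en_core_web_sm model as from the en model? If so, please advise how. I need a Python 3 script that takes a pandas df as input. Thanks.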

Answer 1

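Each model is a machine learning model trained on a specific corpus (a text "dataset"). As a result, different models may tag the same entries with different labels, especially since some models were trained on less data than others.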
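To see how two models differ on the same text, something like the following sketch could help. The helper names `entity_set` and `compare_models` are made up for illustration; they only assume that each pipeline returns a doc with `.ents`, as spaCy pipelines do.

```python
def entity_set(nlp, text):
    """Return the set of (entity text, label) pairs a pipeline finds in `text`."""
    doc = nlp(text)
    return {(ent.text, ent.label_) for ent in doc.ents}


def compare_models(nlp_a, nlp_b, text):
    """Split entities into those found by both pipelines and by only one of them."""
    a = entity_set(nlp_a, text)
    b = entity_set(nlp_b, text)
    return {"both": a & b, "only_a": a - b, "only_b": b - a}


# Usage (assumes both model packages are installed):
#   import spacy
#   result = compare_models(spacy.load("en_core_web_sm"),
#                           spacy.load("en_core_web_md"),
#                           "But Google is starting from behind.")
#   print(result)
```

Entities that show up under only_a or only_b are the ones the two models disagree on.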

Currently, spaCy offers 4 English models, as listed at https://spacy.io/models/en/.

According to https://github.com/explosion/spacy-models, a model can be downloaded in a few different ways:

# download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# out-of-the-box: download best-matching default model
python -m spacy download en

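It may be that when you downloaded the "en" model, the best-matching default model was not "en_core_web_sm".

Also keep in mind that these models are updated from time to time, so you may end up with two different versions of the same model.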
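One way to check which package and version each name actually resolves to is to look at the metadata of the loaded pipeline. This is only a sketch: `describe_pipeline` is a hypothetical helper, and it assumes the pipeline exposes the usual `lang`, `name` and `version` keys in `nlp.meta`.

```python
def describe_pipeline(nlp):
    """Return '<lang>_<name> <version>' for a loaded spaCy pipeline."""
    meta = nlp.meta  # dict read from the model package's meta.json
    return f"{meta['lang']}_{meta['name']} {meta['version']}"


# Usage (assumes the models are installed; the "en" shortcut only
# exists on spaCy 2.x installations):
#   import spacy
#   print(describe_pipeline(spacy.load("en")))
#   print(describe_pipeline(spacy.load("en_core_web_sm")))
```

If the two lines print different names or versions, the two loads are not using the same model.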

Answer 2

On my system, the results are the same in both cases.

Code:

import spacy
nlp = spacy.load("en_core_web_sm")
text = """But Google is starting from behind. The company made a late push 
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s  
Alexa software, which runs on its Echo and Dot devices, have clear 
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

import spacy
nlp = spacy.load("en")
text = """But Google is starting from behind. The company made a late push \
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s \
Alexa software, which runs on its Echo and Dot devices, have clear 
leads in consumer adoption."""
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Answer 3

Loading spacy.load('en_core_web_sm') instead of spacy.load('en') should help.
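For the pandas part of the question, a minimal sketch might look like this. The function name `extract_orgs` and the column name `text` are assumptions for illustration, not something from the question; the function accepts any loaded pipeline, so either model name can be passed in.

```python
import pandas as pd


def extract_orgs(df, nlp, text_col="text"):
    """Return one row per ORG entity found in df[text_col]."""
    rows = []
    for idx, text in df[text_col].items():
        doc = nlp(text)
        for ent in doc.ents:
            if ent.label_ == "ORG":  # keep organization names only
                rows.append({
                    "row": idx,               # index of the source row in df
                    "entity": ent.text,
                    "start": ent.start_char,
                    "end": ent.end_char,
                    "label": ent.label_,
                })
    return pd.DataFrame(rows, columns=["row", "entity", "start", "end", "label"])


# Usage (assumes en_core_web_sm is installed):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   df = pd.DataFrame({"text": ["But Google is starting from behind."]})
#   print(extract_orgs(df, nlp))
```

The result is a new DataFrame rather than a column on the input, since one text can contain several organizations.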

Spacy EN model problem
