Question

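I need to run an algorithm on many text files. To preprocess them I use spaCy, which ships pretrained models for different languages. Because the preprocessing results are used in several parts of the algorithm, it would be best to save them to disk once and load them multiple times. However, spaCy's deserialization method raises an error. I wrote a simple piece of code to show the error: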

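import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Load the German model without NER and the parser, then add a rule-based sentencizer
de_nlp = spacy.load("de_core_news_sm", disable=['ner', 'parser'])
de_nlp.add_pipe(de_nlp.create_pipe('sentencizer'))
doc = de_nlp(text_file_content)  # text_file_content holds the text of one input file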

for ix, sent in enumerate(doc.sents, 1):
    print("--Sentence number {}: {}".format(ix, sent))
    lemma = [w.lemma_ for w in sent]
    print(f"Lemma ==> {lemma}")

# Serialization and deserialization
doc.to_disk("/tmp/test_result.bin")
new_doc = Doc(Vocab()).from_disk("/tmp/test_result.bin")

for ix, sent in enumerate(new_doc.sents, 1):
    print("--Sentence number {}: {}".format(ix, sent))
    lemma = [w.lemma_ for w in sent]
    print(f"Lemma ==> {lemma}")

However, the sample code above fails with the following error:

Traceback (most recent call last):
  File "/tmp/test_result.bin", line 14, in <module>
    for ix, sent in enumerate(new_doc.sents, 1):
  File "doc.pyx", line 535, in __get__
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

I am using spaCy version 2.0.18.

Any information on this topic would be appreciated.
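The error message itself hints at one possible workaround: setting `doc[i].is_sent_start` by hand. The sketch below follows that hint by writing the sentence start indices to a small JSON side file and re-applying them after loading. The helper names (`sentence_starts`, `save_sentence_starts`, `restore_sentence_starts`, `round_trip`) and the side-file format are hypothetical, not part of spaCy, and the approach has not been verified against 2.0.18. It also assumes that the loaded `Doc` should reuse the pipeline's vocab rather than an empty `Vocab()`, so that string attributes such as lemmas can be resolved.

```python
import json
from pathlib import Path


def sentence_starts(doc):
    """Return the token index at which each sentence of `doc` begins."""
    return [sent.start for sent in doc.sents]


def save_sentence_starts(doc, path):
    """Write the sentence start indices to a JSON side file (hypothetical format)."""
    Path(path).write_text(json.dumps(sentence_starts(doc)), encoding="utf-8")


def restore_sentence_starts(doc, path):
    """Re-apply saved sentence boundaries to a freshly deserialized doc."""
    starts = set(json.loads(Path(path).read_text(encoding="utf-8")))
    for token in doc:
        # True marks the first token of a sentence, False marks every other token
        token.is_sent_start = token.i in starts
    return doc


def round_trip(nlp, text, doc_path, starts_path):
    """Sketch of the full save/load cycle; needs spaCy and a loaded pipeline."""
    from spacy.tokens import Doc  # imported lazily so the helpers above work without spaCy

    doc = nlp(text)
    doc.to_disk(doc_path)
    save_sentence_starts(doc, starts_path)

    # Assumption: reuse the pipeline's vocab instead of an empty Vocab()
    new_doc = Doc(nlp.vocab).from_disk(doc_path)
    return restore_sentence_starts(new_doc, starts_path)
```

With the pipeline from the question, this would be called as `new_doc = round_trip(de_nlp, text_file_content, "/tmp/test_result.bin", "/tmp/test_result.sents.json")`, where the second path is an arbitrary name chosen for illustration. Keeping the boundaries in a separate file avoids depending on which attributes a given spaCy version serializes.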

Deserializing spaCy results

0 answers: