Multiple spaCy Doc objects that I want to combine into one

Asked: 2019-07-11 15:18:30

Tags: python python-3.x spacy doc

I am trying to tokenize a text file of the King James Bible, but when I do I get a memory error. So I split the text into multiple objects. Now I want to tokenize each object with spaCy and then recombine them into a single Doc object. I have seen others with similar problems talk about converting the Docs to arrays and then converting back to a Doc after combining the arrays. Will that approach solve my problem, or will it just create new problems later?

I tried running the following, but neither Colab nor my own machine has enough RAM for it.

import re

import spacy
from nltk.corpus import gutenberg

nlp_spacy = spacy.load('en')
kjv_bible = gutenberg.raw('bible-kjv.txt')

# Pattern for bracketed text titles
bracks = r"\[.*?\]"

# Strip the bracketed titles, then collapse all whitespace to single spaces
kjv_bible = re.sub(bracks, "", kjv_bible)
kjv_bible = ' '.join(kjv_bible.split())

len(kjv_bible)

kjv_bible_doc = nlp_spacy(kjv_bible)

ValueError                                Traceback (most recent call last)
<ipython-input-19-385936fadd40> in <module>()
----> 1 kjv_bible_doc = nlp_spacy(kjv_bible)

/usr/local/lib/python3.6/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    378         if len(text) > self.max_length:
    379             raise ValueError(
--> 380                 Errors.E088.format(length=len(text), max_length=self.max_length)
    381             )
    382         doc = self.make_doc(text)

ValueError: [E088] Text of length 4305663 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.



nlp_spacy.max_length = 4305663
kjv_bible_doc = nlp_spacy(kjv_bible)

This crashed the notebook because it ran out of RAM.

Would the following work:

np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array2 = doc2.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array = numpy.concatenate([np_array, np_array2])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
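
(A hedged side note on this fragment: numpy arrays have no extend method, and doc2.from_array only fills attributes into an existing Doc of matching length, so concatenating the arrays alone does not produce one longer Doc. Newer spaCy releases (v3+, after this question was asked) provide Doc.from_docs for exactly this; a minimal sketch, where verse_texts is a hypothetical list of chunks pre-split on safe boundaries such as verses:)

from spacy.tokens import Doc

# verse_texts: hypothetical list of strings split on safe boundaries
# (e.g. verses), so no token is cut in half at a chunk edge.
docs = list(nlp.pipe(verse_texts))
combined = Doc.from_docs(docs)  # one Doc spanning all chunks (spaCy v3+)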

1 Answer:

Answer 0 (score: 0)

If you increase max_length, it will still crash unless you explicitly disable the components that use a lot of memory (the parser and NER). If you only need the tokenizer, you can disable every other component when loading the model:

nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
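
A minimal sketch of that suggestion applied to the question's text (assuming only tokenization is needed; the length value comes from the traceback above):

import spacy

# Tokenizer-only pipeline: the memory-hungry parser and NER are disabled.
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
nlp.max_length = 4305663  # must be at least len(kjv_bible)

kjv_bible_doc = nlp(kjv_bible)  # tokenizes without tagging, parsing, or NER
print(len(kjv_bible_doc))       # token count for the whole text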