My Whoosh index is 33 GB in size. It contains 17 segments, ranging from 30 MB to 8 GB.
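For reference, this is roughly how the per-file sizes in the index folder can be listed. It is a minimal sketch using only the standard library; `list_index_files` and `format_size` are hypothetical helper names, and the code makes no assumption about how Whoosh names its segment files.

```python
import os


def list_index_files(ix_folder):
    """Return (name, size_in_bytes) pairs for the files in ix_folder, largest first."""
    entries = []
    with os.scandir(ix_folder) as it:
        for entry in it:
            # Skip subdirectories; only regular files are of interest here.
            if entry.is_file():
                entries.append((entry.name, entry.stat().st_size))
    return sorted(entries, key=lambda item: item[1], reverse=True)


def format_size(num_bytes):
    """Format a byte count with binary units (1 KB = 1024 B), capped at GB."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024 or unit == "GB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Calling `list_index_files(ix_folder)` and printing each pair with `format_size(size)` gives a quick overview of which files dominate the index.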
Trying to get a Searcher object
from whoosh import index
ix = index.open_dir(ix_folder)
with ix.searcher() as searcher:
    pass
raises a MemoryError:
Traceback (most recent call last):
...
    with ix.searcher() as searcher:
  File "C:\Program Files\Python3\lib\site-packages\whoosh\index.py", line 318, in searcher
    return Searcher(self.reader(), fromindex=self, **kwargs)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\index.py", line 548, in reader
    info.generation, reuse=reuse)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\index.py", line 535, in _reader
    readers = [segreader(segment) for segment in segments]
  File "C:\Program Files\Python3\lib\site-packages\whoosh\index.py", line 535, in <listcomp>
    readers = [segreader(segment) for segment in segments]
  File "C:\Program Files\Python3\lib\site-packages\whoosh\index.py", line 524, in segreader
    generation=generation)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\reading.py", line 620, in __init__
    self._terms = self._codec.terms_reader(self._storage, segment)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\codec\whoosh3.py", line 122, in terms_reader
    postfile = segment.open_file(storage, self.POSTS_EXT)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\codec\base.py", line 556, in open_file
    return storage.open_file(fname, **kwargs)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\filedb\filestore.py", line 333, in open_file
    return self.a.open_file(name, *args, **kwargs)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\filedb\compound.py", line 121, in open_file
    f = BufferFile(buf, name=name)
  File "C:\Program Files\Python3\lib\site-packages\whoosh\filedb\structfile.py", line 357, in __init__
    self.file = BytesIO(buf)
MemoryError
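The last frames of the traceback show the error being raised by `BytesIO(buf)`. Below is a minimal sketch of why that call could need about as much memory as the sub-file being opened. It assumes `buf` is a view over a memory-mapped segment file, which is a guess about Whoosh internals that the traceback alone does not confirm. When `BytesIO` is given a `memoryview`, it copies the data instead of referencing it. `bytesio_copy_cost` is a hypothetical helper that uses only the standard library and assumes CPython, where `tracemalloc` is available.

```python
import mmap
import tempfile
import tracemalloc
from io import BytesIO


def bytesio_copy_cost(size):
    """Return the peak bytes allocated while building BytesIO from an mmap view."""
    with tempfile.TemporaryFile() as f:
        f.write(b"\0" * size)
        f.flush()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            view = memoryview(mapped)  # no copy yet: just a window onto the mapping
            tracemalloc.start()
            try:
                copy = BytesIO(view)  # same call shape as the failing line
                _, peak = tracemalloc.get_traced_memory()
            finally:
                tracemalloc.stop()
                view.release()  # the view must be released before the mmap closes
            copy.close()
    return peak
```

With a 1 MB file, the reported peak is at least 1 MB, so under the stated assumption an 8 GB segment could require a multi-gigabyte allocation at this step.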
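Sometimes Whoosh consumes 500 MB of memory, and at other times it takes all available memory (about 4 GB). Sometimes Whoosh works, but a minute later it may stop working, regardless of how much memory it is using. This does not depend on which computer I use.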
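Since the behaviour does not seem to depend on the machine, one thing that may be worth ruling out is the interpreter build. A 32-bit Python process can address only a few gigabytes no matter how much RAM is installed. Nothing above establishes that this is the cause, so the following is only a diagnostic sketch. `interpreter_bits` and `fits_in_address_space` are hypothetical helper names.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width of the running interpreter, in bits."""
    return struct.calcsize("P") * 8


def fits_in_address_space(num_bytes):
    """Rough check: is num_bytes below the largest size this interpreter can index?"""
    return num_bytes < sys.maxsize
```

For example, `fits_in_address_space(33 * 1024**3)` is false on a 32-bit build and true on a 64-bit one.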
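So my question is: how can I fix the memory error for a large index? I tried recreating the index, but that did not help. Is there a way to limit memory consumption, clear a cache somehow, or something similar? Creating one big segment is not an option.
Here is the index structure: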
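from whoosh import index
from whoosh.fields import Schema, ID, DATETIME, NUMERIC, TEXT

def create_index(ix_folder):
    # numid is the unique document key; title carries a field boost of 1.15.
    schema = Schema(numid=ID(stored=True, unique=True),
                    date1=DATETIME(),
                    date2=DATETIME(),
                    days=NUMERIC(stored=True),
                    title=TEXT(stored=True, phrase=True, field_boost=1.15),
                    abstract=TEXT(phrase=True),
                    article_text=TEXT(phrase=True),
                    numc=NUMERIC(stored=True),
                    numq=NUMERIC(stored=True),
                    types=ID(),
                    )
    # Create a new, empty index with this schema in ix_folder.
    index.create_in(ix_folder, schema)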