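I am trying to create an index with Whoosh from a 150 MB file, but it fails with the error "list index out of range". The line I traced the error to is for x in range(len(id)):. The logical index of each record is supposed to match the ID number of the document.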
import os
from whoosh import index
from whoosh.fields import Schema,ID, TEXT,NUMERIC
from whoosh.index import create_in
id = []
body = []
Score = []
count=0
doc_path='C:/Users/Abhi/Desktop/My_Experiments_with_truth/extracted_xml.txt'
with open(doc_path,'r+',encoding="utf8") as line:
    for f in line:
        count=count+1
        if f.startswith('Id : '):
            a = f.replace('Id : ','')
            id.append(a)
            #print(a)
        elif f.startswith('body : '):
            b = f.replace('body : ','')
            body.append(b)
            #print(b)
        elif f.startswith('Score :'):
            c = f.replace('Score :','')
            Score.append(c)
            #print(c)
if not os.path.exists("index"):
    os.mkdir("index")
#design the Schema
schema=Schema(id_details=ID(stored=True),body_details=TEXT(stored=True),Score_details=NUMERIC(stored=True))
print(schema)
#creation of the index
ix = index.create_in("index", schema)
writer = ix.writer()
#Opening writer
for x in range(len(id)):
    writer.add_document(id_details=id[x],body_details=body[x],Score_details=Score[x])
writer.commit()
print("Index created")
Answer 0: (score: 1)
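I think the problem is not with Whoosh but with the way you parse the input file. If the data in the input file is inconsistent (for example, a record that is missing its body or Score line), you will end up with lists id, body, Score of different sizes, which makes this line fail:
writer.add_document(id_details=id[x],body_details=body[x],Score_details=Score[x])
because x is only bounded by the length of the id list: range(len(id)). As soon as body or Score is shorter than id, body[x] or Score[x] raises "list index out of range".
Try to improve your parsing of the file, or at least bound x by the lengths of all three lists id, body, Score.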
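
One way to make the parsing more robust is to build whole records instead of three independent lists, so an incomplete record can be skipped rather than shifting everything after it. The following is only a minimal sketch under some assumptions that the question does not confirm: every record starts with an "Id : " line, each field fits on a single line, and the function name parse_records is made up for illustration.

```python
def parse_records(lines):
    """Group 'Id : ', 'body : ' and 'Score :' lines into complete records.

    Assumes (not confirmed by the question) that every record starts with
    an 'Id : ' line and that each field occupies exactly one line.
    Returns (records, skipped), where skipped counts incomplete records.
    """
    records = []
    skipped = 0
    current = None  # the record currently being filled in

    def flush(rec):
        # Keep a record only if all three fields were seen.
        nonlocal skipped
        if rec is None:
            return
        if {"id", "body", "score"} <= rec.keys():
            records.append(rec)
        else:
            skipped += 1

    for line in lines:
        if line.startswith("Id : "):
            flush(current)  # a new Id line closes the previous record
            current = {"id": line[len("Id : "):].strip()}
        elif current is not None and line.startswith("body : "):
            current["body"] = line[len("body : "):].strip()
        elif current is not None and line.startswith("Score :"):
            current["score"] = line[len("Score :"):].strip()
    flush(current)  # do not forget the last record in the file
    return records, skipped


if __name__ == "__main__":
    # Small in-memory sample; the second record has no body line.
    sample = [
        "Id : 1\n", "body : first\n", "Score : 5\n",
        "Id : 2\n", "Score : 3\n",
    ]
    recs, bad = parse_records(sample)
    print(len(recs), "complete record(s),", bad, "skipped")
```

With records built this way, the indexing loop can iterate over the records directly, for example writer.add_document(id_details=rec["id"], body_details=rec["body"], Score_details=rec["score"]), so there is no shared index x that can run past the end of a shorter list. The skipped count also tells you how many malformed records the file contains, which is worth checking before indexing 150 MB of data.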