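This is my first attempt at text search with Whoosh. I want to find documents that contain the word "XML". Since I am new to Whoosh, I just wrote a program that searches a single document for a word. The document is a text file (myRoko.txt):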
import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *

if not os.path.exists("indexdir3"):
    os.mkdir("indexdir3")
schema = Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()
path = "myRoko.txt"
with open(path, "r") as f:
    content = f.read()
    f.close()
writer.add_document(name=path, content=content)
writer.commit()
ix = open_dir("indexdir3")
query_b = QueryParser('content', ix.schema).parse('XML')
with ix.searcher() as srch:
    res_b = srch.search(query_b)
    print res_b[0]
The code above should print the document that contains the word "XML", but it raises the following error instead:
raise ValueError("%r is not unicode or sequence" % value)
ValueError: 'A large number of documents are now represented and stored
as XML document on the web. Thus ................
What causes this error?
Answer 0 (score: 1)
You have a Unicode problem. You should pass unicode strings to the indexer. To do that, open the text file as unicode:
import codecs
with codecs.open(path, "r", "utf-8") as f:
    content = f.read()
and use a unicode string for the file name:
path = u"myRoko.txt"
After these fixes I got this result:
<Hit {'name': u'myRoko.txt'}>
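As a rough sketch of the same fix, `io.open` also decodes while reading and behaves the same on Python 2 and Python 3. The helper name `read_text`, the UTF-8 default, and the temporary demo file are assumptions made for illustration; they are not part of the original program or of Whoosh.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the file's contents as a text (unicode) string.

    Hypothetical helper; the UTF-8 default is an assumption about the file.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Throwaway file so the snippet runs on its own (stands in for myRoko.txt).
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"A large number of documents are stored as XML on the web.")

content = read_text(demo_path)
os.remove(demo_path)

# type(u"") is `unicode` on Python 2 and `str` on Python 3.
assert isinstance(content, type(u""))
```

A string obtained this way could then be passed as the `content` argument of `writer.add_document`, together with a unicode `name`.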
Answer 1 (score: 0)
writer.add_document(name=unicode(path), content=unicode(content))
The values passed to add_document must be unicode.
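One caveat, offered as a sketch rather than a definitive fix: on Python 2, `unicode(content)` decodes with the default ASCII codec, so it raises UnicodeDecodeError when the file contains non-ASCII bytes. Decoding with an explicit encoding avoids that. The helper `to_text` below is hypothetical, and UTF-8 is an assumed encoding.

```python
def to_text(value, encoding="utf-8"):
    """Coerce a byte string to text; return text values unchanged.

    Hypothetical helper; `encoding` must match how the bytes were written.
    """
    if isinstance(value, bytes):  # `bytes` is `str` on Python 2
        return value.decode(encoding)
    return value


# Bytes as they would come from a file opened without decoding.
raw = u"caf\u00e9 XML".encode("utf-8")
text = to_text(raw)

assert text == u"caf\u00e9 XML"
assert to_text(u"already text") == u"already text"
```

With such a helper, the call above would become `writer.add_document(name=to_text(path), content=to_text(content))`.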