如何使用Whoosh创建索引

时间:2015-06-11 11:04:53

标签: python search indexing unicode whoosh

我第一次尝试使用Whoosh进行文本搜索。我想搜索包含单词" XML"的文档。但是因为我是Whoosh的新手,所以我只是编写了一个程序来搜索文档中的单词。文档是文本文件(myRoko.txt)

import os, os.path
from whoosh import index
from whoosh.index import open_dir
from whoosh.fields import Schema, ID, TEXT
from whoosh.qparser import QueryParser
from whoosh.query import *

if not os.path.exists("indexdir3"):
   os.mkdir("indexdir3")

schema =  Schema(name=ID(stored=True), content=TEXT)
ix = index.create_in("indexdir3", schema)
writer = ix.writer()
path = "myRoko.txt"

with open(path, "r") as f:
   content = f.read()
   f.close()
   writer.add_document(name=path, content= content)

  writer.commit()

  ix = open_dir("indexdir3")
  query_b = QueryParser('content', ix.schema).parse('XML')
  with ix.searcher() as srch:
    res_b = srch.search(query_b)
    print res_b[0]

上面的代码应该打印包含单词" XML"的文档。但是代码返回以下错误:

    raise ValueError("%r is not unicode or sequence" % value)

    ValueError: 'A large number of documents are now represented and stored      
    as XML document on the web. Thus ................

导致此错误的原因是什么?

2 个答案:

答案 0 :(得分:1)

您遇到Unicode问题。您应该将unicode字符串传递给索引器。为此,您需要以unicode打开文本文件:

import codecs
with codecs.open(path, "r","utf-8") as f:
   content = f.read()

并使用unicode字符串作为文件名:

path = u"myRoko.txt"

修复后我得到了这个结果:

<Hit {'name': u'myRoko.txt'}>

答案 1 :(得分:0)

writer.add_document(name=unicode(path), content=unicode(content))

必须是UNICODE