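I am using Whoosh to index and search a variety of texts in a variety of encodings. When I search the index, however, some of the matches do not appear in the output produced with the "highlights" feature. I suspect this is related to an encoding error, but I can't figure out what is preventing all of the results from being displayed. I would be very grateful to anyone who can shed light on this mystery.
Below is the script I use to create the index, and here are the files I am indexing: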
from whoosh.index import create_in
from whoosh.fields import *
import glob, os, chardet

encodings = ['utf-8', 'ISO-8859-2', 'windows-1250', 'windows-1252', 'latin1', 'ascii']

def determine_string_encoding(string):
    result = chardet.detect(string)
    string_encoding = result['encoding']
    return string_encoding

#specify a list of paths that contain all of the texts we wish to index
#(raw strings, so the backslashes are never read as escape sequences)
text_dirs = [
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\hume",
    r"C:\Users\Douglas\Desktop\intertextuality\sample_datasets\complete_pope\clean"
]

#establish the schema to be used when storing texts; storing content allows us to retrieve highlighted extracts from texts in which matches occur
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

#check to see if we already have an index directory. If we don't, make it
if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

#create writer object we'll use to write each of the documents in text_dirs to the index
writer = ix.writer()

#create file in which we can write the encoding of each file to disk for review
with open("encodings_log.txt","w") as encodings_out:
    #for each directory in our list
    for i in text_dirs:
        #for each text file in that directory (j is now the path to the current file within the current directory)
        for j in glob.glob( i + "\\*.txt" ):
            #first, let's grab the title of j. If the title is stored in the text file name, we can use this method:
            text_title = j.split("\\")[-1]
            #now let's read the file
            with open( j, "r" ) as text_content:
                text_content = text_content.read()
            #use method defined above to determine encoding of path and text_content
            path_encoding = determine_string_encoding(j)
            text_content_encoding = determine_string_encoding(text_content)
            #because we know the encoding of the files in this directory, let's override the previous text_content_encoding value and specify that encoding explicitly
            if "clean" in j:
                text_content_encoding = "iso-8859-1"
            #decode text_title, path, and text_content to unicode using the encodings we determined for each above
            unicode_text_title = unicode(text_title, path_encoding)
            unicode_text_path = unicode(j, path_encoding)
            unicode_text_content = unicode(text_content, text_content_encoding)
            #use writer method to add document to index
            writer.add_document( title = unicode_text_title, path = unicode_text_path, content = unicode_text_content )

#after you've added all of your documents, commit changes to the index
writer.commit()
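The indexing script above seems to index the texts without any problems, but when I use the following script to query the index, I get three blank values in the out.txt output file: the first two lines and the sixth line are empty, and I expect none of those three lines to be empty. Here is the script I use to query the index: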
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    #to enable Levenshtein-based parse, use plugin
    parser.add_plugin(FuzzyTermPlugin())
    #using ~2/3 means: allow for edit distance of two (where insertions, deletions, and substitutions each cost one), but only count matches for which first three letters match. Increasing this prefix length greatly increases speed
    query = parser.parse(u"swallow~2/3")
    results = searcher.search(query)
    #see whoosh.query.phrase, which describes "slop" parameter (ie: number of words we can insert between any two words in our search query)
    #write query results to disk or html
    with codecs.open("out.txt","w") as out:
        for i in results[0:]:
            title = i["title"]
            highlight = i.highlights("content")
            clean_highlight = " ".join(highlight.split())
            out.write(clean_highlight.encode("utf-8") + "\n")
If anyone can explain why those three lines are empty, I will be eternally grateful.
Answer 0 (score: 3)
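Holy Moly, I may have figured this out! It seems that some of my text files (including both of the files with "hume" in the path) exceeded a threshold that governs Whoosh's index-creation behavior. If one tries to index a file that is too large, Whoosh appears to store that text as a string value rather than a unicode value. So, assuming one has an index containing the fields "path" (file path), "title" (file title), "content" (file contents), and "encoding" (the encoding of the current file), one can test whether the files in that index were indexed properly by running a script like this: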
from whoosh.qparser import QueryParser
from whoosh.qparser import FuzzyTermPlugin
from whoosh.index import open_dir
import codecs

#now that we have an index, we can open it with open_dir
ix = open_dir("index")

phrase_to_search = unicode("swallow")

with ix.searcher() as searcher:
    parser = QueryParser("content", schema=ix.schema)
    query = parser.parse( phrase_to_search )
    results = searcher.search(query)
    for hit in results:
        hit_encoding = (hit["encoding"])
        with codecs.open(hit["path"], "r", hit_encoding) as fileobj:
            filecontents = fileobj.read()
            hit_highlight = hit.highlights("content", text=filecontents)
            hit_title = (hit["title"])
            print type(hit_highlight), hit["title"]
If any of the printed values has type "str", then the highlighter seems to be treating part of the given file as a string rather than as unicode.
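A rough way to cross-check which files are likely to be affected, without going through Whoosh at all, is to look at where the search term first occurs in each file. The sketch below is only that, a sketch: the 32768-character cutoff is an assumption standing in for the threshold described in this answer, the names `ASSUMED_CHAR_LIMIT`, `first_match_offset`, and `match_is_past_limit` are made up for illustration, and it does plain case-insensitive substring matching rather than anything Whoosh's analyzer or highlighter actually does. It is written for Python 3.

```python
import re

# Assumption: the cutoff is 32768 characters (the "32K" threshold discussed in this answer).
ASSUMED_CHAR_LIMIT = 32768


def first_match_offset(text, term):
    """Return the offset of the first case-insensitive occurrence of term in text, or -1."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    return match.start() if match else -1


def match_is_past_limit(text, term, char_limit=ASSUMED_CHAR_LIMIT):
    """Return True when the first occurrence of term starts at or beyond char_limit."""
    offset = first_match_offset(text, term)
    # offset is -1 when the term is absent, so the comparison is False in that case.
    return offset >= char_limit
```

Files for which `match_is_past_limit` returns True are the ones I would expect to produce empty highlights, if the cutoff really does behave the way it is assumed to here.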
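Here are two ways to work around the problem: 1) Split large files (anything over 32K characters) into smaller files, each containing fewer than 32K characters, and index those smaller files. This approach requires more curation but ensures reasonable processing speed. 2) Set a parameter on the results variable to raise the maximum number of characters that can be stored as unicode and that will therefore, in the example above, print correctly to the terminal. To implement this solution in the code above, one can add the following line after the line that defines results: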
results.fragmenter.charlimit = 100000
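Adding this line lets any results found within the first 100000 characters of a given file print to the terminal, though it significantly increases processing time. Alternatively, one can remove the character limit entirely with results.fragmenter.charlimit = None, though this really does increase processing time when working with large files...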
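For the first approach (splitting large files), something along these lines could do the splitting before indexing. This is a minimal sketch under stated assumptions, not a tested part of my pipeline: `split_into_chunks` is a hypothetical helper, the default of 30000 characters is simply a value picked to stay under the 32K figure, and the text is assumed to be already decoded. It is written for Python 3.

```python
def split_into_chunks(text, max_chars=30000):
    """Split text into pieces of at most max_chars characters.

    Breaks on the last space or newline inside each window when one exists,
    so words are not cut in half; falls back to a hard cut otherwise.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last whitespace character inside text[start:end].
            cut = max(text.rfind(" ", start, end), text.rfind("\n", start, end))
            if cut > start:
                end = cut + 1  # keep the whitespace at the end of this chunk
        chunks.append(text[start:end])
        start = end
    return chunks
```

Each chunk could then be added to the index as its own document, for instance with a part number appended to the title, so that every stored "content" value stays under the limit. Joining the chunks back together reproduces the original text exactly.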