Building an index over a text file and searching it

Time: 2016-02-29 06:44:52

Tags: python text indexing full-text-search

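I have a text file containing the contents of a book. I want to take this file and build an index that lets users search it.

A search consists of entering a word. The program then returns the following:

  • Every chapter that contains the word.
  • The line number of each line that contains the word.
  • The full line containing the word.

I tried the following code: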

infile =   open(file)

Dict = {}

word = input("Enter a word to search: ")

linenum = 0
line = infile.readline()
for line in infile
    linenum += 1
    for word in wordList:
        if word in line:
            Dict[word] = Dict.setdefault(word, []) + [linenum]
            print(count, word)
    line = infile.readline()

return Dict

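Something like this doesn't work, and it seems too awkward to extend to the other features I need:

  • An "or" operator to search for one word or another
  • An "and" operator to search for one word and another within the same chapter

Any suggestions would be great.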

1 answer:

Answer 0: (score: 1)

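import string


def classify_lines_on_chapter(book_contents):
    # Map every line index to the chapter heading it falls under.
    lines_vs_chapter = []
    # Lines before the first heading have no chapter yet.
    current_chapter = None
    for line in book_contents:
        # An all-uppercase line is treated as a chapter heading.
        if line.isupper():
            current_chapter = line.strip()
        lines_vs_chapter.append(current_chapter)
    return lines_vs_chapter


def classify_words_on_lines(book_contents):
    # Map every word to the list of line indices on which it occurs.
    words_vs_lines = {}
    for i, line in enumerate(book_contents):
        # Strip surrounding punctuation; the set avoids recording a line twice.
        for word in {word.strip(string.punctuation) for word in line.split()}:
            if word:
                words_vs_lines.setdefault(word, []).append(i)
    return words_vs_lines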


def main():
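    # Number of leading lines to skip before the book text begins
    # (presumably the front matter of the linked input file).
    skip_lines = 93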

    with open('book.txt') as book:
        book_contents = book.readlines()[skip_lines:]

    lines_vs_chapter = classify_lines_on_chapter(book_contents)
    words_vs_lines = classify_words_on_lines(book_contents)

    while True:
        word = input("Enter word to search - ")
        # Enter a blank input to exit
        if not word:
            break

        line_numbers = words_vs_lines.get(word, None)
        if not line_numbers:
            print("Word not found!!\n")
            continue

        for line_number in line_numbers:
            line = book_contents[line_number]
            chapter = lines_vs_chapter[line_number]
            print("Line " + str(line_number + 1 + skip_lines))
            print("Chapter '" + str(chapter) + "'")
            print(line)


if __name__ == '__main__':
    main()

Try it on this input file. Rename it to book.txt before running.
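
The question also asked for "or" and "and" operators, which the code above does not cover. The following is a minimal sketch, not part of the original answer, of how they might be layered on the same kind of index using set union and intersection. The names build_index, search_or and search_and_in_chapter and the sample lines are hypothetical, chosen only for illustration; the chapter heuristic (an all-uppercase line is a heading) is carried over from the answer.

```python
import string


def build_index(lines):
    """Return (words_vs_lines, lines_vs_chapter) for a list of text lines."""
    words_vs_lines = {}      # word -> set of 0-based line indices
    lines_vs_chapter = []    # line index -> chapter heading (or None)
    current_chapter = None   # lines before the first heading have no chapter
    for i, line in enumerate(lines):
        # Assumption carried over from the answer: uppercase line = heading.
        if line.isupper():
            current_chapter = line.strip()
        lines_vs_chapter.append(current_chapter)
        for word in {w.strip(string.punctuation) for w in line.split()}:
            if word:
                words_vs_lines.setdefault(word, set()).add(i)
    return words_vs_lines, lines_vs_chapter


def search_or(words_vs_lines, first, second):
    """Sorted indices of lines containing either word (set union)."""
    hits = words_vs_lines.get(first, set()) | words_vs_lines.get(second, set())
    return sorted(hits)


def search_and_in_chapter(words_vs_lines, lines_vs_chapter, first, second):
    """Sorted indices of lines containing either word, restricted to
    chapters in which both words occur (set intersection on chapters)."""
    first_lines = words_vs_lines.get(first, set())
    second_lines = words_vs_lines.get(second, set())
    first_chapters = {lines_vs_chapter[i] for i in first_lines}
    second_chapters = {lines_vs_chapter[i] for i in second_lines}
    shared = first_chapters & second_chapters
    return sorted(i for i in first_lines | second_lines
                  if lines_vs_chapter[i] in shared)


# Hypothetical sample text standing in for book.txt.
sample = [
    "CHAPTER ONE\n",
    "The whale surfaced near the ship.\n",
    "A storm followed.\n",
    "CHAPTER TWO\n",
    "The ship reached the harbor.\n",
]

words, chapters = build_index(sample)
print(search_or(words, "whale", "harbor"))                   # [1, 4]
print(search_and_in_chapter(words, chapters, "storm", "ship"))    # [1, 2]
print(search_and_in_chapter(words, chapters, "storm", "harbor"))  # []
```

Storing sets instead of lists makes the union and intersection direct. The results are 0-based line indices, so add 1 (plus any skipped lines) before printing them as line numbers.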