Creating a document index of word positions

Time: 2017-02-06 13:57:25

Tags: python python-3.x indexing

Question:

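I want to build an index in Python: a data structure that stores every word from a given text file, together with the line numbers of all lines on which each word appears and the word's position (column number) within each of those lines.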

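So far I can store the words in a dictionary, appending each line number to a list, but I cannot work out how to store the word's position within that line.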

I need this data structure to make searching the text file faster.
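To make the goal concrete, here is a minimal sketch of the shape such an index might take and how a lookup would use it. The sample `index` contents and the `find_word` helper are hypothetical, written by hand for illustration rather than built from a real file.

```python
# Hypothetical hand-built index for the two-line text:
#   line 0: "the cat sat"
#   line 1: "on the mat"
# Each word maps to a list of (line_number, column) pairs, both 0-based.
index = {
    "the": [(0, 0), (1, 3)],
    "cat": [(0, 4)],
    "sat": [(0, 8)],
    "on": [(1, 0)],
    "mat": [(1, 7)],
}


def find_word(index, word):
    """Return every (line, column) where word occurs, or an empty list."""
    return index.get(word, [])


print(find_word(index, "the"))
print(find_word(index, "dog"))
```

A lookup is then a single dictionary access instead of a scan over every line of the file, which is where the speed-up would come from.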

Here is my code so far:

from collections import defaultdict

# read the file and split it into lines
with open('file.txt', 'r') as thetextfile:
    file_s = thetextfile.read().split("\n")

# maps each word to the list of line numbers it appears on
wordlist = defaultdict(list)
for lineNumber, line in enumerate(file_s):
    for word in line.split(" "):
        wordlist[word].append(lineNumber)

print(wordlist)

1 Answer:

Answer 0: (score: 0)

Here is some code that stores the line number and column of each word in a text document:

from collections import defaultdict, namedtuple

# build a named tuple for the word locations
Location = namedtuple('Location', 'line col')

# dict keyed by word in document
word_locations = defaultdict(list)

# go through each line in the document (line and column numbers are 0-based)
with open('my_words.txt', 'r') as document:
    for line_num, line in enumerate(document):
        # drop the trailing newline so it never ends up inside a word
        line = line.rstrip('\n')
        column = -1
        prev_col = 0

        # process the line, one word at a time
        while True:
            if prev_col < column:
                word = line[prev_col:column]
                word_locations[word].append(Location(line_num, prev_col))
            prev_col = column + 1

            # find the next space
            column = line.find(' ', prev_col)

            # check for more spaces on the line
            if column == -1:

                # there are no more spaces on the line, store the last word
                # (skipped when the line is empty or ends with a space)
                word = line[prev_col:]
                if word:
                    word_locations[word].append(Location(line_num, prev_col))

                # go onto the next line
                break

print(word_locations)
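
As a variation on the loop above (not the answer's own method), the same index could be built with `re.finditer`, assuming a word is any run of non-whitespace characters. Each match reports its own start offset, so the column comes for free, and tabs or repeated spaces need no special handling. The `build_index` helper and the sample lines are hypothetical names introduced for this sketch.

```python
import re
from collections import defaultdict, namedtuple

Location = namedtuple('Location', 'line col')


def build_index(lines):
    """Map each word to the list of Locations (0-based) where it appears."""
    index = defaultdict(list)
    for line_num, line in enumerate(lines):
        # \S+ matches a maximal run of non-whitespace characters
        for match in re.finditer(r'\S+', line):
            index[match.group()].append(Location(line_num, match.start()))
    return index


# hypothetical sample input; note the double space before "sat"
sample = ["the cat  sat\n", "on the mat\n"]
index = build_index(sample)
print(index["the"])
print(index["sat"])
```

Because `build_index` only iterates over its argument, an open file object can be passed in place of the sample list, for example inside `with open('my_words.txt') as f:`.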