I have a text file where every line has a bunch of text (the real file has no line numbers), like this:
line#: text:
0      This is some text
1      More text
2      whats for lunch
I want a function that returns a dictionary mapping each word to its line numbers, essentially building an inverted index.
i.e. {'This': {0}, 'text': {0, 1}, 'for': {2}, ...}
After scanning the text file (which takes 0.18s), I put the lines into a list of lists, so that each position in the outer list stores a split line, i.e.:
[['This', 'is', 'some', 'text'], ['More', ...] ...]
After that I use enumerate() to extract the positions and build the dictionary. I already have a solution, but it is so ugly and took me so long that I wanted to see a more elegant one.
For reference, my algorithm runs in 882.28 seconds, i.e. 15 minutes, on 1099 lines and 753210 words. In other words, decidedly not pythonic.
import time

def invidx(strlist):
    # time the file scan
    start = time.time()
    f = open(strlist, 'r')
    wordLoc = []
    for line in f:
        s = line.split()
        wordLoc.append(list(s))
    f.close()
    # benchmark
    print 'job completed in %.2fs' % (time.time() - start)
    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b:
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)
    return q
Edit:
Per request, the code is above. Be gentle.
Answer 0 (score: 3)
You can use enumerate while initially scanning the file to get the line numbers, and add each line number to a set in the dict as you go.
myfile.txt:
a b c
b x y
a c b
Indexing it:
index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)
index
=> {'a': set([0, 2]),
    'b': set([0, 1, 2]),
    'c': set([0, 2]),
    'x': set([1]),
    'y': set([1])}
Answer 1 (score: 2)
The thing causing the slowdown is this line:
l = {a for a, b in enumerate(wordLoc) if w in b}
Every time you encounter a word you haven't seen yet, you re-enumerate every line to see whether it contains that word. That contributes O(NumberOfUniqueWords * NumberOfLines) operations overall, which is quadratic in the input size.
You are already enumerating every word of every line anyway. Why not just add them as you go?
for w in b:
    if w not in q: q[w] = []
    q[w].append(a)
This should take O(NumberOfWords) time, which is linear in the input size rather than quadratic(ish): you touch each word once as it comes up, instead of rescanning every line for each unique word.
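For completeness, here is a minimal self-contained sketch of that linear pass (illustrative names, not the answerer's exact code); it assumes wordLoc is the question's list of split lines, and uses a set per word to match the question's desired output rather than the lists above:

# Sketch of the single linear pass; wordLoc is assumed to be the
# question's list of split lines, and sets replace the lists above
# to match the question's desired output.
def invidx_linear(wordLoc):
    q = {}
    for a, b in enumerate(wordLoc):   # a = line number, b = words on that line
        for w in b:
            if w not in q:
                q[w] = set()
            q[w].add(a)               # one O(1) update per word occurrence
    return q

lines = [['This', 'is', 'some', 'text'], ['More', 'text'], ['whats', 'for', 'lunch']]
print invidx_linear(lines)   # 'text' maps to set([0, 1]), every other word to one line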
Answer 2 (score: 1)
You can use collections.defaultdict:
from collections import defaultdict

dic = defaultdict(set)
with open('abc') as f:
    for i, line in enumerate(f):  # enumerate yields the line number as well as the line
        words = line.split()      # split the line using str.split()
        for word in words:        # iterate over the words and add each to its corresponding set
            dic[word.lower()].add(i)
print dic
Output:
defaultdict(<type 'set'>,
            {'whats': set([2]),
             'for': set([2]),
             'this': set([0]),
             'text': set([0, 1]),
             'is': set([0]),
             'some': set([0]),
             'lunch': set([2]),
             'more': set([1])})
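As a usage sketch (not part of the answer), the finished index answers multi-word queries by intersecting the per-word sets; lines_with_all below is a hypothetical helper:

# Hypothetical helper: return the lines containing all of the given words.
# Membership is checked before subscripting, because indexing a
# defaultdict silently inserts missing keys.
def lines_with_all(dic, words):
    sets = [dic[w.lower()] for w in words if w.lower() in dic]
    if not sets or len(sets) < len(words):
        return set()   # no words given, or some word never occurs
    return set.intersection(*sets)

print lines_with_all(dic, ['some', 'text'])   # -> set([0]) for the sample file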
Answer 3 (score: 0)
This seems to work, and I believe it is faster than your version:
from time import time

def invidx(strlist):
    # time the file scan
    start = time()
    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))
    # benchmark
    print 'job completed in %.2fs' % (time() - start)
    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences
    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n'
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)
    return q

from pprint import pprint
pprint(invidx('strlist.txt'))
Output from a trivial test file:
job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}