My inverted index is very slow, any suggestions?

Date: 2013-07-05 19:32:13

Tags: python algorithm

I have a text file where every line contains some text. (The actual file has no line numbers.) It looks like this:

line#:     text:
0          This is some text
1          More text
2          whats for lunch

I want a function that returns a dictionary mapping each word to its line numbers, essentially building an inverted index.

i.e. {'This': {0}, 'text': {0, 1}, 'for': {2} ... }

After scanning the text file (which takes 0.18s), I put the lines into a list of lists, so that each position in the list stores a split line, i.e.:

[['This', 'is', 'some', 'text'], ['More', ...] ...]

After that I use enumerate() to extract the positions and build the dictionary. I already have a solution, but it is so ugly, and took me so long, that I would like to see a more elegant one.

For reference, my algorithm runs in 882.28 seconds, about 15 minutes, on 1099 lines and 753210 words. In other words, definitely not Pythonic.

import time

def invidx(strlist):
    # time the algorithm's execution
    start = time.time()

    f = open(strlist, 'r')
    wordLoc = []
    for line in f:    
        s = line.split()
        wordLoc.append(list(s)) 
    f.close()

    # benchmark
    print 'job completed in %.2fs' % (time.time() - start) 

    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b:
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)                  

    return q

编辑:

As requested, the code is above. Be nice to us.

4 Answers:

Answer 0 (score: 3):

You can use enumerate to get the line numbers during the initial scan of the file, adding each one to a set in the dict as you go.

myfile.txt:

a b c
b x y
a c b

Indexing it:

index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)

index
=> {'a': set([0, 2]),
 'b': set([0, 1, 2]),
 'c': set([0, 2]),
 'x': set([1]),
 'y': set([1])}
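Once the index is built, lookups are just set operations. A minimal sketch (rebuilding the same index from an in-memory list of lines, so no file is needed):

```python
# Same index as above, built from an in-memory list of lines.
lines = ["a b c", "b x y", "a c b"]
index = {}
for line_num, line in enumerate(lines):
    for word in line.split():
        index.setdefault(word, set()).add(line_num)

# All lines containing 'b':
print(sorted(index["b"]))               # [0, 1, 2]

# Lines containing both 'a' and 'b': intersect the two sets.
print(sorted(index["a"] & index["b"]))  # [0, 2]
```

`setdefault` creates the empty set only on a word's first occurrence, so the whole build is a single pass over the input.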

Answer 1 (score: 2):

What is causing the slowdown is this line:

l = {a for a, b in enumerate(wordLoc) if w in b}

Every time you find a word you haven't seen before, you re-enumerate every line to check whether it contains that word. Overall this contributes O(NumberOfUniqueWords * NumberOfLines) operations, which is quadratic in the input size.

You are already enumerating every word of every line. Why not just add them as you go?

for w in b:
    if w not in q: q[w] = []
    q[w].append(a)

This should take O(NumberOfWords) time, which is linear in the input size rather than quadratic(-ish). You touch every word once, rather than once per unique word.
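Put together, that one-pass fix might look like the following sketch (`lines` stands in for the file contents; the variable names `q`, `a`, `b`, `w` are kept from the snippet above):

```python
lines = ["This is some text", "More text", "whats for lunch"]

q = {}
for a, b in enumerate(lines):   # a = line number, b = line text
    for w in b.split():
        if w not in q:
            q[w] = []
        q[w].append(a)          # one append per word occurrence

print(q["text"])                # [0, 1]
```

Note that lists record a duplicate entry if a word repeats on the same line; swap in sets if that matters.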

Answer 2 (score: 1):

You can use collections.defaultdict:

from collections import defaultdict
dic = defaultdict(set)
with open('abc') as f:
    for i, line in enumerate(f):  # enumerate yields the line number as well as the line
        words = line.split()      # split the line using str.split()
        for word in words:        # iterate over words and add to its corresponding set
            dic[word.lower()].add(i)
print dic

Output:

defaultdict(<type 'set'>,
{'whats': set([2]),
 'for': set([2]),
 'this': set([0]),
 'text': set([0, 1]),
 'is': set([0]),
 'some': set([0]),
 'lunch': set([2]),
 'more': set([1])})
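One caveat with defaultdict: merely looking up a missing key inserts an empty set as a side effect, so read-only queries are safer through `.get()`. A small sketch (the key names are just examples):

```python
from collections import defaultdict

dic = defaultdict(set)
for i, line in enumerate(["This is some text", "More text", "whats for lunch"]):
    for word in line.split():
        dic[word.lower()].add(i)

dic["missing"]            # __getitem__ on a missing key inserts set() as a side effect
print("missing" in dic)   # True

print(dic.get("absent"))  # .get() does not insert; prints None
print("absent" in dic)    # False
```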

Answer 3 (score: 0):

This seems to work, and I believe it is faster than your version:

from time import time

def invidx(strlist):
    # time the algorithm's execution
    start = time()

    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))

    # benchmark
    print 'job completed in %.2fs' % (time() - start)

    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences

    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n'
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)

    return q

from pprint import pprint
pprint(invidx('strlist.txt'))

Output from a trivial test file:

job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}
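For what it's worth, the one-pass strategy from Answer 0 yields the same mapping as this per-unique-word loop; a quick sketch checking both on the sample lines (no file I/O, so the structure differs slightly from the function above):

```python
lines = ["This is some text", "More text", "whats for lunch"]

# One-pass build (Answer 0's strategy): O(NumberOfWords).
one_pass = {}
for n, line in enumerate(lines):
    for w in line.split():
        one_pass.setdefault(w, set()).add(n)

# Per-unique-word build (this answer's strategy): O(UniqueWords * Lines).
word_sets = [set(line.split()) for line in lines]
unique_words = set().union(*word_sets)
per_word = {w: {n for n, s in enumerate(word_sets) if w in s}
            for w in unique_words}

print(one_pass == per_word)   # True
```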