I have a text file where every line has a bunch of text (the real file has no line numbers), like this:
line#: text:
0      This is some text
1      More text
2      whats for lunch
I want a function that returns a dictionary mapping each word to its line numbers, essentially building an inverted index.
i.e. {'This': {0}, 'text': {0, 1}, 'for': {2}, ...}
After scanning the text file (which takes 0.18s), I put the lines into a list of lists, so that each position in the outer list stores a split line, i.e.:
[['This', 'is', 'some', 'text'], ['More', ...] ...]
After that I use enumerate() to extract the positions and build the dictionary. I already have a solution, but it is so ugly and took me so long that I wanted to see a more elegant one.
For reference, my algorithm runs in 882.28 seconds, i.e. 15 minutes, on 1099 lines and 753210 words. In other words, decidedly not pythonic.
import time

def invidx(strlist):
    # time the file scan
    start = time.time()
    f = open(strlist, 'r')
    wordLoc = []
    for line in f:
        s = line.split()
        wordLoc.append(list(s))
    f.close()
    # benchmark
    print 'job completed in %.2fs' % (time.time() - start)
    try:
        q = {}
        for a, b in enumerate(wordLoc):
            l = set()
            for w in b:
                if w not in q:
                    l = {a for a, b in enumerate(wordLoc) if w in b}
                    q[w] = l
    except KeyboardInterrupt:
        print 'Interrupt detected: aborting...'
        print 'Failed to complete indexing, ran for %.2fs' % \
            (time.time() - start)
        exit(0)
    return q
Edit:
Per request, the code is above. Be gentle.
Answer 0 (score: 3)
You can use enumerate while initially scanning the file to get the line numbers, and add each line number to a set in the dict as you go.
myfile.txt:
a b c
b x y
a c b
Indexing it:
index = {}
with open('myfile.txt') as F:
    for line_num, line in enumerate(F):
        for word in line.split():
            index.setdefault(word, set()).add(line_num)
index
=> {'a': set([0, 2]),
    'b': set([0, 1, 2]),
    'c': set([0, 2]),
    'x': set([1]),
    'y': set([1])}
Answer 1 (score: 2)
The thing causing the slowdown is this line:
l = {a for a, b in enumerate(wordLoc) if w in b}
Every time you encounter a word you haven't seen yet, you re-enumerate every line to see whether it contains that word. That contributes O(NumberOfUniqueWords * NumberOfLines) operations overall, which is quadratic in the input size.
You are already enumerating every word of every line anyway. Why not just add them as you go?
for w in b:
    if w not in q: q[w] = []
    q[w].append(a)
This should take O(NumberOfWords) time, which is linear in the input size rather than quadratic(ish): you touch each word once as it comes up, instead of rescanning every line for each unique word.
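For completeness, here is a minimal self-contained sketch of that linear pass (illustrative names, not the answerer's exact code); it assumes wordLoc is the question's list of split lines, and uses a set per word to match the question's desired output rather than the lists above:

# Sketch of the single linear pass; wordLoc is assumed to be the
# question's list of split lines, and sets replace the lists above
# to match the question's desired output.
def invidx_linear(wordLoc):
    q = {}
    for a, b in enumerate(wordLoc):   # a = line number, b = words on that line
        for w in b:
            if w not in q:
                q[w] = set()
            q[w].add(a)               # one O(1) update per word occurrence
    return q

lines = [['This', 'is', 'some', 'text'], ['More', 'text'], ['whats', 'for', 'lunch']]
print invidx_linear(lines)   # 'text' maps to set([0, 1]), every other word to one line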
Answer 2 (score: 1)
You can use collections.defaultdict:
from collections import defaultdict

dic = defaultdict(set)
with open('abc') as f:
    for i, line in enumerate(f):  # enumerate yields the line number as well as the line
        words = line.split()      # split the line using str.split()
        for word in words:        # iterate over the words and add each to its corresponding set
            dic[word.lower()].add(i)
print dic
Output:
defaultdict(<type 'set'>,
            {'whats': set([2]),
             'for': set([2]),
             'this': set([0]),
             'text': set([0, 1]),
             'is': set([0]),
             'some': set([0]),
             'lunch': set([2]),
             'more': set([1])})
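As a usage sketch (not part of the answer), the finished index answers multi-word queries by intersecting the per-word sets; lines_with_all below is a hypothetical helper:

# Hypothetical helper: return the lines containing all of the given words.
# Membership is checked before subscripting, because indexing a
# defaultdict silently inserts missing keys.
def lines_with_all(dic, words):
    sets = [dic[w.lower()] for w in words if w.lower() in dic]
    if not sets or len(sets) < len(words):
        return set()   # no words given, or some word never occurs
    return set.intersection(*sets)

print lines_with_all(dic, ['some', 'text'])   # -> set([0]) for the sample file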
Answer 3 (score: 0)
This seems to work, and I believe it is faster than your version:
from time import time

def invidx(strlist):
    # time the file scan
    start = time()
    wordLocs = []
    unique_words = set()
    with open(strlist, 'r') as f:
        for line in f:
            words = line.split()
            unique_words.update(words)
            wordLocs.append(set(words))
    # benchmark
    print 'job completed in %.2fs' % (time() - start)
    try:
        q = {}
        for unique_word in unique_words:
            occurrences = set()
            for line, words in enumerate(wordLocs):
                if unique_word in words:
                    occurrences.add(line)
            q[unique_word] = occurrences
    except KeyboardInterrupt:
        print ('Interrupt detected: aborting...\n'
               'Failed to complete indexing, ran for %.2fs' % (time() - start))
        exit(0)
    return q

from pprint import pprint
pprint(invidx('strlist.txt'))
Output from a trivial test file:
job completed in 0.00s
{'More': set([1]),
 'This': set([0]),
 'for': set([2]),
 'is': set([0]),
 'lunch': set([2]),
 'some': set([0]),
 'text': set([0, 1]),
 'whats': set([2])}