I have a file in which the lines are sorted alphabetically. The file is 12 GB, which means I can't simply read it line by line. The data looks like this:
brown 0 1 0 1 2
fox 3 5 0 0 1
jumped 2 0 6 1 0
The word at the start of each line is unique. The word and the numbers on each line are separated by tabs. I want to be able to query the file for a specific keyword. For example, if I query "fox", the program should return "fox 3 5 0 0 1".
A good candidate for this seems to be the bisect module: https://docs.python.org/3.0/library/bisect.html
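To illustrate the idea on a small in-memory list first (my own sketch, not from the linked post; the `lookup` helper is hypothetical):

```python
import bisect

# Toy in-memory version of the lookup I want: lines sorted by
# their first (tab-separated) field.
lines = ["brown\t0 1 0 1 2", "fox\t3 5 0 0 1", "jumped\t2 0 6 1 0"]
keys = [line.split("\t", 1)[0] for line in lines]

def lookup(word):
    # bisect_left finds the leftmost insertion point for `word`;
    # if the key at that index matches, we have our line.
    i = bisect.bisect_left(keys, word)
    if i < len(keys) and keys[i] == word:
        return lines[i]
    return None

hit = lookup("fox")
miss = lookup("cat")
print(hit)  # fox	3 5 0 0 1
```

The hard part, of course, is doing the same thing when the "list" is a 12 GB file that cannot be loaded into memory.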
I found a post that uses bisect to find the line number of a keyword: How do I perform binary search on a text file to search a keyword in python?
This is what the code looks like:
import bisect
import os

class Query(object):
    def __init__(self, query, index=5):
        self.query = query
        self.index = index

    def __lt__(self, comparable):
        return self.query < comparable[self.index:]

class FileSearcher(object):
    def __init__(self, file_pointer, record_size=35):
        self.file_pointer = file_pointer
        self.file_pointer.seek(0, os.SEEK_END)
        self.record_size = record_size + len(os.linesep)
        self.num_bytes = self.file_pointer.tell()
        self.file_size = (self.num_bytes // self.record_size)

    def __len__(self):
        return self.file_size

    def __getitem__(self, item):
        self.file_pointer.seek(item * self.record_size)
        return self.file_pointer.read(self.record_size)

with open('myfile') as file_to_search:
    query = 'fox\t'  # token to query
    wrapped_query = Query(query)
    searchable_file = FileSearcher(file_to_search)
    linepos = bisect.bisect(searchable_file, wrapped_query)
    print("Located @ line: ", linepos)
    # print content of line?
However, I can't figure out how to actually print the content of that line. I should at least add a read call somewhere, but I don't know where.
Is it possible to print the content of the line using the bisect module?
Answer 0 (score: 0)
If you want a pure-Python solution, you can do the following: read the file in blocks of MAX_LINE bytes, stepping forward by a fixed offset each time. Here is a sample file, bigfile:
abc 4
bar 2
baz 3
egg 6
foo 1
god 8
ham 5
sex 7
Code:
import os
from bisect import bisect

MAX_LINE = 7     # longest line in the file, including the newline
BLOCK_SIZE = 10  # distance between indexed offsets

def parse_chunks(filename):
    size = os.path.getsize(filename)
    chunks = []
    with open(filename, 'rb') as file:
        # Record the key of the very first line.
        block = file.read(MAX_LINE * 2).decode()
        first_line = block[:block.find('\n') + 1]
        chunks.append(first_line.split()[0])

        # For every later offset, record the key of the first
        # complete line that begins inside that block.
        pos = BLOCK_SIZE
        while pos < size:
            file.seek(pos)
            block = file.read(MAX_LINE * 2).decode()
            first_eol = block.find('\n')
            second_eol = block.find('\n', first_eol + 1)
            if first_eol == -1 or second_eol == -1:
                break
            line = block[first_eol + 1:second_eol]
            key = line.split()[0]
            chunks.append(key)
            pos += BLOCK_SIZE
    return chunks

if __name__ == '__main__':
    filename = 'bigfile'
    chunks = parse_chunks(filename)

    query = 'abc'
    pos_before = bisect(chunks, query) - 1
    with open(filename, 'rb') as file:
        file.seek(pos_before * BLOCK_SIZE)
        block = file.read(BLOCK_SIZE + MAX_LINE).decode()
    line_start = block.find(query)
    line_end = block.find('\n', line_start + 1)
    line = block[line_start:line_end]
    print(line)
In this toy example I use a 10-byte block size; for your 12 GB file I would suggest starting with something around 1 MB.
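For completeness, here is a self-contained variant of the same approach (the helper names build_index and find_record are mine, and build_index mirrors the parse_chunks logic above) that writes the sample file to a temporary location and runs a couple of queries end to end:

```python
import os
import tempfile
from bisect import bisect

MAX_LINE = 7     # longest line in the sample, including the newline
BLOCK_SIZE = 10  # toy value; something around 1 MB suits a 12 GB file

def build_index(filename):
    # Mirrors parse_chunks: record the first complete key found
    # at or after each BLOCK_SIZE offset.
    size = os.path.getsize(filename)
    with open(filename, 'rb') as f:
        block = f.read(MAX_LINE * 2).decode()
        chunks = [block[:block.find('\n')].split()[0]]
        pos = BLOCK_SIZE
        while pos < size:
            f.seek(pos)
            block = f.read(MAX_LINE * 2).decode()
            first_eol = block.find('\n')
            second_eol = block.find('\n', first_eol + 1)
            if first_eol == -1 or second_eol == -1:
                break
            chunks.append(block[first_eol + 1:second_eol].split()[0])
            pos += BLOCK_SIZE
    return chunks

def find_record(filename, chunks, query):
    # The block whose first key precedes (or equals) the query must
    # contain the start of the matching line, if one exists.
    # (Assumes the key does not occur as a substring earlier in the block.)
    pos_before = max(bisect(chunks, query) - 1, 0)
    with open(filename, 'rb') as f:
        f.seek(pos_before * BLOCK_SIZE)
        block = f.read(BLOCK_SIZE + MAX_LINE).decode()
    start = block.find(query)
    if start == -1:
        return None
    end = block.find('\n', start + 1)
    return block[start:end]

# Demo on the answer's sample file.
data = "abc 4\nbar 2\nbaz 3\negg 6\nfoo 1\ngod 8\nham 5\nsex 7\n"
with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write(data)
chunks = build_index(tmp.name)
hit1 = find_record(tmp.name, chunks, 'god')
hit2 = find_record(tmp.name, chunks, 'abc')
print(hit1)  # god 8
os.unlink(tmp.name)
```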
Answer 1 (score: 0)
The following recursive function should be able to narrow the search interval. I'm not sure whether you can modify it so that it returns a match, or None when there is no match.
def bisearch(f, word, i, j):
    # Stop recursing once the interval is small enough.
    if (j - i) < 1e6:
        return i, j
    k = (i + j) // 2
    f.seek(k)
    # Advance to the start of the next full line.
    while k < j:
        c = f.read(1)
        k = k + 1
        if c == '\n':
            break
    else:
        # ??? no match ??? I'm not sure
        pass
    # Read the first (tab-terminated) word of that line.
    w = []
    while 1:
        c = f.read(1)
        if c == '\t':
            break
        w.append(c)
    w = "".join(w)
    if w == word:
        return k, k
    if w < word:
        return bisearch(f, word, k, j)
    else:
        return bisearch(f, word, i, k)
Here is a usage example:
word = ...
f = open(...)
i, j = bisearch(f, word, 0, len_f)
f.seek(i)
if i == j:
    line = f.readline()
else:
    #################### EDIT ################
    # OLD
    # buffer = f.read(1E6)
    # NEW
    buffer = f.read(j - i)
    lenw = len(word)
    for line in buffer.split('\n'):
        if line[:lenw] == word:
            break
    else:
        # no matches, SOS
        pass
result = process(line)
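The same interval-narrowing idea can also be written as an iterative binary search over byte offsets. Here is my own self-contained sketch (not the answer author's code; the function name bisect_file is mine), assuming tab-separated keys and a trailing newline on every line:

```python
import os
import tempfile

def bisect_file(path, word):
    # Binary search over byte offsets: seek to the middle, skip the
    # (possibly partial) current line, and compare the next line's key.
    word_b = word.encode()
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        lo, hi = 0, f.tell()

        def line_after(pos):
            # First complete line that begins at or after `pos`.
            f.seek(pos)
            if pos > 0:
                f.readline()
            return f.readline()

        # Find the smallest offset whose following line has key >= word.
        while lo < hi:
            mid = (lo + hi) // 2
            line = line_after(mid)
            # An empty read means we are past the last line.
            if not line or line.split(b'\t')[0] >= word_b:
                hi = mid
            else:
                lo = mid + 1
        line = line_after(lo)
        if line and line.split(b'\t')[0] == word_b:
            return line.decode().rstrip('\n')
    return None

# Demo on the question's sample data.
data = "brown\t0 1 0 1 2\nfox\t3 5 0 0 1\njumped\t2 0 6 1 0\n"
with tempfile.NamedTemporaryFile('w', delete=False) as t:
    t.write(data)
hit = bisect_file(t.name, 'fox')
miss = bisect_file(t.name, 'zebra')
print(hit)  # the full 'fox' line
os.unlink(t.name)
```

Unlike the recursive version, this needs no special handling for small intervals: the loop narrows all the way down to a single line, seeking O(log n) times for a file of n bytes.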
Answer 2 (score: -1)
Use seek to move to the position, then readline to read the relevant line.
print("Located @ line: ", linepos)
file_to_search.seek(linepos)
line = file_to_search.readline()
This assumes that linepos is the position of the line measured in bytes from the start of the file. If it is a position measured in line numbers, you need to multiply by the number of bytes per line before seeking:
print("Located @ line: ", linepos)
file_to_search.seek(linepos * searchable_file.record_size)
line = file_to_search.readline()
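To tie this back to the question's FileSearcher, here is a runnable sketch under the fixed-record assumption. The padding scheme and the class name FixedRecords are mine; the question's record_size arithmetic only works if every line really occupies the same number of bytes:

```python
import os
import tempfile
from bisect import bisect_left

RECORD = 20  # fixed bytes per line, including the trailing newline

# Build a sample file whose lines are padded to a fixed record size.
rows = ['brown\t0 1 0 1 2', 'fox\t3 5 0 0 1', 'jumped\t2 0 6 1 0']
with tempfile.NamedTemporaryFile('w', delete=False, newline='') as t:
    for row in rows:
        t.write(row.ljust(RECORD - 1) + '\n')

class FixedRecords:
    # Sequence view over the file: item i is the key of record i,
    # so bisect can search it like a sorted list.
    def __init__(self, path):
        self.f = open(path, 'rb')
        self.n = os.path.getsize(path) // RECORD

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self.f.seek(i * RECORD)
        return self.f.read(RECORD).split(b'\t')[0].decode()

recs = FixedRecords(t.name)
linepos = bisect_left(recs, 'fox')
recs.f.seek(linepos * RECORD)  # seek in bytes, not line numbers
line = recs.f.readline().decode().rstrip()
print(line)  # the full 'fox' record, padding stripped
recs.f.close()
os.unlink(t.name)
```

Because __getitem__ returns only the key, bisect_left can be compared directly against the plain query string, and no Query wrapper class is needed.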