Finding a string in a large file that does not fit in memory

Time: 2013-07-23 05:21:07

Tags: python bigdata

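I was asked to find the number of occurrences of the string "And" in a 10 GB file using only 1 GB of RAM. How can I do this efficiently? I answered that we should read the file in chunks of 100 MB each, count the occurrences of "And" in each chunk, and keep a running total. The interviewer was not satisfied with my answer. He explained how the grep command works in Unix and asked me to write similar code in Python, but I did not know the answer. I would appreciate an answer to this question.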

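2 answers:

Answer 0 (score: 5)

Iterate over the file, which yields one line at a time. This case is easy because the search string does not contain an end-of-line character, so we do not need to worry about matches that span lines.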

with open("file.txt") as fin:
    print(sum(line.count('And') for line in fin))

This uses str.count on each line:

>>> help(str.count)
Help on method_descriptor:

count(...)
    S.count(sub[, start[, end]]) -> int

    Return the number of non-overlapping occurrences of substring sub in
    string S[start:end].  Optional arguments start and end are interpreted
    as in slice notation.
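
The chunked approach described in the question can also be made to work, provided matches that straddle a chunk boundary are not lost. Below is a minimal sketch, not a definitive implementation. The function name `count_in_chunks` and its parameters are hypothetical, and it assumes the file is read as bytes and that the search string cannot overlap with itself (true for "And"; for something like "aa" the total could differ from a single str.count over the whole file).

```python
def count_in_chunks(path, needle=b"And", chunk_size=100 * 1024 * 1024):
    """Count occurrences of needle in a file, reading chunk_size bytes at a time.

    Hypothetical helper: assumes needle is non-empty and cannot overlap itself.
    """
    if not needle:
        raise ValueError("needle must not be empty")
    # Carry the last len(needle) - 1 bytes into the next read so that a match
    # split across two chunks is still seen. The carried tail is shorter than
    # the needle, so it can never contain a match that was already counted.
    overlap = len(needle) - 1
    total = 0
    tail = b""
    with open(path, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            buf = tail + chunk
            total += buf.count(needle)
            tail = buf[-overlap:] if overlap else b""
    return total
```

With the default chunk size, peak memory stays at roughly a couple of hundred megabytes regardless of file size, and the result does not depend on where line breaks fall.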

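Answer 1 (score: 4)

If you use generators, you can work through a large file and process it lazily, without loading it all into memory.

A simple grep command: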

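def command(f):
    # Decorator: turns a per-line generator function f into a command
    # that reads the given files and prints every line f yields.
    def g(filenames, **kwa):
        lines = readfiles(filenames)
        # Lazily flatten the output of f over all input lines.
        lines = (outline for line in lines for outline in f(line, **kwa))
        # lines = (line for line in lines if line is not None)
        printlines(lines)
    return g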

def readfiles(filenames):
    # Yield lines one at a time so only one line is held in memory.
    for f in filenames:
        with open(f) as fin:
            for line in fin:
                yield line


def printlines(lines):
    for line in lines:
        print(line.rstrip("\n"))

@command
def grep(line, pattern):
    if pattern in line:
        yield line


if __name__ == '__main__':
    import sys
    pattern = sys.argv[1]
    filenames = sys.argv[2:]
    grep(filenames, pattern=pattern)
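
The script above prints matching lines, whereas the question asks for a count. As a rough sketch of how the same generator idea could produce one, the following chains a line generator into a sum. The names `iter_lines` and `count_occurrences` are hypothetical and not part of the code above, and the sketch assumes the pattern contains no newline.

```python
def iter_lines(filenames):
    """Lazily yield every line from each file in turn."""
    for name in filenames:
        with open(name) as fin:
            for line in fin:
                yield line


def count_occurrences(filenames, pattern):
    """Sum the non-overlapping occurrences of pattern over all lines."""
    return sum(line.count(pattern) for line in iter_lines(filenames))


if __name__ == '__main__':
    import sys
    # Usage (assumed): python count.py And file1.txt file2.txt
    print(count_occurrences(sys.argv[2:], sys.argv[1]))
```

Because both stages are generators, only the current line is held in memory at any time.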