Question

I have a text file that looks like this (nearly 1,500,000 lines, each of varying length, roughly 5-120 words):

This is a foo bar sentence.
What are you sure a foo bar? or a foo blah blah.
blah blah foo sheep have you any bar?
...

I want to search for the lines that contain a phrase (up to 10,000 lines), let's say foo bar. So in Python, I wrote this:

import os
cmd = 'grep -m 10000 "'+frag+'" '+deuroparl + " > grep.tmp"
os.system(cmd)
results = [i for i in open('grep.tmp','r').readlines()]

What is the "proper" way to do this without cheating with grep? Will it be faster than grep (see How does grep run so fast?)? Is there an even faster way?

Answer 1

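import re

matches = 0
with open('bla.txt') as infile:
  for line in infile:
    if re.search('foo bar', line):
      print(line, end='')  # line already ends with a newline
      matches += 1
      if matches >= 10000:  # stop after 10,000 matches, like grep -m 10000
        break

I don't think this will be any faster than grep: grep is optimized for exactly this task, whereas Python is a Swiss Army knife.

If you want to read from stdin, drop the with line (and dedent the loop) and iterate over sys.stdin instead of infile.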
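
As a rough illustration of the stdin variant, the loop can be wrapped in a function that accepts any iterable of lines, so that either sys.stdin or an open file can be passed in. The name grep_stream and its max_matches parameter are made up for this sketch and are not part of the answer; io.StringIO stands in for stdin so the snippet runs on its own (Python 3).

```python
import io
import re

def grep_stream(stream, pattern, max_matches=10000):
    """Collect up to max_matches lines from stream that contain pattern."""
    regex = re.compile(pattern)
    found = []
    for line in stream:
        if regex.search(line):
            found.append(line.rstrip("\n"))
            if len(found) >= max_matches:
                break
    return found

# io.StringIO is a stand-in for sys.stdin here (an assumption for the demo)
fake_stdin = io.StringIO("This is a foo bar sentence.\nnothing here\nblah foo bar\n")
for hit in grep_stream(fake_stdin, "foo bar"):
    print(hit)
```

In a real script you would import sys and pass sys.stdin in place of fake_stdin, then feed the file in through shell redirection.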

Answer 2

You can use a generator function to minimize memory usage:

import re

def matcher(filename, pattern, maxmatches):
    """Yield up to maxmatches stripped lines that contain pattern."""
    matches = 0
    pattern = re.compile(pattern)
    with open(filename) as fp:
        for line in fp:
            # search() scans the whole line; match() would only test its start
            if pattern.search(line):
                matches += 1
                if matches > maxmatches:
                    break
                yield line.strip()

for line in matcher('whatever.txt', 'foo bar', 10000):
    print(line)
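
One detail worth checking in any regex-based version is which method of the compiled pattern gets called. In Python's re module, match() only succeeds when the pattern matches at the start of the string, while search() finds it anywhere in the line, which is how grep behaves. A small sketch using the first sample sentence from the question (the variable names are only illustrative):

```python
import re

pattern = re.compile("foo bar")
line = "This is a foo bar sentence."

anchored = pattern.match(line)    # None: the line does not start with "foo bar"
anywhere = pattern.search(line)   # a match object: "foo bar" occurs inside the line

print(anchored is None)
print(anywhere.group())
```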

Answer 3

To generalize slightly, the itertools module has very useful tools for building memory-efficient, pipeline-style processing streams:

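from itertools import islice

def grepper(numbered_line):
  # enumerate() yields (lineno, line) tuples, so unpack the single argument
  lineno, line = numbered_line
  return "foo bar" in line

# filter() is lazy in Python 3 (it replaces Python 2's itertools.ifilter);
# islice() caps the stream at the first 10,000 matches.
result = islice(filter(grepper, enumerate(open("yourfile.txt"))), 10000)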
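
An iterator pipeline like this does no work until something consumes it. As a hedged sketch of how the stages might be chained and consumed, the hypothetical helper below (grep_numbered is a name invented here, not part of the answer) numbers the lines, keeps the matching ones, caps the count with itertools.islice, and formats each hit in the style of grep -n. It runs on an in-memory list built from the question's sample lines (Python 3).

```python
from itertools import islice

def grep_numbered(lines, needle, max_matches=10000):
    """Lazily yield 'lineno:text' for each line containing needle."""
    numbered = enumerate(lines, start=1)                      # stage 1: number the lines
    hits = (pair for pair in numbered if needle in pair[1])   # stage 2: keep matches
    capped = islice(hits, max_matches)                        # stage 3: cap the count
    for lineno, line in capped:
        yield "%d:%s" % (lineno, line.rstrip("\n"))

sample = [
    "This is a foo bar sentence.\n",
    "blah blah foo sheep have you any bar?\n",
    "What are you sure a foo bar? or a foo blah blah.\n",
]
for hit in grep_numbered(sample, "foo bar"):
    print(hit)
```

Passing an open file object instead of sample should work the same way, since file objects iterate line by line.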

Answer 4

If you are only searching for a specific piece of text (i.e. not a regular expression, as the title indicates), then:

with open("fileName", "r") as fileHandle:
    result = [line.strip() for line in fileHandle if "yourWord" in line]
    # Or use a generator, as shown above, to avoid building the whole list
print(result)

What is the Pythonic way to do os.system('grep "word" file.txt')?
