Question

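I am working with large text files (about 10 MB gzipped). There are always two files that belong together; both have the same length and structure: 4 lines per data set.

I need to process line 2 of every 4-line block from both files simultaneously.

My question: what is the most time-efficient way to do this?

This is what I am doing right now: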

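import gzip
import itertools

def read_groupwise(iterable, n, fillvalue=None):
    # n references to the same iterator, so each tuple holds n consecutive lines
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)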

f1 = gzip.open(file1,"r")
f2 = gzip.open(file2,"r")
for (fline1,fline2,fline3,fline4), (rline1, rline2, rline3, rline4) in zip(read_groupwise(f1, 4), read_groupwise(f2, 4)):
    # process fline2, rline2

But since I only need line 2 of each block, I guess there might be a more efficient way?

Answer 1

This can be done by building your own generator:

def get_nth(iterable, n, after=1):
    if after > 1:
        consume(iterable, after-1)
    while True:
        yield next(iterable)
        consume(iterable, n-1)

with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
    every = (4, 2)
    for line_f1, line_f2 in zip(get_nth(f1, *every), get_nth(f2, *every)):
        ...

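The generator first advances to the first item to be yielded (here we want the second item, so we skip one item to position the iterator just before it), then yields a value, then skips ahead so it sits just before the next wanted item. It is a pretty simple way to accomplish the task at hand.

This uses consume() from the itertools recipes: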

import collections
from itertools import islice

def consume(iterator, n):
    "Advance the iterator n-steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(islice(iterator, n, n), None)

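As a final note, I'm not sure whether gzip.open() provides a context manager; if it doesn't, you will want to use contextlib.closing(). (Files returned by gzip.open() support the with statement as of Python 2.7, so contextlib.closing() is only needed on older versions.)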
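
For readers on Python 3, here is a minimal sketch of the same generator idea, not the answer's original code. On Python 3.7 and later, a StopIteration raised by next() inside a generator body is turned into a RuntimeError (PEP 479), so a `while True: yield next(...)` loop needs adjusting there; this variant uses a for loop instead. The lists `forward` and `reverse` are hypothetical stand-ins for the two decompressed files.

```python
from itertools import islice


def consume(iterator, n):
    """Advance the iterator n steps ahead (islice-based itertools recipe)."""
    next(islice(iterator, n, n), None)


def get_nth(iterable, n, after=1):
    """Yield item number `after` (1-based) of every group of n items."""
    iterator = iter(iterable)      # accept lists as well as file objects
    consume(iterator, after - 1)   # skip up to the first wanted item
    for item in iterator:          # a for loop ends cleanly on exhaustion
        yield item
        consume(iterator, n - 1)   # skip ahead to the next wanted item


# Hypothetical stand-ins for two files holding two 4-line records each.
forward = ["f1-a", "f1-b", "f1-c", "f1-d", "f2-a", "f2-b", "f2-c", "f2-d"]
reverse = ["r1-a", "r1-b", "r1-c", "r1-d", "r2-a", "r2-b", "r2-c", "r2-d"]

pairs = list(zip(get_nth(forward, 4, 2), get_nth(reverse, 4, 2)))
print(pairs)
```

With real data, the two lists would be replaced by the file objects returned by gzip.open(); on Python 3, opening in "rt" mode gives str lines rather than bytes.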

Answer 2

I would suggest using itertools.izip_longest to zip the contents of both files right away, and itertools.islice to select every fourth element starting from line 2:

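>>> import gzip
>>> from itertools import islice, izip_longest
>>> def get_nth(iterables, n, after=1, fillvalue=""):
...     # zip the files line by line, then keep every n-th tuple,
...     # starting at 0-based index `after` (index 1 is the second line)
...     return islice(izip_longest(*iterables, fillvalue=fillvalue), after, None, n)
...
>>> with gzip.open(file1, "r") as f1, gzip.open(file2, "r") as f2:
...     for line in get_nth([f1, f2], n=4):
...         print map(str.strip, line)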
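
As a rough illustration of the zip-then-slice idea on Python 3 (where izip_longest is spelled zip_longest), the sketch below runs on in-memory data. The helper name `second_lines`, its `group` and `offset` parameters, and the sample contents are assumptions made for the example, not part of the answer.

```python
import io
from itertools import islice, zip_longest


def second_lines(files, group=4, offset=1, fillvalue=""):
    """Zip the files line by line, then keep the tuple at `offset` in each group."""
    zipped = zip_longest(*files, fillvalue=fillvalue)
    # islice(start=offset, step=group): 0-based offset 1 is the second line
    return islice(zipped, offset, None, group)


# Hypothetical in-memory stand-ins for the two decompressed files.
f1 = io.StringIO("a1\na2\na3\na4\na5\na6\na7\na8\n")
f2 = io.StringIO("b1\nb2\nb3\nb4\nb5\nb6\nb7\nb8\n")

result = [tuple(s.strip() for s in pair) for pair in second_lines([f1, f2])]
print(result)
```

Note that this still builds a tuple for every line pair, including the three in each block that get discarded, so it trades a little extra work for very compact code.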

Answer 3

If you have the memory to spare, then try:

# index 1 is the second line of the first block; step 4 moves block to block
ln1 = f1.readlines()[1::4]
ln2 = f2.readlines()[1::4]
for fline, rline in zip(ln1, ln2):
    ...

But only if you have the memory.
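
To check the slice indices end to end, here is a small Python 3 sketch that writes two tiny gzip files into a temporary directory and reads them back with readlines(). The file names, the line contents, and the helper `second_lines_in_memory` are hypothetical and exist only for this example.

```python
import gzip
import os
import tempfile


def second_lines_in_memory(path):
    """Read the whole gzip file, then keep line 2 of every 4-line block."""
    with gzip.open(path, "rt") as handle:   # "rt" yields str lines on Python 3
        return handle.readlines()[1::4]     # index 1 is the second line


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, prefix in (("a.gz", "a"), ("b.gz", "b")):
        path = os.path.join(tmp, name)
        with gzip.open(path, "wt") as out:
            # two blocks of four lines each
            for block in (1, 2):
                for line in (1, 2, 3, 4):
                    out.write("%s%d-line%d\n" % (prefix, block, line))
        paths.append(path)
    pairs = list(zip(second_lines_in_memory(paths[0]),
                     second_lines_in_memory(paths[1])))

print(pairs)
```

Each file is decompressed fully into a list before slicing, which is why this approach only suits inputs that fit comfortably in memory.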

Reading every 4th line from 2 files simultaneously
