Question

I have a file like this (foo.txt), grouped by column 0:

1  foo     bar
1  lorem   ipsum   gypsum
1  baba    loo     too
2  hello   goodbye seeya
3  kobe    magic   wilt
3  foo     sneaks  bar
3  more    stuff
3  last    line    in      file

How can I iterate over the file in chunks keyed on line.split()[0]? I know a generator can do this, but I'm not entirely sure how. Basically, I want to do this:

def first_column_grouping(file):
    yield some_list ## How?

with open("foo.txt") as file:
    for group in first_column_grouping(file): ## 3 values
        print group

Expected output:

["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]
["2 hello goodbye seeya"]
["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]

Answer 1

So what you actually want is the functionality that itertools.groupby provides. This works as long as your first column is sorted:

>>> import io
>>> from itertools import groupby
>>> from operator import itemgetter
>>> # s is assumed to hold the contents of foo.txt as a string
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(g))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>

If you want to clean up that output a bit, you can map str.strip over each group:

>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(map(str.strip, g)))
...
['1  foo     bar', '1  lorem   ipsum   gypsum', '1  baba    loo     too']
['2  hello   goodbye seeya']
['3  kobe    magic   wilt', '3  foo     sneaks  bar', '3  more    stuff', '3  last    line    in      file']

If you want to implement this from scratch, an inflexible and naive generator might look like this:

>>> def groupby_first_column(f):
...     line = next(f)
...     k = line[0]
...     group = [line]
...     for line in f:
...         if line[0] == k:
...             group.append(line)
...         else:
...             yield group
...             group = [line]
...             k = line[0]
...     yield group
...
>>> with io.StringIO(s) as f:
...     for group in groupby_first_column(f):
...         print(list(group))
...
['1  foo     bar\n', '1  lorem   ipsum   gypsum\n', '1  baba    loo     too\n']
['2  hello   goodbye seeya\n']
['3  kobe    magic   wilt\n', '3  foo     sneaks  bar\n', '3  more    stuff\n', '3  last    line    in      file']
>>>

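Warning: the generator above only works if the first column of every line starts at the very first position and is exactly one character long. It is not meant to be very useful, only to illustrate the idea. If you want to roll your own, you will have to be much more thorough.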
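As a sketch of what "more thorough" could mean here, the version below keys on the first whitespace-separated field instead of the first character, skips blank lines, and returns cleanly on empty input. The name iter_groups, the sample data, and the decision to skip blank lines are assumptions made for illustration, not part of the original answer.

```python
import io


def iter_groups(lines):
    """Yield lists of consecutive lines that share the same first field."""
    group = []
    current_key = None
    for line in lines:
        fields = line.split()
        if not fields:
            # Assumption: blank lines carry no key, so they are skipped.
            continue
        key = fields[0]
        if group and key != current_key:
            # The key changed: hand back the finished group, start a new one.
            yield group
            group = []
        current_key = key
        group.append(line.rstrip("\n"))
    if group:
        # Emit the last group; nothing is yielded for empty input.
        yield group


# Made-up sample with a two-character key, a blank line, and leading spaces.
sample = io.StringIO("10 foo bar\n10 lorem ipsum\n\n2 hello\n  3 kobe\n3 magic\n")
groups = list(iter_groups(sample))
for g in groups:
    print(g)
```

Because the key comes from line.split(), leading whitespace and multi-character keys such as 10 no longer break the grouping.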

Answer 2

Here is a variant (fake_file here simply stands in for the file from your with statement):

from io import StringIO

fake_file = StringIO('''1  foo     bar
1  lorem   ipsum   gypsum
1  baba    loo     too
2  hello   goodbye seeya
3  kobe    magic   wilt
3  foo     sneaks  bar
3  more    stuff
3  last    line    in      file''')


def iter_cols(file):

    lne = next(file).strip()
    buffer = [lne]
    last_number = lne.split()[0]

    for line in file:
        lne = line.strip()
        number = lne.split()[0]
        if number != last_number:
            yield buffer
            buffer = [lne]
            last_number = number
        else:
            buffer.append(lne)
    yield buffer

for cols in iter_cols(fake_file):
    print(cols)

This iterates over the file without holding the whole file in memory. As a consequence, only adjacent lines are grouped together.
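To make the "only adjacent lines" point concrete, here is a small sketch using itertools.groupby, which has the same adjacency behaviour, on made-up data where key 1 appears in two separate runs. The sample lines and the first_field helper are assumptions for illustration.

```python
from itertools import groupby

# Made-up data: key "1" shows up again after key "2".
lines = ["1 foo", "1 bar", "2 hello", "1 again"]


def first_field(line):
    """Return the first whitespace-separated column of a line."""
    return line.split()[0]


# Streaming: a key that reappears later starts a new group.
adjacent = [(k, list(g)) for k, g in groupby(lines, key=first_field)]
print(adjacent)

# If the data fits in memory, sorting first merges the runs.
# sorted() is stable, so lines keep their relative order within a key.
ordered = sorted(lines, key=first_field)
merged = [(k, list(g)) for k, g in groupby(ordered, key=first_field)]
print(merged)
```

The streaming pass produces three groups, while the sorted pass produces two. The price of sorting is that all lines have to be in memory at once.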

(You seem to be using Python 2: file is not a good variable name there, because it is a built-in.)

Answer 3

This is what itertools.groupby is for, but I think you need to read the whole file into memory to do it.

import itertools

with open("path/to/file") as f:
    data = f.readlines()  # a list of the lines of the file

groups = itertools.groupby(data, key=lambda line: line.split()[0])
# group on the first column of each line. This produces something like:
# [ ("1", ["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]),
#   ("2", ["2 hello goodbye seeya"]),
#   ("3", ["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]) ]

# since you only want the values there, just pull them out of the tuples.
# Each group is a one-shot iterator, so turn it into a list right away.
result = [list(v) for k, v in groups]

But honestly, I'm not sure whether groupby consumes all the data at once. If it is a lazy iterator, you can pass f directly.

import itertools

with open('path/to/file') as f:
    groups = itertools.groupby(f, key=lambda line: line.split()[0])
    for _, group in groups:
        result = list(group)
        # use this result however you like, but...
    # be sure not to leave this block until you've consumed all of
    # result, or you won't be able to read any more of the file.
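On the laziness question: itertools.groupby does pull items from the underlying iterator only as groups are consumed, so passing the open file directly is fine. A small sketch to check this, where the traced generator and the sample lines are made up for illustration:

```python
from itertools import groupby

consumed = []  # records every line the source has handed out so far


def traced(lines):
    """Yield lines one at a time, recording each one as it is pulled."""
    for line in lines:
        consumed.append(line)
        yield line


source = traced(["1 foo", "1 bar", "2 hello", "3 kobe"])
groups = groupby(source, key=lambda line: line.split()[0])

# Creating the groupby object has not read anything yet.
consumed_at_creation = list(consumed)
print(consumed_at_creation)

# Consuming the first group reads only as far as the first line
# whose key differs; the rest of the source stays untouched.
key, group = next(groups)
first_group = list(group)
print(key, first_group)
print(consumed)
```

After the first group has been listed, the line "3 kobe" has still not been read from the source, which is why the file has to stay open until every group has been consumed.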

If you can't or don't want to read the file into memory all at once, you'll need to do something special.

def group_by_col(filename, key=None):
    if key is None:
        key = lambda s: s
    with open(filename) as f:
        cur_group = []
        grouper = object()  # sentinel that never equals a real key
        for line in f:
            new_grouper = key(line)
            if new_grouper != grouper:
                if cur_group:
                    yield cur_group
                cur_group = [line.rstrip()]
                grouper = new_grouper
            else:
                cur_group.append(line.rstrip())
        if cur_group:  # nothing to yield for an empty file
            yield cur_group

In this case you have to pass a key function that selects the first whitespace-separated column of each line, for example lambda s: s.split()[0]

for group in group_by_col('path/to/file', key=lambda s: s.split()[0]):
    print(group)

Answer 4

This is based on the accepted answer and groups by any specified column:

def group_by_column(f, column):
    line = next(f)
    k = line.split()[column]
    group = [line]
    for line in f:
        if line.split()[column] == k:
            group.append(line)
        else:
            yield group
            group = [line]
            k = line.split()[column]
    yield group


if __name__ == "__main__":

    foo = "foo.txt"
    with open(foo) as foofile:
        for group in group_by_column(foofile, 0):
            print(group)
