I have a file like this (foo.txt), where the lines are grouped by column 0:
1 foo bar
1 lorem ipsum gypsum
1 baba loo too
2 hello goodbye seeya
3 kobe magic wilt
3 foo sneaks bar
3 more stuff
3 last line in file
How can I iterate over the file in chunks keyed on line.split()[0]? I know a generator can do this, but I'm not entirely sure how. Basically, I want to do this:
def first_column_grouping(file):
    yield some_list ## How?

with open("foo.txt") as file:
    for group in first_column_grouping(file): ## 3 values
        print group
Expected output:
["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]
["2 hello goodbye seeya"]
["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]
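Answer 0 (score: 2)
So what you actually need is the functionality provided by itertools.groupby. This works if your first column is sorted, so that lines with equal keys are adjacent:
>>> import io
>>> from itertools import groupby
>>> from operator import itemgetter
>>> # s is assumed to be a string holding the contents of foo.txt
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(g))
...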
['1 foo bar\n', '1 lorem ipsum gypsum\n', '1 baba loo too\n']
['2 hello goodbye seeya\n']
['3 kobe magic wilt\n', '3 foo sneaks bar\n', '3 more stuff\n', '3 last line in file']
>>>
If you want to clean up that output a bit, you can map str.strip over each group:
>>> with io.StringIO(s) as f:
...     for k, g in groupby(f, itemgetter(0)):
...         print(list(map(str.strip, g)))
...
['1 foo bar', '1 lorem ipsum gypsum', '1 baba loo too']
['2 hello goodbye seeya']
['3 kobe magic wilt', '3 foo sneaks bar', '3 more stuff', '3 last line in file']
If you want to implement this from scratch, an inflexible and naive generator might look like this:
>>> def groupby_first_column(f):
...     line = next(f)
...     k = line[0]
...     group = [line]
...     for line in f:
...         if line[0] == k:
...             group.append(line)
...         else:
...             yield group
...             group = [line]
...             k = line[0]
...     yield group
...
>>> with io.StringIO(s) as f:
...     for group in groupby_first_column(f):
...         print(list(group))
...
['1 foo bar\n', '1 lorem ipsum gypsum\n', '1 baba loo too\n']
['2 hello goodbye seeya\n']
['3 kobe magic wilt\n', '3 foo sneaks bar\n', '3 more stuff\n', '3 last line in file']
>>>
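Warning: the generator above only works if the first column of every line starts at the very first position and is exactly one character long. It isn't meant to be very useful, only to illustrate the idea. If you want to roll your own, you will have to be more thorough.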
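To make that caveat concrete, here is a minimal sketch comparing itemgetter(0), which keys on the first character, with a key function that takes the first whitespace-separated field. The sample data with a two-digit key and the helper name first_field are made up for illustration.

```python
import io
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data: note the two-digit key "10".
s = "1 foo bar\n10 lorem ipsum\n10 baba loo\n2 hello goodbye\n"

def first_field(line):
    # Key on the first whitespace-separated field, not the first character.
    return line.split()[0]

with io.StringIO(s) as f:
    by_char = [list(map(str.strip, g)) for _, g in groupby(f, itemgetter(0))]

with io.StringIO(s) as f:
    by_field = [list(map(str.strip, g)) for _, g in groupby(f, first_field)]

print(by_char)   # the "1 ..." and "10 ..." lines land in the same group
print(by_field)  # "1" and "10" are kept apart
```

Keying on the first field costs one split per line but removes the one-character restriction.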
Answer 1 (score: 1)
Here is a variant (fake_file here just stands in for the file from your with statement):
from io import StringIO
fake_file = StringIO('''1 foo bar
1 lorem ipsum gypsum
1 baba loo too
2 hello goodbye seeya
3 kobe magic wilt
3 foo sneaks bar
3 more stuff
3 last line in file''')
def iter_cols(file):
    lne = next(file).strip()
    buffer = [lne]
    last_number = lne.split()[0]
    for line in file:
        lne = line.strip()
        number = lne.split()[0]
        if number != last_number:
            yield buffer
            buffer = [lne]
            last_number = number
        else:
            buffer.append(lne)
    yield buffer

for cols in iter_cols(fake_file):
    print(cols)
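This iterates over the file without having to hold the whole file in memory. As a consequence, only adjacent lines are grouped together.
(You seem to be using Python 2: file was not a good variable name back then, because it is a built-in.)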
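The point about only adjacent lines being grouped can be illustrated with a small sketch. The unsorted sample input and the names unsorted_file and first_col are hypothetical; itertools.groupby is used here because it follows the same adjacent-run rule as the generator above.

```python
from io import StringIO
from itertools import groupby

# Hypothetical unsorted input: key "1" appears in two separate runs.
unsorted_file = StringIO("1 foo\n2 bar\n1 baz\n")

def first_col(line):
    return line.split()[0]

# Streaming: groups follow runs of adjacent lines, so "1" shows up twice.
adjacent = [[l.strip() for l in g] for _, g in groupby(unsorted_file, first_col)]

# Sorting first merges the runs, but it needs every line in memory
# and compares the keys as strings.
unsorted_file.seek(0)
lines = sorted(unsorted_file, key=first_col)
merged = [[l.strip() for l in g] for _, g in groupby(lines, first_col)]

print(adjacent)  # [['1 foo'], ['2 bar'], ['1 baz']]
print(merged)    # [['1 foo', '1 baz'], ['2 bar']]
```

If the input is already grouped, as in the question, the streaming form is enough and the sort can be skipped.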
Answer 2 (score: 1)
This is what itertools.groupby is for, but I think you need to read the whole file into memory to use it.
import itertools

with open("path/to/file") as f:
    data = f.readlines()  # a list of the lines of the file

groups = itertools.groupby(data, key=lambda line: line.split()[0])
# group on the first column of each line. Conceptually this produces something like
# (each value is really an iterator, and the lines keep their trailing newlines):
# [ ("1", ["1 foo bar", "1 lorem ipsum gypsum", "1 baba loo too"]),
#   ("2", ["2 hello goodbye seeya"]),
#   ("3", ["3 kobe magic wilt", "3 foo sneaks bar", "3 more stuff", "3 last line in file"]) ]
# since you only want the values there, just pull them out of the tuples;
# each group must be copied into a list before groupby moves on to the next one
result = [list(v) for k, v in groups]
But honestly, I'm not sure whether groupby consumes all of its data at once. If it is a lazy iterator, you can pass f to it directly.
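For what it's worth, itertools.groupby is lazy: it pulls items from its input only as the groups are consumed. The sketch below checks this with a made-up generator, tracked_lines, that stands in for a file object and logs every line it hands out.

```python
from itertools import groupby

consumed = []  # records every line the generator has handed out so far

def tracked_lines():
    # Stand-in for a file object: yields lines one at a time and logs each.
    for line in ["1 foo bar\n", "1 lorem ipsum\n", "2 hello goodbye\n", "3 kobe magic\n"]:
        consumed.append(line)
        yield line

groups = groupby(tracked_lines(), key=lambda line: line.split()[0])
consumed_before = len(consumed)  # creating the groupby object reads nothing

key, group = next(groups)
first_group = list(group)
# Reading the first group pulls its two lines plus one look-ahead line
# (the first line of the next group), not the whole input.
consumed_after_first = len(consumed)

print(consumed_before, first_group, consumed_after_first)
```

So passing the open file straight to groupby keeps memory use proportional to the largest group rather than to the whole file.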
import itertools

with open('path/to/file') as f:
    groups = itertools.groupby(f, key=lambda line: line.split()[0])
    for _, group in groups:
        result = list(group)
        # use this result however you like, but...
        # be sure not to leave this block until you've consumed all of
        # result, or you won't be able to read any more of the file.
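A related pitfall: each group shares the underlying iterator with groupby, so a group has to be consumed before groupby advances to the next key. The sketch below, using made-up sample data and a hypothetical first_col helper, contrasts storing the raw group iterators with copying each group into a list straight away.

```python
import io
from itertools import groupby

s = "1 foo bar\n1 lorem ipsum\n2 hello goodbye\n"  # made-up sample data

def first_col(line):
    return line.split()[0]

with io.StringIO(s) as f:
    # Wrong: the groups are stored unconsumed while groupby moves past them.
    stale = [g for _, g in groupby(f, key=first_col)]
    stale_contents = [list(g) for g in stale]

with io.StringIO(s) as f:
    # Right: each group is turned into a list before groupby advances.
    fresh = [list(g) for _, g in groupby(f, key=first_col)]

print(stale_contents)  # the first group comes back empty; its lines are lost
print(fresh)
```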
If you can't or don't want to read the whole file into memory at once, you'll need to do something special.
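def group_by_col(filename, key=None):
    if key is None:
        key = lambda s: s
    with open(filename) as f:
        cur_group = []
        grouper = []  # placeholder that never equals a real (string) key
        for line in f:
            new_grouper = key(line)
            line = line.rstrip()  # strip every line, including the first of a group
            if new_grouper != grouper:
                # key changed: emit the finished group and start a new one
                if cur_group:
                    yield cur_group
                cur_group = [line]
                grouper = new_grouper
            else:
                cur_group.append(line)
        if cur_group:  # don't yield an empty group for an empty file
            yield cur_group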
In this case you have to pass a key function that selects the first whitespace-delimited column of each line, e.g. lambda s: s.split()[0]
for group in group_by_col('path/to/file', key=lambda s: s.split()[0]):
    print(group)
Answer 3 (score: 0)
This builds on the accepted answer and groups by any specified column:
def group_by_column(f, column):
    line = next(f)
    k = line.split()[column]
    group = [line]
    for line in f:
        if line.split()[column] == k:
            group.append(line)
        else:
            yield group
            group = [line]
            k = line.split()[column]
    yield group

if __name__ == "__main__":
    foo = "foo.txt"
    with open(foo) as foofile:
        for group in group_by_column(foofile, 0):
            print(group)
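One thing to watch with the generator above: on an empty file the bare next(f) raises StopIteration inside a generator, which Python 3.7 and later turn into a RuntimeError, and a blank line makes line.split()[column] raise IndexError. A possible hardened variant is sketched below; the name group_by_column_safe and the choice to skip lines that are too short are assumptions, not part of the original answer.

```python
import io

def group_by_column_safe(f, column):
    # Hypothetical variant of group_by_column: tolerates empty input
    # and skips lines that do not have the requested column.
    group = []
    current_key = None
    for line in f:
        fields = line.split()
        if len(fields) <= column:
            continue  # assumption: lines without the column are ignored
        key = fields[column]
        if group and key != current_key:
            yield group  # key changed: emit the finished group
            group = []
        current_key = key
        group.append(line)
    if group:
        yield group  # emit the last group, if there was any input at all

empty_groups = list(group_by_column_safe(io.StringIO(""), 0))
sample = io.StringIO("1 foo bar\n\n1 lorem ipsum\n2 hello goodbye\n")
sample_groups = list(group_by_column_safe(sample, 0))
print(empty_groups)
print(sample_groups)
```

Whether short lines should be skipped or reported as errors depends on the data; skipping is just the simplest choice for a sketch.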