Question

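I have a file that is separated by blank lines into chunks that all have the same number of lines. Each line is a field. For example, in chunk1 the first field = a1,a2,a3. In chunk2, the same field = a2,a3,a4.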

a1,a2,a3
b1
c1,c2,c3,c4
d1
e1

a2,a3,a4
b2
c3,c4
d2
e2

a3,a5
b3
c4,c6
d3
e3

How can I get a data frame (or some other data structure) like the one below?

    f1        f2       f3            f4  f5 
    a1,a2,a3  b1       c1,c2,c3,c4   d1  e1
    a2,a3,a4  b2       c3,c4         d2  e2
    a3,a5     b3       c4,c6         d3  e3

Thanks!

Answer 1

An open file is an iterator of lines. What you want is an iterator of groups of lines.

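Since all of these groups are 6 lines long (counting the blank line at the end), the simplest approach is the grouper example from the itertools recipes in the documentation. (If you prefer, you can also get a ready-made version from the more-itertools library on PyPI.)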

from itertools import groupby, zip_longest

def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(path) as f:
    for group in grouper(f, 6):
        do_something(group)
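
To get from those groups to the table in the question, each 6-tuple still needs its newlines stripped and its blank separator line dropped. A minimal sketch, assuming the sample data sits in an in-memory `io.StringIO` (a stand-in for the real file) and using a hypothetical helper name `rows_from_groups`:

```python
import io
from itertools import zip_longest

# Sample data from the question; no blank line after the last chunk.
SAMPLE = (
    "a1,a2,a3\nb1\nc1,c2,c3,c4\nd1\ne1\n"
    "\n"
    "a2,a3,a4\nb2\nc3,c4\nd2\ne2\n"
    "\n"
    "a3,a5\nb3\nc4,c6\nd3\ne3\n"
)

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks (itertools recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def rows_from_groups(lines, group_size):
    """Hypothetical helper: one list of fields per fixed-size group."""
    rows = []
    # fillvalue="" pads the last group if the file lacks a final blank line.
    for group in grouper(lines, group_size, fillvalue=""):
        fields = [line.rstrip("\n") for line in group]
        # Drop the blank separator line and any padding.
        rows.append([field for field in fields if field])
    return rows

rows = rows_from_groups(io.StringIO(SAMPLE), 6)
for row in rows:
    print(row)
```

Each entry of `rows` is then one row of the desired table, in file order.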

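If the length of your groups is not known in advance (even if it is consistent throughout the file), you can instead use groupby to create alternating groups of empty and non-empty lines. This is a bit like using split on a string.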

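We can use bool as the key function here: non-empty lines are truthy and empty lines are falsy, provided the trailing newline has been stripped first. (If that seems odd to you, you can write something like lambda line: line or lambda line: line != '' instead.)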

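with open(path) as f:
    # Lines keep their trailing newline, so strip it before testing for emptiness.
    lines = (line.rstrip('\n') for line in f)
    for nonempty, group in groupby(lines, bool):
        if nonempty:
            do_something(group)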
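
One detail worth checking when using this approach: lines read from a file keep their trailing newline, so a blank line arrives as `'\n'`, which is truthy. The sketch below strips each line before grouping. The shortened sample text and the function name `chunks` are illustrative assumptions, not part of the original answer.

```python
import io
from itertools import groupby

# Shortened sample: three chunks of two fields each.
SAMPLE = "a1,a2,a3\nb1\n\na2,a3,a4\nb2\n\na3,a5\nb3\n"

def chunks(lines):
    """Yield each run of consecutive non-blank lines as a list."""
    stripped = (line.rstrip("\n") for line in lines)
    for nonempty, group in groupby(stripped, bool):
        if nonempty:
            # group is a lazy iterator; materialize it before moving on.
            yield list(group)

result = list(chunks(io.StringIO(SAMPLE)))
print(result)
```

Because `group` is consumed lazily by `groupby`, it is turned into a list before the next group is requested.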

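Or, if this seems over your head... well, first read David Beazley's Generator Tricks for Systems Programmers, and maybe it will no longer be over your head. But if it still is, we can do the same thing more explicitly: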

with open(path) as f:
    group = []
    for line in f:
        line = line.rstrip('\n')  # a blank line is read as '\n', which is truthy
        if line:
            group.append(line)
        else:
            do_something(group)
            group = []
    if group:
        do_something(group)

Answer 2

If you can use pandas and know how many fields there are:

import pandas as pd

fields = 5
# Blank lines are skipped by default, leaving one row per field line.
df = pd.read_table('data.txt', header=None)
df = pd.DataFrame(df.values.reshape(-1, fields))
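
For readers without pandas, the same reshape idea can be sketched in plain Python: collect the non-blank lines into one flat list, then slice it into rows of `fields` items. The helper name `reshape_rows` and the shortened inline sample are assumptions for illustration.

```python
def reshape_rows(values, fields):
    """Split a flat list into consecutive rows of `fields` items each."""
    if len(values) % fields:
        raise ValueError("number of values is not a multiple of the field count")
    return [values[i:i + fields] for i in range(0, len(values), fields)]

# Shortened sample: two chunks of three fields each.
text = "a1,a2,a3\nb1\nc1\n\na2,a3,a4\nb2\nc3,c4\n"
flat = [line for line in text.splitlines() if line]  # skip blank lines
table = reshape_rows(flat, 3)
print(table)
```

Like the reshape call, this fails loudly when the line count does not divide evenly by the field count.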

If you don't know how many fields there are:

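import numpy as np
import pandas as pd

df = pd.read_table('data.txt', header=None, skip_blank_lines=False)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the trailing NaN row instead.
df = pd.concat([df, pd.DataFrame([np.nan])], ignore_index=True)
# Empty lines become NaN. The row index of the first one is the field count.
fields = np.where(pd.isnull(df))[0][0]
df = pd.DataFrame(df.values.reshape(-1, fields + 1))
del df[df.columns[-1]]  # delete the NaN column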
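
The key idea in this second snippet, that the position of the first blank line equals the number of fields per chunk, can be checked without pandas. A minimal sketch, assuming a hypothetical helper named `count_fields`:

```python
def count_fields(lines):
    """Return the index of the first blank line, i.e. the fields per chunk."""
    lines = list(lines)
    for index, line in enumerate(lines):
        if not line.strip():
            return index
    # No blank line at all: treat the whole input as a single chunk.
    return len(lines)

sample = "a1,a2,a3\nb1\nc1\n\na2,a3,a4\nb2\nc3,c4\n".splitlines()
fields = count_fields(sample)
print(fields)
```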

Answer 3

You can try a generator approach:

def chunks_by_space(file):
    with open(file, 'r') as f:
        data = [line.strip() for line in f]
    store = []
    for value in data:
        if value == '':
            if store:  # skip empty chunks from repeated or trailing blank lines
                yield store
            store = []
        else:
            store.append(value)
    if store:
        yield store

gen = chunks_by_space('file_name')
print(list(zip(*gen)))

Output:

[('a1,a2,a3', 'a2,a3,a4', 'a3,a5'), ('b1', 'b2', 'b3'), ('c1,c2,c3,c4', 'c3,c4', 'c4,c6'), ('d1', 'd2', 'd3'), ('e1', 'e2', 'e3')]
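
Note that `zip(*gen)` transposes the chunks, so each tuple above is a column rather than a row. If named columns like the `f1` to `f5` header in the question are wanted, the tuples can be paired with generated names. A sketch, with the tuples copied from the output above and the variable names chosen only for illustration:

```python
columns = [
    ('a1,a2,a3', 'a2,a3,a4', 'a3,a5'),
    ('b1', 'b2', 'b3'),
    ('c1,c2,c3,c4', 'c3,c4', 'c4,c6'),
    ('d1', 'd2', 'd3'),
    ('e1', 'e2', 'e3'),
]

# Generate the names f1..f5 to match the header in the question.
names = [f"f{i}" for i in range(1, len(columns) + 1)]
table = {name: list(column) for name, column in zip(names, columns)}

# Transposing again recovers one tuple per chunk (one row each).
rows = list(zip(*columns))
print(table["f1"])
print(rows[0])
```

A mapping from column name to a list of values is one of the input shapes that `pandas.DataFrame` accepts, so `table` can be passed to it directly if a data frame is the goal.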

Python: split a file into chunks by newline
