How to group lines separated by blank lines more Pythonically

Time: 2017-09-28 20:12:55

Tags: python python-3.x

I have a file that contains lists of duplicate but uniquely named files.

For example:

<md5sum>  /var/www/one.png
<md5sum>  /var/www/one-1.png

<md5sum>  /var/www/two.png
<md5sum>  /var/www/two-1.png
<md5sum>  /var/www/two-2.png

The goal is to end up with the following:

[
    [
        '/var/www/one.png',
        '/var/www/one-1.png'
    ],
    [
        '/var/www/two.png',
        '/var/www/two-1.png',
        '/var/www/two-2.png'
    ]
]

The file is the output of a command I ran earlier. I now need to process that output, and as a starting point I wrote a loop that keeps an index variable for the current group and appends each path to the list at that index.

Is there a more succinct way to write this?

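3 answers:

Answer 0 (score: 4)

Read the whole file into a variable. Use split("\n\n") to separate it into groups of duplicates, then split("\n") to get each line in a group, and finally split("  ") (two spaces) to separate the checksum from the path on each line.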

def process_dupes(dupes_file):
    contents = dupes_file.read()
    # Blank lines separate groups; each line is "<md5sum>  <path>" (two spaces).
    groups = [[line.split("  ")[1] for line in group.split("\n") if line != ""] for group in contents.split("\n\n")]
    return groups
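The split-based idea above can also be tried on an in-memory string. The following is only a minimal sketch, not the answerer's code: `split_groups` and `SAMPLE` are hypothetical names, and it assumes every non-blank line has the form `<md5sum>  <path>` with a two-space separator. It swaps `split("\n\n")` for `re.split` so that runs of several blank lines do not produce empty groups.

```python
import re

# Hypothetical sample data in the "<md5sum>  <path>" format from the question,
# with two blank lines between the groups.
SAMPLE = (
    "<md5sum>  /var/www/one.png\n"
    "<md5sum>  /var/www/one-1.png\n"
    "\n"
    "\n"
    "<md5sum>  /var/www/two.png\n"
    "<md5sum>  /var/www/two-1.png\n"
    "<md5sum>  /var/www/two-2.png\n"
)

def split_groups(text):
    """Return one list of paths per block of lines separated by blank lines."""
    # Two or more consecutive newlines mark a group boundary.
    blocks = re.split(r"\n{2,}", text.strip())
    # Split each line once on the two-space separator and keep the path.
    return [
        [line.split("  ", 1)[1] for line in block.split("\n")]
        for block in blocks
        if block
    ]

groups = split_groups(SAMPLE)
```

Splitting with a maximum of one split keeps paths that themselves contain two consecutive spaces intact, which the plain `split("  ")[1]` would truncate.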

Answer 1 (score: 2)

A slightly better version. It also handles the case where there are multiple blank lines between groups.

def get_groups(dupes_file):
    group = []
    for line in dupes_file:
        if line == "\n":
            if group:
                yield group
                group = []
        else:
            md5sum, path = line.split('  ')
            group.append(path.strip())
    if group:
        yield group

Output:

In [61]: with open(DUPES_FILE, 'r') as dupes_file:
    ...:     pprint(list(get_groups(dupes_file)))
    ...:
[['/var/www/one.png', '/var/www/one-1.png'],
 ['/var/www/two.png', '/var/www/two-1.png', '/var/www/two-2.png']]

If that is confusing, one improvement to your version is to drop the index variable and use -1 instead, since you always want to append to the last list.

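from pprint import pprint

def process_dupes(dupes_file):
    groups = [[]]
    for line in dupes_file:
        if line != '\n':
            # Each line is "<md5sum>  <path>"; drop the trailing newline from the path.
            path = line.split('  ')[1].rstrip('\n')
            groups[-1].append(path)
        else:
            # A blank line starts a new group.
            groups.append([])

    pprint(groups)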

Answer 2 (score: 0)

The following processes the data in the file iteratively, rather than reading the whole thing into memory first:

from itertools import groupby
from pprint import pprint

DUPES_FILE = './dupes.txt'

def process_dupes(dupes_file):
    groups = [
        [line.rstrip().split('  ')[1] for line in lines]
            for blank, lines in groupby(dupes_file, lambda line: line == '\n')
                if not blank
    ]
    pprint(groups)

with open(DUPES_FILE, 'r') as dupes_file:
    process_dupes(dupes_file)

Output:

[['/var/www/one.png', '/var/www/one-1.png'],
 ['/var/www/two.png', '/var/www/two-1.png', '/var/www/two-2.png']]
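As a variation on the groupby approach, the groups can be yielded one at a time instead of being collected into a list. This is a hedged sketch rather than part of the answer: `iter_groups` is a hypothetical name, and it assumes each non-blank line is `<md5sum>  <path>` with a two-space separator. It also treats whitespace-only lines as blank, which the `line == '\n'` test does not.

```python
import io
from itertools import groupby

def iter_groups(lines):
    """Yield one list of paths per run of consecutive non-blank lines."""
    # groupby splits the stream into alternating runs of blank and
    # non-blank lines; the key is True for lines that contain text.
    for has_text, chunk in groupby(lines, key=lambda line: bool(line.strip())):
        if has_text:
            # Split once on the two-space separator and keep the path.
            yield [line.rstrip("\n").split("  ", 1)[1] for line in chunk]

# Hypothetical in-memory stand-in for the dupes file; any iterable of
# lines, such as an open file object, works the same way.
sample_file = io.StringIO(
    "<md5sum>  /var/www/one.png\n"
    "<md5sum>  /var/www/one-1.png\n"
    "   \n"
    "\n"
    "<md5sum>  /var/www/two.png\n"
    "<md5sum>  /var/www/two-1.png\n"
    "<md5sum>  /var/www/two-2.png\n"
)
result = list(iter_groups(sample_file))
```

Because consecutive blank lines fall into a single run, multiple blank lines between groups never produce an empty group.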