I have a file containing a list of duplicate but uniquely named files.
For example:
The file is the output of a command I ran earlier; groups of duplicates are separated by blank lines:
<md5sum> /var/www/one.png
<md5sum> /var/www/one-1.png

<md5sum> /var/www/two.png
<md5sum> /var/www/two-1.png
<md5sum> /var/www/two-2.png
Now I need to process this output. The goal is to end up with the following structure:
[
[
'/var/www/one.png',
'/var/www/one-1.png'
],
[
'/var/www/two.png',
'/var/www/two-1.png',
'/var/www/two-2.png'
]
]
Is there a concise way to produce this?
Answer 0 (score: 4)
Read the entire file into a variable. Split it into groups of duplicates with split("\n\n"), split each group into its lines with split("\n"), and finally split each line on the space with split(" "):
def process_dupes(dupes_file):
    contents = dupes_file.read()
    groups = [
        [line.split(" ")[1] for line in group.split("\n") if line != ""]
        for group in contents.split("\n\n")
    ]
    return groups
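The split() chain above can be exercised directly on an in-memory string. This is a runnable sketch using the sample data from the question; the hash values are made-up stand-ins for the <md5sum> fields:

```python
# Sample input as one string; the hashes are placeholder values,
# not real md5sums.
sample = (
    "d41d8cd9 /var/www/one.png\n"
    "d41d8cd9 /var/www/one-1.png\n"
    "\n"
    "b1946ac9 /var/www/two.png\n"
    "b1946ac9 /var/www/two-1.png\n"
    "b1946ac9 /var/www/two-2.png\n"
)

# Split into blank-line-separated groups, then into lines, then take
# the path (second field) of each non-empty line.
groups = [
    [line.split(" ")[1] for line in group.split("\n") if line != ""]
    for group in sample.split("\n\n")
]
print(groups)
```

The `if line != ""` filter matters because the file's trailing newline leaves an empty string after the final split("\n").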
Answer 1 (score: 2)
A slightly better version, which also handles the case where there are multiple blank lines between groups:
def get_groups(dupes_file):
    group = []
    for line in dupes_file:
        if line == "\n":
            if group:
                yield group
                group = []
        else:
            md5sum, path = line.split(' ')
            group.append(path.strip())
    if group:
        yield group
Output:
In [61]: with open(DUPES_FILE, 'r') as dupes_file:
    ...:     pprint(list(get_groups(dupes_file)))
    ...:
[['/var/www/one.png', '/var/www/one-1.png'],
 ['/var/www/two.png', '/var/www/two-1.png', '/var/www/two-2.png']]
If that is confusing, one improvement to your version would be to drop the index variable and use -1 instead, since you always want to append to the last list:
from pprint import pprint

def process_dupes(dupes_file):
    groups = [[]]
    for line in dupes_file:
        if line != '\n':
            path = line.split(' ')[1]
            groups[-1].append(path)
        else:
            groups.append([])
    pprint(groups)
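Here is a self-contained sketch of that index-free approach, runnable without a file on disk: io.StringIO stands in for the open dupes file, the hashes are made-up placeholders, and (as small departures from the answer above) an rstrip() is added to drop trailing newlines and the result is returned rather than pretty-printed:

```python
import io


def collect_groups(dupes_file):
    # Same idea as the answer above: always append to the last list,
    # start a new list on each blank line.
    groups = [[]]
    for line in dupes_file:
        if line != '\n':
            # rstrip() (added for this demo) drops the trailing newline.
            path = line.split(' ')[1].rstrip()
            groups[-1].append(path)
        else:
            groups.append([])
    return groups


# io.StringIO simulates the open file; hashes are placeholders.
data = io.StringIO(
    "d41d8cd9 /var/www/one.png\n"
    "d41d8cd9 /var/www/one-1.png\n"
    "\n"
    "b1946ac9 /var/www/two.png\n"
    "b1946ac9 /var/www/two-1.png\n"
    "b1946ac9 /var/www/two-2.png\n"
)
print(collect_groups(data))
```

Note that, unlike the generator version, this variant will emit empty lists when the file contains consecutive blank lines.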
Answer 2 (score: 0)
The following processes the data in the file iteratively, rather than reading the entire contents into memory first:
from itertools import groupby
from pprint import pprint

DUPES_FILE = './dupes.txt'

def process_dupes(dupes_file):
    groups = [
        [line.rstrip().split(' ')[1] for line in lines]
        for blank, lines in groupby(dupes_file, lambda line: line == '\n')
        if not blank
    ]
    pprint(groups)

with open(DUPES_FILE, 'r') as dupes_file:
    process_dupes(dupes_file)
Output:
[['/var/www/one.png', '/var/www/one-1.png'],
['/var/www/two.png', '/var/www/two-1.png', '/var/www/two-2.png']]
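To try the groupby pipeline without creating dupes.txt on disk, the same comprehension can be run against an in-memory buffer. A sketch, with io.StringIO standing in for the open file and made-up hash placeholders:

```python
import io
from itertools import groupby

# In-memory stand-in for dupes.txt; hashes are placeholders.
buffer = io.StringIO(
    "d41d8cd9 /var/www/one.png\n"
    "d41d8cd9 /var/www/one-1.png\n"
    "\n"
    "b1946ac9 /var/www/two.png\n"
    "b1946ac9 /var/www/two-1.png\n"
    "b1946ac9 /var/www/two-2.png\n"
)

# groupby() clusters consecutive lines by whether they are blank;
# the blank runs are discarded, the rest become groups of paths.
groups = [
    [line.rstrip().split(' ')[1] for line in lines]
    for blank, lines in groupby(buffer, lambda line: line == '\n')
    if not blank
]
print(groups)
```

Because groupby only ever holds the current run of lines, this keeps the streaming property the answer describes.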