I ran into trouble parsing a horrible txt file, but I managed to extract the information I need into lists:
['OS-EXT-SRV-ATTR:host', 'compute-0-4.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'commvault-vsa-vm']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-4.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-00000008']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda']
['hostId', '985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124']
['key_name', '-']
['name', 'Commvault_VSA_VM']
['OS-EXT-SRV-ATTR:host', 'compute-0-28.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'dummy-vm']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-28.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-0000226e']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/hda']
['hostId', '7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22']
['key_name', '-']
['name', 'Dummy_VM']
['OS-EXT-SRV-ATTR:host', 'compute-0-20.domain.tld']
['OS-EXT-SRV-ATTR:hostname', 'mavtel-sif-vsifarvl11']
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-20.domain.tld']
['OS-EXT-SRV-ATTR:instance_name', 'instance-00001da6']
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda']
['hostId', 'dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07']
['key_name', 'mav_tel_key']
['name', 'MAVTEL-SIF-vsifarvl11']
I want element 0 of each list to become a column header and element 1 to fill the rows, for example:
OS-EXT-SRV-ATTR:host, OS-EXT-SRV-ATTR:hostname,...., name
compute-0-4.domain.tld, commvault-vsa-vm,....., Commvault_VSA_VM
compute-0-28.domain.tld, dummy-vm,...., Dummy_VM
This is my code so far:
import re

with open('metadata.txt', 'r') as infile:
    lines = infile.readlines()
    for line in lines:
        if re.search('hostId|properties|OS-EXT-SRV-ATTR:host|OS-EXT-SRV-ATTR:hypervisor_hostname|name', line):
            re.sub("[\t]+", " ", line)
            find = line.strip()
            format = ''.join(line.split()).replace('|', ',')
            list = format.split(',')
            new_list = list[1:-1]
I'm new to Python, so I often don't have much of an idea how to get things working.
Answer 0 (score: 1)
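You can build up a 2D array step by step by keeping track of the headers and of each entry from the text file. Here data is the list of [key, value] pairs shown in the question.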
headers = list(set([entry[0] for entry in data]))  # obtain unique headers (set order is arbitrary)

num_rows = 1  # one row for the headers
for entry in data:  # figuring out how many rows we are going to need
    if entry[0] == 'name':  # name is unique per row so using that
        num_rows += 1
num_cols = len(headers)

mat = [[0 for _ in range(num_cols)] for _ in range(num_rows)]
mat[0] = headers  # add headers as first row
header_lookup = {header: i for i, header in enumerate(headers)}

row = 1
for entry in data:
    header, val = entry[0], entry[1]
    col = header_lookup[header]
    mat[row][col] = val  # add entries to each subsequent row
    if header == 'name':  # 'name' is the last entry of each record
        row += 1

print(mat)
Output:
[['hostId', 'OS-EXT-SRV-ATTR:host', 'name', 'OS-EXT-SRV-ATTR:hostname', 'OS-EXT-SRV-ATTR:instance_name', 'OS-EXT-SRV-ATTR:root_device_name', 'OS-EXT-SRV-ATTR:hypervisor_hostname', 'key_name'], ['985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124', 'compute-0-4.domain.tld', 'Commvault_VSA_VM', 'commvault-vsa-vm', 'instance-00000008', '/dev/vda', 'compute-0-4.domain.tld', '-'], ['7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22', 'compute-0-28.domain.tld', 'Dummy_VM', 'dummy-vm', 'instance-0000226e', '/dev/hda', 'compute-0-28.domain.tld', '-'], ['dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07', 'compute-0-20.domain.tld', 'MAVTEL-SIF-vsifarvl11', 'mavtel-sif-vsifarvl11', 'instance-00001da6', '/dev/vda', 'compute-0-20.domain.tld', 'mav_tel_key']]
If you need to write the new 2D array to a file, to make it less "horrible" :)
with open('output.txt', 'w') as f:
    for lines in mat:
        lines_out = '\t'.join(lines)
        f.write(lines_out)
        f.write('\n')
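Because set() does not keep insertion order, the columns above come out in an arbitrary order. As a rough sketch of one way around that, the snippet below collects the headers with dict.fromkeys, which keeps first-seen order on Python 3.7+. The data values are made up for illustration, and it assumes, as the answer does, that 'name' is the last entry of each record.

```python
# Made-up sample in the same [key, value] shape as the question's lists.
data = [
    ['hostId', 'abc123'], ['key_name', '-'], ['name', 'vm-one'],
    ['hostId', 'def456'], ['key_name', 'some_key'], ['name', 'vm-two'],
]

# dict.fromkeys keeps first-seen order (Python 3.7+), unlike set().
headers = list(dict.fromkeys(key for key, _ in data))

rows, current = [], {}
for key, val in data:
    current[key] = val
    if key == 'name':  # assumption: 'name' closes each record
        rows.append([current.get(h, '') for h in headers])
        current = {}

table = [headers] + rows
for row in table:
    print(', '.join(row))
```

A record that lacks one of the headers gets an empty string in that column instead of a leftover 0.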
Answer 1 (score: 1)
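Looking at your input file, I see that it contains what appears to be the output of the openstack nova show command, mixed with other content. There are basically two kinds of lines: valid ones and invalid ones (duh).
Valid ones have this structure:
'| key | value |'
and invalid ones have anything else.
So we can define a valid line as one that splits on | into exactly four parts, which Python can do like this (it is called unpacking assignment):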
a, b, c, d = [1, 2, 3, 4]
a, b, c, d = some_string.split('|')
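The assignment succeeds when the right-hand side has exactly four parts and fails with a ValueError otherwise. Once we have also made sure that a and d are empty and that b and c are not empty, we have a valid line.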
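To make the validity check concrete, here is a minimal sketch of it as a standalone helper. The name split_row is made up for this illustration and the sample lines are invented; it only shows how the unpacking either succeeds or raises ValueError.

```python
def split_row(line):
    """Return (key, value) for a '| key | value |' line, or None otherwise."""
    try:
        # Unpacking raises ValueError unless split() yields exactly four parts.
        a, b, c, d = (part.strip() for part in line.split('|'))
    except ValueError:
        return None
    # The outer parts must be empty and the inner parts non-empty.
    if a or d or not b or not c:
        return None
    return b, c


print(split_row('| name | Dummy_VM |'))      # a valid row
print(split_row('+----------+----------+'))  # a border line without pipes
print(split_row('| a | b | c |'))            # too many parts
```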
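Furthermore, we can say that if b equals 'Property' and c equals 'Value', we have hit a header row, and whatever follows must describe a "new record".
This function does exactly that: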
def parse_metadata_file(path):
    """ parses a data file generated by `nova show` into records """
    with open(path, 'r', encoding='utf8') as file:
        record = {}
        for line in file:
            try:
                # unpack line into 4 fields: "| key | val |"
                a, key, val, z = map(str.strip, line.split('|'))
                if a != '' or z != '' or key == '' or val == '':
                    continue
            except ValueError:
                # skip invalid lines
                continue
            if key == 'Property' and val == 'Value':
                # header row: output current record (if any) and start a new one
                if record:
                    yield record
                record = {}
            else:
                # write property to current record
                record[key] = val
        # output last record
        if record:
            yield record
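It spits out a new dict for every record it finds and ignores all lines that do not pass the sanity check. Effectively, this function generates a stream of dicts.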
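For trying the idea out without a file on disk, here is a rough sketch of the same approach that takes any iterable of lines. The name parse_records and the sample text are made up for this illustration; the sample only imitates the general shape of two tables with noise around them.

```python
def parse_records(lines):
    """Hypothetical variant of the parser that takes any iterable of lines."""
    record = {}
    for line in lines:
        parts = [part.strip() for part in line.split('|')]
        # A valid row is "| key | val |": four parts, outer two empty.
        if len(parts) != 4 or parts[0] or parts[3] or not parts[1] or not parts[2]:
            continue
        key, val = parts[1], parts[2]
        if key == 'Property' and val == 'Value':
            # Header row: emit the record collected so far, if any.
            if record:
                yield record
            record = {}
        else:
            record[key] = val
    # Emit the last record.
    if record:
        yield record


# Made-up sample input: two tables with unrelated lines mixed in.
sample = """\
some unrelated log line
+----------+----------+
| Property | Value    |
+----------+----------+
| hostId   | abc123   |
| name     | vm-one   |
+----------+----------+
| Property | Value    |
| hostId   | def456   |
| name     | vm-two   |
+----------+----------+
"""

records = list(parse_records(sample.splitlines()))
print(records)
```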
Now we can use the csv module to write this stream of dicts to a CSV file:
import csv

# list of fields we are interested in
fields = ['hostId', 'properties', 'OS-EXT-SRV-ATTR:host', 'OS-EXT-SRV-ATTR:hypervisor_hostname', 'name']

with open('output.csv', 'w', encoding='utf8', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    writer.writerows(parse_metadata_file('metadata.txt'))
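The csv module has a DictWriter, which is designed to take dicts as input and write them as CSV rows according to the given key names. With extrasaction='ignore' it does not matter if the current record has more fields than required. With the fields list, extracting a different set of fields becomes very easy. This: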
writer.writerows(parse_metadata_file('metadata.txt'))
is a convenient shorthand for
for record in parse_metadata_file('metadata.txt'):
    writer.writerow(record)
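To see what extrasaction='ignore' does in isolation, here is a small sketch that writes to an in-memory buffer instead of a file. The records are made up for the example: both carry an extra key_name field, and the second one has no name.

```python
import csv
import io

# Made-up records; both carry an extra field and the second one lacks 'name'.
records = [
    {'hostId': 'abc123', 'name': 'vm-one', 'key_name': '-'},
    {'hostId': 'def456', 'key_name': 'some_key'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['hostId', 'name'], extrasaction='ignore')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The extra key_name field is silently dropped, and the missing name is written as an empty cell because DictWriter's restval defaults to an empty string.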
Answer 2 (score: 0)
Looks like a job for pandas:
import pandas as pd
list_to_export = [['OS-EXT-SRV-ATTR:host', 'compute-0-4.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'commvault-vsa-vm'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-4.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-00000008'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda'],
['hostId', '985035a85d3c98137796f5799341fb65df21e8893fd988ac91a03124'],
['key_name', '-'],
['name', 'Commvault_VSA_VM'],
['OS-EXT-SRV-ATTR:host', 'compute-0-28.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'dummy-vm'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-28.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-0000226e'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/hda'],
['hostId', '7bd08d963a7c598f274ce8af2fa4f7beb4a66b98689cc7cdc5a6ef22'],
['key_name', '-'],
['name', 'Dummy_VM'],
['OS-EXT-SRV-ATTR:host', 'compute-0-20.domain.tld'],
['OS-EXT-SRV-ATTR:hostname', 'mavtel-sif-vsifarvl11'],
['OS-EXT-SRV-ATTR:hypervisor_hostname', 'compute-0-20.domain.tld'],
['OS-EXT-SRV-ATTR:instance_name', 'instance-00001da6'],
['OS-EXT-SRV-ATTR:root_device_name', '/dev/vda'],
['hostId', 'dd82c20a014e05fcfb3d4bcf653c30fa539a8fd4e946760ee1cc6f07'],
['key_name', 'mav_tel_key'],
['name', 'MAVTEL-SIF-vsifarvl11']]
# group the values by key: one list per column
data_dict = {}
for i in list_to_export:
    if i[0] not in data_dict:
        data_dict[i[0]] = [i[1]]
    else:
        data_dict[i[0]].append(i[1])

pd.DataFrame.from_dict(data_dict, orient='index').T.to_csv('filename.csv')