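I need to convert an XML file to CSV. I have made some attempts with BeautifulSoup, but I have almost no Python knowledge. I am a little closer, but still far from a working solution.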
#!/usr/bin/env python3
from bs4 import BeautifulSoup
# infile = open("test.xml","r")
# data = infile.read()
data = """<ROOT>
<SUB1>
<credential id='3' name='somename' host='somehost' username='someusername' info='someinfo' type='LDAP' opSys='' url='3423' email='33454'>
<notes />
<label id='1' />
</credential>
<credential id='3' name='somename2' host='somehost2' username='someusername2' info='someinfo2' type='LDAP' opSys='' url='12' email='34'>
<notes>some note
asdasdasd </notes>
<label id='4' />
</credential>
</SUB1>
</ROOT>"""
soup = BeautifulSoup(data,'xml')
# export as csv, should look like this
# name;host;username;info;type;url;email;notes;label
# "somename";"somehost";"someusername";"someinfo";LDAP;"3423";"33454";"";1
# "somename2";"somehost2";"someusername2";"someinfo2";LDAP;"12";"34";"some note asdasdasd";4
print("name;host;username;info;type;url;email;notes;label")
notes = soup.find_all('notes')
for notes in notes:
    if notes == "":
        print("")
    else:
        notes = notes.get_text('\n', '')
        print(notes)
Can anyone give me some hints?
Answer 0 (score: 0)
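You can have BeautifulSoup return all of the elements you want in a single call. You can then use elem.name to determine which kind of element was returned and build up the information needed for each row. The label element comes last inside each credential, so you can use it as the trigger for writing the output row.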
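Python has a csv library that can help you. It takes a list of values and formats it correctly as an output row. Normally you would not need quotes around every value, but since your expected output has them, you can add quoting=csv.QUOTE_ALL to force quotes onto all values.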
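To make the effect of quoting=csv.QUOTE_ALL concrete, here is a minimal sketch that formats the same row with and without it. The sample row and the render() helper are made up for illustration, and the row is written to an in-memory buffer rather than a file.

```python
import csv
import io

# Hypothetical sample row; one value contains the delimiter, one is empty.
row = ["somename", "some;host", "LDAP", ""]


def render(values, **writer_options):
    """Format one row with csv.writer and return the resulting text."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter=";", **writer_options).writerow(values)
    return buffer.getvalue()


# Default quoting (csv.QUOTE_MINIMAL): only values that need quotes get them.
minimal = render(row)

# csv.QUOTE_ALL: every value is quoted, as in the expected output above.
quoted = render(row, quoting=csv.QUOTE_ALL)

print(minimal, end="")  # somename;"some;host";LDAP;
print(quoted, end="")   # "somename";"some;host";"LDAP";""
```

Note that csv.writer ends each row with "\r\n" by default, which is why files should be opened with newline='' when writing CSV.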
itemgetter() is just a useful function for extracting the required elements from a Python data structure, such as a list or a dictionary, in a single call.
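As a small illustration of itemgetter() on its own, the sketch below pulls several keys out of a dictionary and several positions out of a list. The attrs dictionary is a hypothetical stand-in for the attributes of one credential element.

```python
from operator import itemgetter

# Hypothetical attribute dictionary, mirroring one <credential> element.
attrs = {"id": "3", "name": "somename", "host": "somehost", "type": "LDAP"}

# Build the getter once, then call it on any object that supports obj[key].
get_fields = itemgetter("name", "host", "type")
values = get_fields(attrs)  # a tuple, in the order the keys were requested
print(values)  # ('somename', 'somehost', 'LDAP')

# The same idea works with list positions.
get_ends = itemgetter(0, -1)
ends = get_ends(["a", "b", "c"])
print(ends)  # ('a', 'c')
```

When itemgetter() is given more than one key it returns a tuple, so wrap the result in list() if you want to append further values to it. With a single key it returns the bare value instead.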
from bs4 import BeautifulSoup
from operator import itemgetter
import csv

credential_fields = ['name', 'host', 'username', 'info', 'type', 'url', 'email']
get_fields = itemgetter(*credential_fields)

with open('test.xml') as f_input:
    soup = BeautifulSoup(f_input, 'xml')

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=';', quoting=csv.QUOTE_ALL)
    csv_output.writerow(credential_fields + ['notes', 'label'])

    # Elements are returned in document order: credential, notes, label
    for elem in soup.find_all(['notes', 'credential', 'label']):
        if elem.name == 'notes':
            note = elem.get_text(strip=True).replace('\n', ' ')
        elif elem.name == 'credential':
            credential = list(get_fields(elem))
        else:
            # label is the last child, so the row is now complete
            label = elem['id']
            csv_output.writerow(credential + [note, label])
This would give you output in the following format:
"name";"host";"username";"info";"type";"url";"email";"notes";"label"
"somename";"somehost";"someusername";"someinfo";"LDAP";"3423";"33454";"";"1"
"somename2";"somehost2";"someusername2";"someinfo2";"LDAP";"12";"34";"some note asdasdasd";"4"
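The loop above relies on label always being the last child of each credential. As a rough alternative sketch, the same rows can be built one credential at a time, so the order of notes and label no longer matters. Note that this swaps BeautifulSoup for the standard library's xml.etree.ElementTree; the names XML, FIELDS and credentials_to_rows are made up for illustration, and the result is written to an in-memory buffer instead of output.csv.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Sample data copied from the question.
XML = """<ROOT>
<SUB1>
<credential id='3' name='somename' host='somehost' username='someusername' info='someinfo' type='LDAP' opSys='' url='3423' email='33454'>
<notes />
<label id='1' />
</credential>
<credential id='3' name='somename2' host='somehost2' username='someusername2' info='someinfo2' type='LDAP' opSys='' url='12' email='34'>
<notes>some note
asdasdasd </notes>
<label id='4' />
</credential>
</SUB1>
</ROOT>"""

FIELDS = ["name", "host", "username", "info", "type", "url", "email"]


def credentials_to_rows(xml_text):
    """Yield one list of CSV values per <credential> element."""
    root = ET.fromstring(xml_text)
    for cred in root.iter("credential"):
        # Missing attributes become empty strings instead of raising.
        row = [cred.get(field, "") for field in FIELDS]
        # split() + join() collapses newlines and repeated spaces in the note.
        notes = " ".join(cred.findtext("notes", default="").split())
        label = cred.find("label")
        label_id = label.get("id", "") if label is not None else ""
        yield row + [notes, label_id]


rows = list(credentials_to_rows(XML))

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", quoting=csv.QUOTE_ALL)
writer.writerow(FIELDS + ["notes", "label"])
writer.writerows(rows)
print(buffer.getvalue(), end="")
```

To write a real file, replace the buffer with open('output.csv', 'w', newline='') and read the input with ET.parse('test.xml').getroot() instead of ET.fromstring().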