Python: xml2csv using BeautifulSoup

Date: 2018-09-20 07:48:55

Tags: python beautifulsoup

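I need to convert an XML file to CSV. I have made a few attempts with BeautifulSoup, but I have almost no Python knowledge. I have got part of the way, but I am still far from a working result.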

#!/usr/bin/env python3
from bs4 import BeautifulSoup

# infile = open("test.xml","r")
# data = infile.read()
data = """<ROOT>
<SUB1>
  <credential id='3' name='somename' host='somehost' username='someusername' info='someinfo' type='LDAP' opSys='' url='3423' email='33454'>
    <notes />
    <label id='1' />
  </credential>
<credential id='3' name='somename2' host='somehost2' username='someusername2' info='someinfo2' type='LDAP' opSys='' url='12' email='34'>
    <notes>some note
asdasdasd </notes>
      <label id='4' />
    </credential>
  </SUB1>
</ROOT>"""
soup = BeautifulSoup(data,'xml')

# export as csv, should look like this
# name;host;username;info;type;url;email;notes;label
# "somename";"somehost";"someusername";"someinfo";LDAP;"3423";"33454";"";1
# "somename2";"somehost2";"someusername2";"someinfo2";LDAP;"12";"34";"somenote asdasdasd";4


print("name;host;username;info;type;url;email;notes;label")



notes = soup.find_all('notes')
for notes in notes:
    if notes == "":
        print("")
    else:
        notes = notes.get_text('\n', '')
        print(notes)

Can anyone give me some hints?

1 answer:

Answer 0 (score: 0)

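You can have BeautifulSoup return all of the elements you want in a single call. You can then use elem.name to determine which kind of element was returned and build up the information needed for each row. label comes last within each credential, so you can use it as the trigger to write the output row.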
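
To see why label can serve as the trigger, here is a small sketch of what find_all() returns when given a list of names. The XML is a trimmed-down stand-in for the question's data, and the 'xml' parser assumes lxml is installed.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample in the same shape as the question's XML (illustrative only).
xml = """<ROOT><SUB1>
<credential id='3' name='somename'><notes /><label id='1' /></credential>
<credential id='3' name='somename2'><notes>some note</notes><label id='4' /></credential>
</SUB1></ROOT>"""

soup = BeautifulSoup(xml, 'xml')  # the 'xml' parser requires lxml

# Passing a list of names matches any of them; results come back in document order.
names = [elem.name for elem in soup.find_all(['notes', 'credential', 'label'])]
print(names)
```

For this sample the names come back as credential, notes, label, repeated once per credential, so by the time a label is seen, the credential attributes and the note for that row have already been collected.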

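Python has a CSV library that can help with this. It takes a list of values and formats it correctly as an output row. Normally you would not need quotes around every value, but since your expected output has them, you can pass quoting=csv.QUOTE_ALL to force quotes around all values.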
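
A minimal sketch of the difference between the default quoting and csv.QUOTE_ALL. It writes to in-memory buffers rather than a file, and the row values are made up for illustration.

```python
import csv
import io

row = ['somename', 'somehost', 'LDAP', '']

# Default quoting (QUOTE_MINIMAL): values are quoted only where needed.
plain = io.StringIO()
csv.writer(plain, delimiter=';').writerow(row)

# QUOTE_ALL: every value is wrapped in quotes, including empty ones.
quoted = io.StringIO()
csv.writer(quoted, delimiter=';', quoting=csv.QUOTE_ALL).writerow(row)

print(plain.getvalue(), end='')   # somename;somehost;LDAP;
print(quoted.getvalue(), end='')  # "somename";"somehost";"LDAP";""
```

Note that the question's expected output leaves some columns unquoted (type and label). QUOTE_ALL does not reproduce that mix; it quotes every column.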

itemgetter() is just a handy function for pulling the required elements out of Python data, such as a list or dictionary, in a single call.
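
As a quick illustration, here is itemgetter() applied to a plain dictionary; the attrs dictionary is a made-up stand-in for the attributes of one credential.

```python
from operator import itemgetter

# Build a callable that fetches these three keys in one go.
get_fields = itemgetter('name', 'host', 'type')

# Made-up stand-in for the attributes of one <credential> element.
attrs = {'id': '3', 'name': 'somename', 'host': 'somehost', 'type': 'LDAP'}

values = get_fields(attrs)
print(values)  # a tuple: ('somename', 'somehost', 'LDAP')
```

itemgetter() works on any object that supports obj[key]. A BeautifulSoup tag looks up its attributes that way, which is why the same callable can be applied directly to a credential element. With several keys the result is a tuple, so it is wrapped in list() before being joined with other list values.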

# Complete script: read test.xml and write output.csv
from bs4 import BeautifulSoup
from operator import itemgetter
import csv

# Attributes to copy from each <credential>, in output column order
credential_fields = ['name', 'host', 'username', 'info', 'type', 'url', 'email']
get_fields = itemgetter(*credential_fields)

with open('test.xml') as f_input:
    soup = BeautifulSoup(f_input, 'xml')    # the 'xml' parser requires lxml

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=';', quoting=csv.QUOTE_ALL)
    csv_output.writerow(credential_fields + ['notes', 'label'])    # header row

    # Elements come back in document order: credential, notes, label
    for elem in soup.find_all(['notes', 'credential', 'label']):
        if elem.name == 'notes':
            # Trim surrounding whitespace and flatten embedded newlines
            note = elem.get_text(strip=True).replace('\n', ' ')
        elif elem.name == 'credential':
            credential = list(get_fields(elem))
        else:
            # <label> is the last child, so the row is complete at this point
            label = elem['id']
            csv_output.writerow(credential + [note, label])
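
The loop above relies on every credential containing both a notes and a label element, in that order, and on every listed attribute being present. If that cannot be guaranteed, one possible variant is to walk the credential elements one at a time and look up their children directly. This is only a sketch: credentials_to_csv is a hypothetical helper name, and it returns the CSV as a string instead of writing output.csv.

```python
import csv
import io
from bs4 import BeautifulSoup

credential_fields = ['name', 'host', 'username', 'info', 'type', 'url', 'email']

def credentials_to_csv(xml_text):
    """Hypothetical helper: return CSV text with one row per <credential>."""
    soup = BeautifulSoup(xml_text, 'xml')  # the 'xml' parser requires lxml
    out = io.StringIO()
    writer = csv.writer(out, delimiter=';', quoting=csv.QUOTE_ALL)
    writer.writerow(credential_fields + ['notes', 'label'])

    for cred in soup.find_all('credential'):
        # .get() falls back to '' when an attribute is missing.
        row = [cred.get(field, '') for field in credential_fields]

        notes = cred.find('notes')
        # split() + join collapses newlines and repeated spaces into single spaces.
        row.append(' '.join(notes.get_text().split()) if notes is not None else '')

        label = cred.find('label')
        row.append(label.get('id', '') if label is not None else '')

        writer.writerow(row)
    return out.getvalue()
```

Calling print(credentials_to_csv(open('test.xml').read())) would show the rows. A credential without notes or label gets an empty value in that column instead of reusing the value from the previous row.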

For the sample XML in the question, the output looks like this:

"name";"host";"username";"info";"type";"url";"email";"notes";"label"
"somename";"somehost";"someusername";"someinfo";"LDAP";"3423";"33454";"";"1"
"somename2";"somehost2";"someusername2";"someinfo2";"LDAP";"12";"34";"some note asdasdasd";"4"