我具有以下xml结构,并且正在尝试将xml数据转换为结构化的pandas数据框。我已经阅读了许多有关同时使用xml.etree.ElementTree和BeautifulSoup进行xml转换的stackoverflow帖子,但是似乎没有一个示例可以处理我想要的不仅是标签,属性或文本,而且实际上是所有标签的示例。
例如,我希望从下面的xml中获得的是像这样的列:
abr_record_last_updated_date,abr_replaced,abn_status,abn_status_from_date,abn
您会在上面的abn中看到实际的文本,但我不确定如何收集全部文本。
<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>
<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>
</Transfer>
我开始研究在每个项目上使用root.iter的方法,但是我无法弄清楚如何使用该逻辑来获取所需的所有列。
import xml.etree.ElementTree as et
root = et.parse('sample.xml').getroot()
dict_new = {}
for each in root.iter('ABN'):
#abr_last_updated_date =
print(each.tag)
print(each.attrib)
print(each.items())
print(each.text)
最终,如果有人可以共享如何遍历每个xml“块”(不确定正确的术语)并获得前几个列,我相信我可以解决其余的问题。
答案 0 :(得分:1)
即使这是XML文件,也可以使用BeautifulSoup的CSS选择器或text
属性:
data = '''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>
<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>
</Transfer>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
z = zip(soup.select('ABR[recordLastUpdatedDate]'),
soup.select('ABR[replaced]'),
soup.select('ABN[status]'),
soup.select('ABN[ABNStatusFromDate]'),
soup.select('ABN'))
for (c1, c2, c3, c4, c5) in z:
print(c1['recordLastUpdatedDate'], c2['replaced'], c3['status'], c4['ABNStatusFromDate'], c5.text.strip())
打印:
20180216 N ACT 19991101 11000000948
20190531 N CAN 20190501 11000002568
答案 1 :(得分:0)
使用BeautifulSoup,您可以提取所有项目。
from bs4 import BeautifulSoup
data='''<?xml version="1.0"?><Transfer error="none" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="BulkExtract.xsd"><TransferInfo><FileSequenceNumber>1</FileSequenceNumber><RecordCount>714100</RecordCount><ExtractTime>2019-06-19T12:22:15</ExtractTime></TransferInfo>
<ABR recordLastUpdatedDate="20180216" replaced="N"><ABN status="ACT" ABNStatusFromDate="19991101">11000000948</ABN><EntityType><EntityTypeInd>PUB</EntityTypeInd><EntityTypeText>Australian Public Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LTD</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2000</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000000948</ASICNumber><GST status="ACT" GSTStatusFromDate="20000701" /><OtherEntity><NonIndividualName type="TRD"><NonIndividualNameText>QBE INSURANCE (INTERNATIONAL) LIMITED</NonIndividualNameText></NonIndividualName></OtherEntity></ABR>
<ABR recordLastUpdatedDate="20190531" replaced="N"><ABN status="CAN" ABNStatusFromDate="20190501">11000002568</ABN><EntityType><EntityTypeInd>PRV</EntityTypeInd><EntityTypeText>Australian Private Company</EntityTypeText></EntityType><MainEntity><NonIndividualName type="MN"><NonIndividualNameText>TOOHEYS PTY LIMITED</NonIndividualNameText></NonIndividualName><BusinessAddress><AddressDetails><State>NSW</State><Postcode>2141</Postcode></AddressDetails></BusinessAddress></MainEntity><ASICNumber ASICNumberType="undetermined">000002568</ASICNumber></ABR>
</Transfer>'''
soup=BeautifulSoup(data,'lxml')
for tag in soup.select('ABN'):
print("Tag:" + str(tag))
print("Tag Text " + tag.text)
for attr in tag.attrs:
print("Attribute name : "+ attr)
print("Attribute value : " + tag[attr])
Tag:<abn abnstatusfromdate="19991101" status="ACT">11000000948</abn>
Tag Text 11000000948
Attribute name : abnstatusfromdate
Attribute value : 19991101
Attribute name : status
Attribute value : ACT
Tag:<abn abnstatusfromdate="20190501" status="CAN">11000002568</abn>
Tag Text 11000002568
Attribute name : abnstatusfromdate
Attribute value : 20190501
Attribute name : status
Attribute value : CAN