Setup
I'm new to XML and UBL XML.
I'm trying to read the following .xml invoice into Python using ElementTree.
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2
http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
<cbc:UBLVersionID>2.1</cbc:UBLVersionID>
<cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
<cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
<cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
<cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
<cac:OrderReference>
# other stuff
</Invoice>
If I run root[4].text, I get the text of the IssueDate element back as a string, i.e. '2019-11-01'.
Problem
I want to get the text by tag name instead of by position. Both of these attempts:
root.find('IssueDate').text
root.find('cbc:IssueDate').text
give AttributeError: 'NoneType' object has no attribute 'text'.
Question
How do I get the text of the IssueDate element by its tag name?
More generally, how do I get the text of any element by its tag name?
Answer 0 (score: 0)
import xml.etree.ElementTree as ET
xml_string="""<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2
http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
<cbc:UBLVersionID>2.1</cbc:UBLVersionID>
<cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
<cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
<cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
<cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
<cac:OrderReference>ABC</cac:OrderReference>
</Invoice>"""
root = ET.fromstring(xml_string)
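I used a string as input here, whereas you are reading from an XML file. To get text by tag name, you first need to know the exact tag names as ElementTree stores them:
for child in root:
    print(child.tag, child.attrib)
Output: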
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}UBLVersionID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}CustomizationID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}ProfileID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}ID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}InvoiceTypeCode {'listAgencyID': '6', 'listID': 'UNCL1001'}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}DocumentCurrencyCode {'listAgencyID': '6', 'listID': 'ISO 4217 Alpha'}
{urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2}OrderReference {}
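Each tag printed above has the form {namespace-URI}localname. As a rough sketch of the more general "get any element's text by name" case, the local part can be split off and used as a dictionary key. The helper local_name and the dictionary text_by_name are hypothetical names introduced for illustration, and the invoice is trimmed to a few elements:

```python
import xml.etree.ElementTree as ET

# Trimmed-down invoice, for illustration only.
xml_string = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
<cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
</Invoice>"""
root = ET.fromstring(xml_string)


def local_name(tag):
    """Hypothetical helper: return the part of a '{uri}name' tag after the brace."""
    return tag.rsplit("}", 1)[-1]


# Map each direct child's local name to its text.
# Assumption: local names are unique among the children; if two children
# share a local name, the later one overwrites the earlier one.
text_by_name = {local_name(child.tag): child.text for child in root}

print(text_by_name["IssueDate"])
print(text_by_name["DocumentCurrencyCode"])
```

This only covers direct children of the root and discards the namespace, so it is a convenience for quick inspection rather than a replacement for namespace-aware lookups.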
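The output shows that your lookup logic was right, but the tag names were wrong. Because of the namespace declarations (the xmlns attributes) on the Invoice element, every child tag is stored with its namespace URI in braces, so neither 'IssueDate' nor 'cbc:IssueDate' finds the element on its own.
If you use the fully qualified name instead: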
root.find("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate").text
Output:
'2019-11-01'
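Here the namespace URI in braces, {urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}, is prepended to IssueDate.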
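Writing the full URI every time gets tedious. ElementTree's find, findtext and findall accept a prefix-to-URI mapping, which makes the 'cbc:IssueDate' form from the question work. A minimal sketch, assuming a trimmed-down invoice; the dictionary name ns is arbitrary:

```python
import xml.etree.ElementTree as ET

# Trimmed-down invoice, for illustration only.
xml_string = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
</Invoice>"""
root = ET.fromstring(xml_string)

# Prefix-to-URI map. The prefix here is only a local alias; it does not
# have to match the prefix used inside the document.
ns = {"cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"}

# find() returns the element (or None); findtext() returns its text directly.
issue_date = root.find("cbc:IssueDate", ns).text
invoice_id = root.findtext("cbc:ID", namespaces=ns)
print(issue_date)
print(invoice_id)

# Python 3.8 and later also accept a {*} wildcard that matches any namespace.
any_ns_date = root.find("{*}IssueDate").text
print(any_ns_date)
```

The explicit map is the safer choice when the same local name can occur in more than one namespace; the wildcard is convenient for quick exploration.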
I hope this answers your question.
Answer 1 (score: 0)
You can use BeautifulSoup:
from bs4 import BeautifulSoup as BS4
xml_test = """<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2
http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
<cbc:UBLVersionID>2.1</cbc:UBLVersionID>
<cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
<cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
<cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
<cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
<cac:OrderReference>ABC</cac:OrderReference>
</Invoice>"""
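# Name the parser explicitly to avoid the "no parser specified" warning.
# html.parser lowercases tag names, hence "cbc:issuedate" below.
soup = BS4(xml_test, "html.parser")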
tag = soup.find("cbc:issuedate")
print(tag.text)
The result is:
2019-11-01
If there are multiple issue dates, you can use:
tags = soup.find_all("cbc:issuedate")
for tag in tags:
    print(tag.text)
I hope this helps.
Answer 2 (score: 0)
You can also use SimplifiedDoc.
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2
http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
<cbc:UBLVersionID>2.1</cbc:UBLVersionID>
<cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
<cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
<cbc:ID>201909638</cbc:ID>
<cbc:IssueDate>2019-11-01</cbc:IssueDate>
<cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
<cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
<cac:OrderReference>
# other stuff
</Invoice>
'''
doc = SimplifiedDoc(html)
print(doc.getElementByTag('cbc:IssueDate').text)  # get one
lst = doc.getElementByTag('Invoice').getChildren()  # get all
for item in lst:
    print(item.tag, item.text)