Get text based on an XML tag name

Time: 2019-12-03 13:15:19

Tags: python xml tags

Setup

I am new to XML and to UBL XML.

I am trying to read the following .xml invoice into Python with ElementTree.

<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2&#xA;http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
  <cbc:UBLVersionID>2.1</cbc:UBLVersionID>
  <cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
  <cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
  <cbc:ID>201909638</cbc:ID>
  <cbc:IssueDate>2019-11-01</cbc:IssueDate>
  <cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
  <cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
  <cac:OrderReference>
  # other stuff
</Invoice>

If I run root[4].text, I get the text of the IssueDate tag back as a string, i.e. '2019-11-01'.


Problem

I want to get the text based on the name of the tag. I tried:

  • root.find('IssueDate').text
  • root.find('cbc:IssueDate').text

Both give AttributeError: 'NoneType' object has no attribute 'text'


Question

How do I get the text based on the tag name IssueDate?

More generally, how do I get the text of any tag based on its name?

3 answers:

Answer 0: (score: 0)

import xml.etree.ElementTree as ET   

xml_string="""<?xml version="1.0" encoding="UTF-8"?>
    <Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2&#xA;http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
      <cbc:UBLVersionID>2.1</cbc:UBLVersionID>
      <cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
      <cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
      <cbc:ID>201909638</cbc:ID>
      <cbc:IssueDate>2019-11-01</cbc:IssueDate>
      <cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
      <cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
      <cac:OrderReference>ABC</cac:OrderReference>
    </Invoice>"""

root = ET.fromstring(xml_string)

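Here I used a string as input; the same works with your XML file. To get the text by tag name, you first need to know what the tag names actually look like after parsing.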

for child in root:
    print(child.tag, child.attrib)

Output:

{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}UBLVersionID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}CustomizationID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}ProfileID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}ID {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate {}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}InvoiceTypeCode {'listAgencyID': '6', 'listID': 'UNCL1001'}
{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}DocumentCurrencyCode {'listAgencyID': '6', 'listID': 'ISO 4217 Alpha'}
{urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2}OrderReference {}

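You can see that your logic for finding the text is correct, but the tag name is wrong. Because of the xmlns namespace declarations on the Invoice element, every tag is stored as {namespace-URI}name, so neither 'IssueDate' nor 'cbc:IssueDate' can be used directly to find the text.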
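If matching on the bare name is what you are after, one option is to compare only the part of each tag that follows the closing brace. The snippet below is a minimal sketch, not the only way to do it: the helper names local_name and text_by_local_name are made up for illustration, and the XML is a shortened copy of the invoice.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the invoice, kept small for illustration.
XML = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
         xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cbc:ID>201909638</cbc:ID>
  <cbc:IssueDate>2019-11-01</cbc:IssueDate>
</Invoice>"""


def local_name(tag):
    """Drop the '{uri}' part that ElementTree puts in front of a tag (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


def text_by_local_name(root, name):
    """Return the text of the first element whose local name matches, else None (hypothetical helper)."""
    for elem in root.iter():
        # Skip anything whose tag is not a plain string (e.g. comments, if the parser kept them).
        if isinstance(elem.tag, str) and local_name(elem.tag) == name:
            return elem.text
    return None


root = ET.fromstring(XML)
issue_date = text_by_local_name(root, "IssueDate")
print(issue_date)
```

Note that this ignores the namespace entirely, so two elements with the same local name in different namespaces are not distinguished; the first one in document order wins.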

If you use the fully qualified tag name:

root.find("{urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2}IssueDate").text

Output:

'2019-11-01'

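The URI "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" is put in front of IssueDate because of the cbc prefix in the tag name. For a tag without a prefix, the default namespace "urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" is put in front instead.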
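Since the prefix is only shorthand for the URI, find() and findtext() can resolve it for you when you pass a prefix-to-URI dictionary as the namespaces argument. The following is a small sketch using a shortened copy of the invoice; the variable name NS is arbitrary, and DueDate is used only as an example of a tag that is not present in this snippet.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the invoice, kept small for illustration.
XML = """<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2"
         xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2">
  <cbc:ID>201909638</cbc:ID>
  <cbc:IssueDate>2019-11-01</cbc:IssueDate>
</Invoice>"""

# Prefix -> namespace URI. The prefix chosen here only has to match the one
# used in the search path; the URI is what must match the document.
NS = {"cbc": "urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2"}

root = ET.fromstring(XML)

# find() returns the element (or None), so .text is read in a second step.
issue_date = root.find("cbc:IssueDate", NS).text

# findtext() returns the text directly and falls back to `default`
# instead of raising AttributeError when the tag is absent.
invoice_id = root.findtext("cbc:ID", namespaces=NS)
missing = root.findtext("cbc:DueDate", default=None, namespaces=NS)

print(issue_date, invoice_id, missing)
```

Using findtext() with a default avoids the 'NoneType' object has no attribute 'text' error for tags that may be missing from a given invoice.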

I hope this answers your question.

Answer 1: (score: 0)

You can use BeautifulSoup.

from bs4 import BeautifulSoup as BS4

xml_test = """<?xml version="1.0" encoding="UTF-8"?>
    <Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2&#xA;http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
      <cbc:UBLVersionID>2.1</cbc:UBLVersionID>
      <cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
      <cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
      <cbc:ID>201909638</cbc:ID>
      <cbc:IssueDate>2019-11-01</cbc:IssueDate>
      <cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
      <cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
      <cac:OrderReference>ABC</cac:OrderReference>
    </Invoice>"""

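# Name the parser explicitly; html.parser lowercases tag names,
# which is why the lookups use "cbc:issuedate".
soup = BS4(xml_test, "html.parser")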

tag = soup.find("cbc:issuedate")

print(tag.text)

The result will be

2019-11-01

If you have several issue dates, you can use

tags = soup.find_all("cbc:issuedate")
for tag in tags:
    print(tag.text)

I hope this helps.

Answer 2: (score: 0)

You can also use SimplifiedDoc.

from simplified_scrapy.simplified_doc import SimplifiedDoc 
html = '''
<?xml version="1.0" encoding="UTF-8"?>
<Invoice xmlns="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2" xmlns:cac="urn:oasis:names:specification:ubl:schema:xsd:CommonAggregateComponents-2" xmlns:cbc="urn:oasis:names:specification:ubl:schema:xsd:CommonBasicComponents-2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:specification:ubl:schema:xsd:Invoice-2&#xA;http://docs.oasis-open.org/ubl/os-UBL-2.1/xsd/maindoc/UBL-Invoice-2.1.xsd">
  <cbc:UBLVersionID>2.1</cbc:UBLVersionID>
  <cbc:CustomizationID>urn:www.cenbii.eu:transaction:biitrns010:ver2.0:extended:urn:www.peppol.eu:bis:peppol4a:ver2.0:extended:urn:www.simplerinvoicing.org:si:si-ubl:ver1.1.x</cbc:CustomizationID>
  <cbc:ProfileID>urn:www.cenbii.eu:profile:bii04:ver2.0</cbc:ProfileID>
  <cbc:ID>201909638</cbc:ID>
  <cbc:IssueDate>2019-11-01</cbc:IssueDate>
  <cbc:InvoiceTypeCode listAgencyID="6" listID="UNCL1001">380</cbc:InvoiceTypeCode>
  <cbc:DocumentCurrencyCode listAgencyID="6" listID="ISO 4217 Alpha">EUR</cbc:DocumentCurrencyCode>
  <cac:OrderReference>
  # other stuff
</Invoice>
'''
doc = SimplifiedDoc(html)
print(doc.getElementByTag('cbc:IssueDate').text)  # get one
lst = doc.getElementByTag('Invoice').getChildren()  # get all
for item in lst:
    print(item.tag, item.text)