Creating raw text from XML tags

Date: 2019-09-06 15:19:09

Tags: python xml elementtree

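I have some XML that has been run through an NLP processor. I have to modify the output in a Python script, so XSLT is not an option for me. I'm trying to extract all of the raw text inside <TXT></TXT> from the XML as a string, but I'm stuck on how to get the raw text out of ElementTree.

So far, my code is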

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

tree = ET.parse(xml_doc) # xml_doc is actually a file, but for reproducibility it's the above xml

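From there I want to extract everything inside TXT as a string with the tags stripped. It has to be a string for some other downstream processes. I'd like it to look like output_txt below: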

output_txt = "George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222."

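I imagine this should be fairly easy and straightforward, but I just can't figure it out. I tried using this solution, but I got AttributeError: 'ElementTree' object has no attribute 'itertext', and it would strip all of the tags in the XML rather than only the tags between <TXT></TXT>.

1 Answer:

Answer 0 (score: 2)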

Normally I would do this with plain XPath:

normalize-space(//TXT)

However, XPath support in ElementTree is limited, so you can only do that in lxml.
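For reference, here is a minimal sketch of how the XPath expression above could be evaluated with lxml. It assumes the third-party lxml package is installed, and it uses a shortened copy of the question's XML (the id attributes are dropped for brevity). The document is passed as bytes because lxml rejects a str that carries an encoding declaration.

```python
from lxml import etree  # third-party package, not part of the standard library

# Shortened copy of the question's XML; bytes, because of the encoding declaration
xml_doc = b"""<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

root = etree.fromstring(xml_doc)

# normalize-space() takes the string value of the first TXT element,
# trims it, and collapses every run of whitespace to a single space
output_txt = root.xpath("normalize-space(//TXT)")
print(output_txt)
```

Because normalize-space() only looks at the first node matched by //TXT, a document with several TXT elements would need a loop over root.xpath("//TXT") instead.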

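To do this in ElementTree, I would do something similar to the answer you linked to in your question: use tostring with method="text" to force the element to plain text. You will also want to normalize the whitespace.

Example...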

import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S sid="112233-SENT-001"><ENAMEX type="PERSON" id="PER-112233-001">George Washington</ENAMEX> and <ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S sid="112233-SENT-002"><ENAMEX type="PERSON" id="PER-112233-002">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN" id="SSN-112233-075">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

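# fromstring returns the root Element (NORMDOC), not an ElementTree
tree = ET.fromstring(xml_doc)

# Restrict the extraction to the first TXT element
txt = tree.find(".//TXT")
# method='text' drops all markup and keeps only the text content
raw_text = ET.tostring(txt, encoding='utf8', method='text').decode('utf8')
# split() with no arguments discards newlines and indentation
normalized_text = " ".join(raw_text.split())
print(normalized_text)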

Printed output...

George Washington and Thomas Jefferson were both founding fathers. Thomas Jefferson has a social security number of 222-22-2222.
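
The AttributeError in the question comes from calling itertext on the ElementTree wrapper that ET.parse returns; itertext is a method of Element. As an alternative sketch, the lines below parse from a file-like object, select the TXT element, and call itertext on it. The io.BytesIO object is only a stand-in for the real file, and the XML is a shortened copy of the question's document.

```python
import io
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML (id attributes dropped for brevity)
xml_doc = """<?xml version="1.0" encoding="UTF-8"?>
<NORMDOC>
   <DOC>
      <DOCID>112233</DOCID>
      <TXT>
        <S><ENAMEX type="PERSON">George Washington</ENAMEX> and <ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> were both founding fathers.</S>
        <S><ENAMEX type="PERSON">Thomas Jefferson</ENAMEX> has a social security number of <IDEX type="SSN">222-22-2222</IDEX>.</S>
      </TXT>
   </DOC>
</NORMDOC>
"""

# Stand-in for the real file; ET.parse also accepts a filename
xml_file = io.BytesIO(xml_doc.encode("utf-8"))

tree = ET.parse(xml_file)    # ElementTree: has find(), but no itertext()
txt = tree.find(".//TXT")    # Element: has itertext()

# itertext() yields every text and tail fragment under TXT in document order
raw_text = "".join(txt.itertext())
output_txt = " ".join(raw_text.split())
print(output_txt)
```

Selecting TXT before calling itertext is what keeps the DOCID value out of the result.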