How do I add marked-up text from a plain string to an element?

Time: 2019-07-10 07:18:04

Tags: python xml lxml

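Using Python lxml, I want to generate an etree.Element whose content is taken from a string. There are two cases:

  1. The string is plain text (for example: "Hello world!").
  2. The string contains markup, but to Python it is still just a string, and I do not know in advance that it contains markup (for example: "Hello <value-of select="world"/>!").

How do I handle the second case?

Here is a naive approach that does not work: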

>>> from lxml import etree
>>> string = "Hello <value-of select=\"world\"/>!"
>>> xml = etree.Element('root')
>>> xml.text = string
>>> etree.tostring(xml)
b'<root>Hello &lt;value-of select="world"/&gt;!</root>'

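I understand that if the structure of the string is known, I have to use the tail attribute of etree.Element, as described in the lxml tutorial. So here is an approach that works but does not generalize: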

>>> from lxml import etree
>>> xml2 = etree.Element('root')
>>> xml2.text = "Hello "
>>> valueof = etree.SubElement(xml2, 'value-of')
>>> valueof.set('select', 'world')
>>> valueof.tail = '!'
>>> etree.tostring(xml2)
b'<root>Hello <value-of select="world"/>!</root>'
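
As a quick illustration of where lxml keeps each piece of mixed content, the sketch below rebuilds the same element and prints its parts. The variable names are only illustrative.

```python
from lxml import etree

# Rebuild the element from the example above.
root = etree.Element("root")
root.text = "Hello "                      # text before the first child
child = etree.SubElement(root, "value-of", select="world")
child.tail = "!"                          # text that follows the child element

print(repr(root.text))                    # 'Hello '
print(child.tag, child.get("select"))     # value-of world
print(repr(child.tail))                   # '!'
```

The text that follows a child element belongs to that child's tail, not to the parent, which is why assigning the whole string to root.text cannot produce mixed content.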

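But how can this be done automatically, without knowing the exact string in advance?

I don't know how to parse the string so that I can split it into its parts. Or perhaps I should try a different approach.

This is what I tried: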

>>> from lxml import etree
>>> from io import StringIO
>>> string="Hello <value-of select=\"world\"/>!"
>>> tree = etree.parse(StringIO(string))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:81117)
  File "src/lxml/parser.pxi", line 1828, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:118072)
  File "src/lxml/parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:118341)
  File "src/lxml/parser.pxi", line 1729, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:116899)
  File "src/lxml/parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:110886)
  File "src/lxml/parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:105109)
  File "src/lxml/parser.pxi", line 706, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:106817)
  File "src/lxml/parser.pxi", line 635, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:105671)
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

This fails because etree.parse requires well-formed XML and the string has no root element. So I tried the following, hoping it would be less strict:

>>> tree = etree.parse(StringIO(string), etree.XMLParser(recover=True))
>>> etree.tostring(tree)

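But the output is empty, so it seems I cannot parse the string and add the resulting tree to an existing tree, which is what I need to do because I am writing the XML from scratch.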

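Back to my question: how do I handle the two cases described above?

1 answer:

Answer 0: (score: 0)

Just wrap the string (plain or marked up) in a root element so that it becomes well-formed XML.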

from lxml import etree

simple = "Hello world!"
tagged = "Hello <value-of select=\"world\"/>!"

xml1 = "<root>" + simple + "</root>"
xml2 = "<root>" + tagged + "</root>"

# fromstring() returns an Element object 
elem1 = etree.fromstring(xml1) 
elem2 = etree.fromstring(xml2)
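
If the content has to go into an element that already exists, the same wrapping idea can be packaged as a small helper. The following is only a sketch under a few assumptions: `fill_element` and the temporary `wrapper` tag are hypothetical names (not part of lxml), and a string that cannot be parsed as XML (for example one with a bare "&") is simply stored as literal text.

```python
from lxml import etree

def fill_element(target, content):
    """Fill `target` from `content`, which may or may not contain markup.

    Hypothetical helper for illustration; not part of lxml.
    """
    try:
        # Wrap the fragment in a temporary root so it is well-formed XML.
        wrapper = etree.fromstring("<wrapper>" + content + "</wrapper>")
    except etree.XMLSyntaxError:
        # Assumption: unparseable input is treated as plain text.
        target.text = content
        return target
    target.text = wrapper.text
    # Move the parsed children over; in lxml each child keeps its tail.
    target.extend(list(wrapper))
    return target

plain = fill_element(etree.Element("root"), "Hello world!")
tagged = fill_element(etree.Element("root"), 'Hello <value-of select="world"/>!')
broken = fill_element(etree.Element("root"), "Fish & chips")

print(etree.tostring(plain))   # b'<root>Hello world!</root>'
print(etree.tostring(tagged))  # b'<root>Hello <value-of select="world"/>!</root>'
print(etree.tostring(broken))  # b'<root>Fish &amp; chips</root>'
```

Copying the children through `list(wrapper)` avoids modifying the wrapper while iterating over it. Whether falling back to plain text is the right behaviour for malformed input depends on the application; raising the error may be preferable.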