Question

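I'm trying to generate an HTML document containing text from a data table, using a Python script and the xml.etree.ElementTree module. I'd like to format some of the cells to include HTML tags, typically <br /> or <sup></sup> tags. When I generate a string and write it to a file, I believe the XML parser is converting these tags into individual characters. The output shows the tags as text rather than processing them as tags. Here is a trivial example: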

import xml.etree.ElementTree as ET

root = ET.Element('html')
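# extraneous code removed; only the enclosing table and row are kept
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')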
td = ET.SubElement(tr, 'td')
td.text = 'This is the first line <br /> and the second'

# On Python 3, encoding='unicode' makes tostring return str instead of bytes
tree = ET.tostring(root, encoding='unicode')
with open('test.html', 'w') as out:
    out.write(tree)

When you open the resulting 'test.html' file, it displays the text string exactly as typed: 'This is the first line <br /> and the second'.

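The source of the HTML document itself shows the problem. It appears that the parser replaces the "less than" and "greater than" signs in the tag with the HTML entities for those characters: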

<!--Extraneous code removed-->
<td>This is the first line &lt;br /&gt; and the second</td>
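
The escaping shown above can be reproduced in isolation. The following is a minimal sketch, assuming Python 3 (where encoding='unicode' makes tostring return a str). It illustrates that anything assigned to .text is treated as character data, so the serializer escapes the angle brackets rather than emitting a tag:

```python
import xml.etree.ElementTree as ET

td = ET.Element('td')
# Text assigned to .text is character data, never markup.
td.text = 'first line <br /> second'

serialized = ET.tostring(td, encoding='unicode')
print(serialized)  # <td>first line &lt;br /&gt; second</td>

# The element has no children: the "<br />" never became an element.
print(len(td))  # 0
```

This is expected behavior for an XML serializer rather than a parser option that can be switched off: escaping is what keeps literal text from being mistaken for markup.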

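Obviously, my intent is for the document to process the tag itself rather than display it as text. I'm not sure whether there are parser options I can pass to make this work, or whether I should be using a different method. I'm open to using other modules (e.g. lxml) if that solves the problem. I'm mainly using the built-in XML module for convenience.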

The only thing I've come up with that works is to modify the final string with re substitutions before writing the file:

import re

tree = ET.tostring(root, encoding='unicode')  # str, so the str patterns below apply
tree = re.sub(r'&lt;', '<', tree)
tree = re.sub(r'&gt;', '>', tree)

This works, but it seems like it should be avoidable by using a different setting in xml. Any suggestions?
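
One reason to be wary of the blanket replacement is that it cannot tell angle brackets meant as markup from angle brackets meant as literal text. The sketch below assumes Python 3 and uses a made-up cell value; is_well_formed is a hypothetical helper introduced only for this illustration:

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Hypothetical helper: True if `markup` parses as XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

td = ET.Element('td')
# Made-up cell: "a < b" is literal text, "<br />" is meant as markup.
td.text = 'a < b <br /> next line'

escaped = ET.tostring(td, encoding='unicode')
# Same effect as the two re.sub calls: every entity is turned back.
unescaped = escaped.replace('&lt;', '<').replace('&gt;', '>')

print(is_well_formed(escaped))    # True
print(is_well_formed(unescaped))  # False: the literal "<" now looks like a tag
```

With content that never contains a literal "<" or ">" the replacement happens to work, which is why it appears fine in the simple test case.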

Answer 1

You can use the tail attribute together with the td and br elements to construct the text you want:

import xml.etree.ElementTree as ET


root = ET.Element('html')
table = ET.SubElement(root, 'table')
tr = ET.SubElement(table, 'tr')
td = ET.SubElement(tr, 'td')
td.text = "This is the first line "
# td.tail is the text after </td>; there is none here (None is also the default)
td.tail = None
br = ET.SubElement(td, 'br')
# now continue your text with br.tail
br.tail = " and the second"

# On Python 3, encoding='unicode' makes tostring return str instead of bytes
tree = ET.tostring(root, encoding='unicode')
# see the string
tree
'<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>'

with open('test.html', 'w') as f:
    f.write(tree)

# and the output html file
cat test.html
<html><table><tr><td>This is the first line <br /> and the second</td></tr></table></html>
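
The pattern above generalizes: an element's .text holds the text before its first child, and each child's .tail holds the text that follows that child's closing tag. As a rough sketch (assuming Python 3; set_lines is a hypothetical helper, not part of ElementTree), several lines can be placed in one cell like this:

```python
import xml.etree.ElementTree as ET

def set_lines(cell, lines):
    """Hypothetical helper: fill `cell` with `lines`, separated by <br /> elements."""
    cell.text = lines[0] if lines else None
    for line in lines[1:]:
        br = ET.SubElement(cell, 'br')
        br.tail = line  # text following the <br /> lives in its tail

td = ET.Element('td')
set_lines(td, ['first', 'second', 'third'])

result = ET.tostring(td, encoding='unicode')
print(result)  # <td>first<br />second<br />third</td>
```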

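As a side note, to include a <sup></sup> element and append text after it while staying inside the <td>, using tail has the desired effect as well: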

...
td.text = "this is first line "
sup = ET.SubElement(td, 'sup')
sup.text = "this is second"
# use tail to continue your text
sup.tail = "well and the last"

print(ET.tostring(root, encoding='unicode'))
<html><table><tr><td>this is first line <sup>this is second</sup>well and the last</td></tr></table></html>
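
When the cell content already exists as a markup string, another option (not part of the answer above) is to parse that string and graft the result onto the cell, which sets .text and .tail automatically. This is only a sketch under stated assumptions: Python 3, a fragment that is well-formed XML (so <br /> rather than a bare <br>), and a hypothetical helper named set_markup:

```python
import xml.etree.ElementTree as ET

def set_markup(cell, fragment):
    """Hypothetical helper: parse an XML fragment and place it inside `cell`."""
    # Wrap the fragment in a dummy root so mixed text and elements parse together.
    wrapper = ET.fromstring('<wrapper>' + fragment + '</wrapper>')
    cell.text = wrapper.text
    cell.extend(list(wrapper))  # children keep their own .text and .tail

td = ET.Element('td')
set_markup(td, 'this is first line <sup>this is second</sup>well and the last')

result = ET.tostring(td, encoding='unicode')
print(result)
# <td>this is first line <sup>this is second</sup>well and the last</td>
```

A fragment that is not well-formed raises ET.ParseError, so input that comes from outside the script would need to be validated or cleaned first.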

Adding XML tags to the text of an xml.etree.ElementTree element in Python