Using lxml

Date: 2019-05-29 08:02:04

Tags: python xml lxml

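I am writing an XML file with lxml, and the content of one of the nodes is a very long string. I am looking for a way to wrap such strings across several lines inside the XML node.

So far I have tried the following: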

from lxml import etree


def lines_lenght(string, width):
    words = string.split()
    for i in range(0, len(words), width):
        yield " ".join(words[i:i+width])

s = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis egestas. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Sed laoreet interdum enim ut cursus. Fusce condimentum dictum dictum. Morbi feugiat bibendum enim, ut mollis turpis tincidunt vitae. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce libero ante, consectetur at sollicitudin at, eleifend lacinia ipsum. In hac habitasse platea dictumst. Sed laoreet mi eu nisi condimentum, sit amet vestibulum purus elementum. Nam a eros mi. 
"""


root = etree.Element("corpus")

doc = etree.ElementTree(root)

article_node = etree.SubElement(root, "article")

final_content = "\n".join(lines_lenght(s, 10))
article_node.text = final_content

doc.write("corpus.xml", xml_declaration=True, encoding="utf-8")

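However, the line breaks do not seem to be preserved in the generated XML file. Following another answer, I also tried using a newline character reference instead of \n, but the result was the same.

Any hints that could help me?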

Edit: here is a preview of what I am trying to achieve:

<corpus>
<article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in
enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis
egestas. Orci varius natoque penatibus et magnis dis parturient montes</article>
</corpus>

Instead of:

<corpus>
<article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis egestas. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.</article>
</corpus>

1 answer:

Answer 0 (score: 0)

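Well, it took a while to get there. On the way I had to get help from this answer here, and to drop lxml (which, as others have said, is a great library but has many limitations) in favor of Python's built-in xml library.

Start as you did, but stop after article_node.text = final_content (before doc.write()). Then add this function from the answer linked above: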

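def indent(elem, level=0):
    # Whitespace that puts the next tag on its own line at this depth.
    i = "\n" + level * "  "
    if len(elem):
        # Element with children: indent before the first child.
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level + 1)
        # The last child's tail puts the parent's closing tag back at this depth.
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        # Leaf element: only its tail needs adjusting; its text is left alone.
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i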

Then continue with:

import xml.etree.ElementTree as ET
root2 = ET.fromstring(etree.tostring(doc))
tree = ET.ElementTree(root2)

indent(root2)
tree.write("corpus.xml", encoding="utf-8", xml_declaration=True)

To test:

with open("corpus.xml") as f:
    print(f.read())

Output:

<?xml version='1.0' encoding='utf-8'?>
<corpus>
  <article>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec in
enim at arcu tincidunt tristique. Ut commodo dui hendrerit lobortis
egestas. Orci varius natoque penatibus et magnis dis parturient montes,
nascetur ridiculus mus. Sed laoreet interdum enim ut cursus. Fusce
condimentum dictum dictum. Morbi feugiat bibendum enim, ut mollis turpis
tincidunt vitae. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Fusce libero ante, consectetur at sollicitudin at, eleifend lacinia ipsum.
In hac habitasse platea dictumst. Sed laoreet mi eu nisi
condimentum, sit amet vestibulum purus elementum. Nam a eros mi.</article>
</corpus>
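
If you would rather not trust how an editor or browser renders the file, a round trip is one way to check that the line breaks inside a text node really survive the write. This is only a minimal sketch using the standard library; the three-line text and the check.xml temporary file are made up for illustration and are not the corpus.xml from above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a tiny tree whose text node contains explicit line breaks.
root = ET.Element("corpus")
article = ET.SubElement(root, "article")
article.text = "first line\nsecond line\nthird line"

# Write it to a throwaway file, then parse that file back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "check.xml")  # hypothetical file name
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    reparsed = ET.parse(path).getroot()

# If the newlines were preserved, splitting gives back the three lines.
lines = reparsed.find("article").text.split("\n")
print(len(lines), lines)
```

If the split returns a single element, the newlines were lost before the write; if it returns several, the file is fine and the viewer is probably collapsing whitespace.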

Someone more familiar with the xml libraries can probably make this shorter, but this is the best I could do...
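
For what it's worth, a shorter variant might look like the sketch below. It assumes Python 3.9 or later, where xml.etree.ElementTree.indent is available, and it swaps the word-count helper for textwrap.fill, which wraps by character width rather than by number of words. The names build_corpus and sample are hypothetical and only there for illustration.

```python
import textwrap
import xml.etree.ElementTree as ET


def build_corpus(text, width=70):
    """Return a <corpus> element whose <article> text is wrapped at `width` characters."""
    root = ET.Element("corpus")
    article = ET.SubElement(root, "article")
    # textwrap.fill collapses whitespace and joins the wrapped lines with "\n".
    article.text = textwrap.fill(text, width=width)
    # ET.indent (Python 3.9+) only adjusts whitespace between tags,
    # so the wrapped text of the leaf <article> node is left untouched.
    ET.indent(root, space="  ")
    return root


sample = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. " * 4
root = build_corpus(sample, width=40)
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))
```

To write a file instead of printing, ET.ElementTree(root).write("corpus.xml", encoding="utf-8", xml_declaration=True) should work the same way as in the answer above.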