将html页面中的`\ n`替换为python LXML中的空格

时间:2014-08-21 06:27:38

标签: python lxml

我有一个不清楚的xml并使用python lxml模块处理它。我希望在处理之前用\n替换内容中的所有space,如何为所有元素的文本执行此操作。

修改 我的xml示例:

<root>
    <a> dsdfs\n dsf\n sdf\n</a>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    ....
    ....
    ....
    ....
</root>

当我打印ittertext时,我不想在输出中得到它:

root = #get root element
for i in root.ittertext():
   print i

dsdfs  dsf  sdf
dsdfs  dsf  sdf
sdf  nsdf sdf  

2 个答案:

答案 0 :(得分:1)

下面的代码会将xml解析为字符串,然后用\n替换space,然后写入新的xml文件。您可以在两者之间进行其他处理,具体取决于您想要做什么。

from lxml import etree 
tree = etree.parse('some.xml') 
root = tree.getroot()
# Get the whole XML content as  string
xml_in_str = etree.tostring(root)

# Replace all \n with space
new_xml_data = xml_in_str.replace(r'\n', ' ')

# Do the processing with the new_xml_data string which is formatted

# Maybe also write to a new XML file, without the \n
with open('newxml.xml', 'w') as f:
    f.write(new_xml_data)

some.xml看起来像:

<root>
    <a> dsdfs\n dsf\n sdf\n</a>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
    <bds> 
        <d>sdf\n\n\n\n\n\n</d>
        <d>sdf\n\n\nsdf\nsdf\n\n</d>
    </bds>
</root>

newxml.xml看起来像:

<root>
    <a> dsdfs  dsf  sdf </a>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
    <bds> 
        <d>sdf      </d>
        <d>sdf   sdf sdf  </d>
    </bds>
</root>

答案 1 :(得分:-1)

您尝试过的代码究竟是什么?字符串对于初学者来说是不可变的,并且没有&#34; replaceall&#34; Python中的方法

for i in root_elem.itertext():
    j = i.replace('\n',' ')
    print(j+'\n')  # or some fp.write call to a new file