Removing newlines and whitespace when parsing XML with Python XPath

Date: 2016-05-24 23:24:04

Tags: python xml xpath newline

This is the XML file: http://www.diveintopython3.net/examples/feed.xml

My code is: (image not available)

My result is: (image not available)

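My questions are:

  1. How do I remove the \n and the spaces that follow it from the text?

  2. How do I get the node whose text is "dive into mark", and what is the syntax for searching by text?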

1 answer:

Answer 0 (score: 1)

Just call normalize-space(.) on each node:

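import lxml.etree as et

xml = et.parse("feed.xml")
# The feed uses the Atom namespace, so bind it to a prefix for XPath.
ns = {"ns": 'http://www.w3.org/2005/Atom'}
# For every <category>, step up to its parent and take that parent's <summary>.
for n in xml.xpath("//ns:category", namespaces=ns):
    t = n.xpath("./../ns:summary", namespaces=ns)[0]
    # normalize-space(.) trims the text and collapses whitespace runs.
    print(t.xpath("normalize-space(.)"))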

Output:

Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
Putting an entire chapter on one page sounds bloated, but consider this — my longest chapter so far would be 75 printed pages, and it loads in under 5 seconds… On dialup.
The accessibility orthodoxy does not permit people to question the value of features that are rarely useful and rarely used.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.
These notes will eventually become part of a tech talk on video encoding.

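All of your newlines are removed, leading and trailing whitespace is stripped, and runs of multiple spaces are replaced with a single space. Each summary is printed once per category in its entry, which is why the lines repeat.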
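If you already have the text as a Python string, the same cleanup can be approximated without XPath. The sketch below uses the standard library's xml.etree.ElementTree instead of lxml, and a hypothetical inline SAMPLE document standing in for feed.xml; the normalize_space helper is an illustration, not part of either library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for one entry of feed.xml; not the real file contents.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <summary type="html">Putting an entire chapter
      on one page    sounds bloated.
    </summary>
  </entry>
</feed>"""

NS = {"ns": "http://www.w3.org/2005/Atom"}


def normalize_space(text):
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    # XPath 1.0 treats space, tab, CR and LF as whitespace.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" \t\r\n")


root = ET.fromstring(SAMPLE)
summary = root.find(".//ns:summary", NS)

# Joining itertext() gives the node's full text, like "." in XPath.
raw = "".join(summary.itertext())
clean = normalize_space(raw)

print(repr(raw))
print(clean)
```

Restricting the pattern to those four characters keeps the helper close to the XPath definition; plain " ".join(text.split()) would also collapse other Unicode whitespace.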

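The second part of your question is really asking for the title tag, since that is the only tag containing the text you are looking for. To find specifically the title with that exact text, use: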

xml.xpath("//ns:title[text()='dive into mark']", namespaces=ns)

If you want any node with that text, just replace ns:title with a wildcard:

xml.xpath("//*[text()='dive into mark']", namespaces=ns)
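An exact comparison like text()='dive into mark' will miss a node whose text is padded with newlines or extra spaces. In lxml you can combine both ideas as //*[normalize-space(.)='dive into mark']. The sketch below shows the same distinction using only the standard library's xml.etree.ElementTree; the SAMPLE document, including its padded subtitle, is invented for illustration and is not the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed; the padded <subtitle> is made up for this example.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <subtitle>
    dive into   mark
  </subtitle>
  <entry><title>Dive into history</title></entry>
</feed>"""

NS = {"ns": "http://www.w3.org/2005/Atom"}
TARGET = "dive into mark"

root = ET.fromstring(SAMPLE)

# Exact match on the element's full text (ElementTree supports [.='...'] from Python 3.7).
exact = root.findall(".//ns:title[.='dive into mark']", NS)


def local_name(element):
    """Strip the '{namespace}' prefix ElementTree puts in front of tag names."""
    return element.tag.rsplit("}", 1)[-1]


# Whitespace-tolerant match: normalize each element's text before comparing.
tolerant = [
    el for el in root.iter()
    if " ".join("".join(el.itertext()).split()) == TARGET
]

print([local_name(el) for el in exact])
print([local_name(el) for el in tolerant])
```

The exact search finds only the feed title, while the tolerant search also picks up the padded subtitle.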