How to replace all text nodes with lxml

Time: 2018-08-19 00:09:24

Tags: python html-parsing lxml

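I want to replace the text of every node inside an element without changing its structure. The following code only changes the text of leaf tags; text that sits next to child elements is left untouched (itertext() is not enough, because it only returns plain strings). I found an approach here, but detecting every structural case is cumbersome.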

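def replace_text(tree):  # tree: Element from lxml
    for tag in tree.iter():
        if not len(tag):  # only leaf elements (no children) reach this branch
            if tag.text is not None:
                tag.text = 'z1'
        else:
            pass  # elements with children are skipped, so their text stays unchanged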

From:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Title</title>
</head>
<body>
<div>
  a1
  <div>
    a2
    <div>
      a3
      <p>
        a4
        a5
      </p>
      <p>
        a6
        a7
        <br>
        <span>a8</span>
      </p>
    </div>
  </div>
</div>
</body>
</html>

Expected:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Title</title>
</head>
<body>
<div>
  z1
  <div>
    z2
    <div>
      z3
      <p>
        z4
        z5
      </p>
      <p>
        z6
        z7
        <br>
        <span>z8</span>
      </p>
    </div>
  </div>
</div>
</body>
</html>
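
One possible way to get from the input to the expected output is sketched below. It is not a confirmed answer from this thread. It relies on the lxml/ElementTree data model, where text before an element's first child is stored in `.text` and text after an element's closing tag is stored in that element's `.tail`. The `transform` parameter and the `a` to `z` substitution are assumptions read off the example, and the fallback import is only there so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Assumption: the stdlib ElementTree exposes the same .text/.tail model,
    # so it is used as a fallback when lxml is unavailable.
    import xml.etree.ElementTree as etree


def replace_text(tree, transform):
    """Apply transform to every text node below tree; elements are not touched."""
    for el in tree.iter():
        if not isinstance(el.tag, str):
            # Comments and processing instructions have a non-string tag;
            # their content is not document text, so leave it alone.
            continue
        if el.text:
            # Text between the opening tag and the first child (or closing tag).
            el.text = transform(el.text)
        for child in el:
            if child.tail:
                # Text that follows a child's closing tag, still inside el.
                child.tail = transform(child.tail)


# Small well-formed fragment modelled on the question's markup.
root = etree.fromstring(
    "<div>a1<div>a2<p>a4 a5</p><p>a6 a7<br/><span>a8</span>a9</p></div></div>"
)
replace_text(root, lambda s: s.replace("a", "z"))
print(etree.tostring(root, encoding="unicode"))
```

Because tails are handled through the parent, the tail of the element passed in is never modified, so text outside the chosen subtree stays as it was. For the document in the question, passing the `<body>` element rather than the root would leave `<title>` unchanged, as the expected output requires. Whitespace-only text nodes pass through `str.replace` unchanged, so the indentation is preserved.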

0 answers:

No answers