LXML - How to wrap all occurrences of specific text in a tag

Time: 2019-03-08 13:11:46

Tags: html parsing lxml

Consider the following HTML:

<div>
      Some foo text foo
      <p> text inside paragraph foo and also foo and <b> nested foo</b> and foo </p>
      foo is also here and can occur many times foo foo
      <p> here <a>foo</a> already appears inside a link so it is not changed</p>
      foo, yeah!
</div>

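I need to wrap every occurrence of 'foo' in a clickable link (an <a> element), except for occurrences that are already inside an <a>, so the expected output is: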

<div>
      Some <a>foo</a> text <a>foo</a>
      <p> text inside paragraph <a>foo</a> and also <a>foo</a> and <b> nested <a>foo</a></b> and <a>foo</a> </p>
      <a>foo</a> is also here and can occur many times <a>foo</a> <a>foo</a>
      <p> here <a>foo</a> already appears inside a link so it is not changed</p>
      <a>foo</a>, yeah!
</div>

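Is there a simple way to do this with lxml? At first, a raw substring replacement made more sense to me, but there is a requirement that occurrences inside certain HTML elements must not be changed.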

1 Answer:

Answer 0: (score: 0)

Well, BeautifulSoup seems to handle this better than raw lxml.

This code works well:

from bs4 import BeautifulSoup

x = """<div>
      Some foo text foo
      <p> text inside paragraph foo and also foo and <b> nested foo</b> and foo </p>
      foo is also here and can occur many times foo foo
      <p> here <a>foo</a> already appears inside a link so it is not changed</p>
      foo, yeah!
</div>"""

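# html.parser keeps the fragment as-is (no <html>/<body> wrapper is added)
s = BeautifulSoup(x, 'html.parser')
print(s)  # original markup

# Snapshot the text nodes first, because the tree is modified inside the loop
for text_node in list(s.strings):
    # Skip text anywhere inside an <a>, not only text whose direct parent is <a>
    if text_node.find_parent('a') is None and 'foo' in text_node:
        # Re-parse the text with each 'foo' wrapped and swap it in for the old node
        text_node.replace_with(BeautifulSoup(str(text_node).replace('foo', '<a>foo</a>'), 'html.parser'))

print(s)  # 'foo' is now wrapped everywhere except inside existing links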
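
One caveat with the snippet above: the text of each node is re-parsed as HTML, so text containing characters such as `<` or `&` may be re-interpreted during the replacement. A possible variant, sketched below, builds the new nodes directly with `new_tag` instead of re-parsing. The function name `wrap_word` and its parameters are made up for this illustration; they are not part of BeautifulSoup.

```python
import re

from bs4 import BeautifulSoup, NavigableString


def wrap_word(markup, word="foo", tag="a"):
    """Wrap each occurrence of `word` in <tag>, skipping text already inside <tag>.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(markup, "html.parser")
    pattern = re.compile(re.escape(word))

    # Snapshot the text nodes, because the tree is modified while looping.
    for text_node in list(soup.strings):
        if text_node.find_parent(tag) is not None:
            continue  # already inside a link (at any depth)
        text = str(text_node)
        if not pattern.search(text):
            continue

        # Split the text into plain strings and new <tag> elements.
        pieces = []
        pos = 0
        for match in pattern.finditer(text):
            if match.start() > pos:
                pieces.append(NavigableString(text[pos:match.start()]))
            link = soup.new_tag(tag)
            link.string = match.group(0)
            pieces.append(link)
            pos = match.end()
        if pos < len(text):
            pieces.append(NavigableString(text[pos:]))

        # Put the pieces in front of the old text node, then drop the old node.
        for piece in pieces:
            text_node.insert_before(piece)
        text_node.extract()

    return str(soup)


if __name__ == "__main__":
    print(wrap_word("<div>Some foo text <a>foo</a> and 1 &lt; 2 foo</div>"))
```

Because no markup is re-parsed, escaped characters in the original text stay escaped in the output.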

Edit: Using html.parser matters. Passing "lxml" when constructing the replacement HTML fragment does not work well (it wraps the fragment in html tags).
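
Since the question asked about lxml specifically, here is a rough sketch of how the same idea might look there. In lxml, text lives in the `.text` of an element and in the `.tail` of its children, so both have to be split. The helper names `_build_links` and `wrap_occurrences` are invented for this example, and this is only one possible approach, not an lxml-provided feature.

```python
import re

from lxml import etree, html


def _build_links(text, pattern, tag):
    """Split `text` on `pattern`.

    Returns (leading_text, new_elements); each new element holds one match,
    and its tail holds the text that followed the match.
    """
    links = []
    leading = text
    pos = 0
    for match in pattern.finditer(text):
        before = text[pos:match.start()]
        if links:
            links[-1].tail = before
        else:
            leading = before
        link = etree.Element(tag)
        link.text = match.group(0)
        links.append(link)
        pos = match.end()
    if links:
        links[-1].tail = text[pos:]
    return leading, links


def wrap_occurrences(root, word="foo", tag="a"):
    """Wrap `word` in <tag> everywhere under `root`, except inside existing <tag>."""
    pattern = re.compile(re.escape(word))
    # Snapshot the elements, so the newly inserted links are not visited again.
    for el in list(root.iter()):
        if not isinstance(el.tag, str):
            continue  # comments and processing instructions
        if el.tag == tag or any(anc.tag == tag for anc in el.iterancestors()):
            continue
        # The tail of each child is text that belongs to `el`.
        for child in list(el):
            if child.tail:
                child.tail, links = _build_links(child.tail, pattern, tag)
                index = el.index(child) + 1
                for offset, link in enumerate(links):
                    el.insert(index + offset, link)
        # The text before the first child.
        if el.text:
            el.text, links = _build_links(el.text, pattern, tag)
            for offset, link in enumerate(links):
                el.insert(offset, link)
    return root


if __name__ == "__main__":
    fragment = html.fragment_fromstring(
        "<div>Some foo text foo<p>here <a>foo</a> stays</p>foo, yeah!</div>"
    )
    print(html.tostring(wrap_occurrences(fragment), encoding="unicode"))
```

Parsing with `html.fragment_fromstring` assumes the input has a single top-level element, as the `<div>` in the question does.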