使用lxml查找/替换文本,同时保持HTML结构

时间:2016-02-26 12:24:11

标签: python replace lxml

我试图构建一个简单的脚本,根据字典将超链接插入HTML。

它可以使用简单的str.replace()进行操作,但在遇到现有的超链接或其他标记时会中断,然后会被渲染掉。

我已经看到了问题的广泛answers on StackOverflow,建议使用lxml和BeatifulSoup,但我遇到了具体的问题,并希望有人能帮我推进正确的方向。< / p>

以下是一个示例示例:

test_string = """<p>
    COLD and frosty <a href="/weather">weather</a> has gripped the area and the Met Office have predicted strong winds along will the possibility of sleet and snow later in the week.
</p>
<p>
    For more details take a look at the full five-day weather forecast below.
</p>
<h3>Weather map</h3>
...
<p>
    <strong>Today:</strong> It will be a cold start today with a hard frost, perhaps with isolated wintry showers in the east at first. There will be plenty of sunny spells through the day, especially towards the west. Feeling cold. Maximum Temperature 6C.
</p>
<p>
    <strong>Tonight:</strong> Tonight will be quiet with long clear spells and light winds. A widespread frost is expected with very cold temperatures. Minimum Temperature -5C.
</p>
<p>
    <strong>Tuesday:</strong> Tuesday will be a similar day with a cold, frosty but bright start. There will be plenty of sunshine, with skies clouding over into the evening. Maximum Temperature 7C.
</p>"""
import lxml.html
root = lxml.html.fromstring(test_string)
paragraphs = root.xpath("//*")

for p in paragraphs:
    if p.tag == "p":
       DO SOMETHING

逻辑是只处理文章中的段落标记(并避免标题标记和表格)。

然而,问题在于我无法找到一种方法来处理标签中的文本,以便保留标签的位置,例如

如何使用lxml将段落分成块:

  1. 我可以在其上运行查找/替换
  2. 的文本
  3. 必须保留的标签
  4. 以这种方式我可以保留整体结构?

    更新: 例如,我可能希望将Met Office替换为<a href="met_office.html">Met Office</a>,同时保留其他文本,因此:

    <p>
            COLD and frosty <a href="/weather">weather</a> has gripped the area and the <a href="met_office.html">Met Office</a> have predicted strong winds along will the possibility of sleet and snow later in the week.
        </p>
    

2 个答案:

答案 0 :(得分:0)

例如,要替换段落文字,请使用node.set(attribute, value),然后使用lxml.html.tostring()保存更改:

p.set('string', '')
result = lxml.html.tostring(root)

答案 1 :(得分:0)

只需使用BeautifulSoup(可选)使用lxml:

from bs4 import BeautifulSoup
soup = BeautifulSoup(test_string)
pars = soup.findall('p')
for p in pars:
  p.string = 'this text was replaced'
print s.prettify()

结果是:

<html>
 <body>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <h3>
   Weather map
  </h3>
  ...
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
  <p>
   this text was replaced
  </p>
 </body>
</html>

更新:如果要在段落文本中搜索字符串并将其包装在新的<a>标记中,则应将正则表达式与new_tag() function in BeautifulSoup结合使用工作。