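Add a link to every word in a web page using BeautifulSoup

Time: 2018-06-19 15:22:02

Tags: python beautifulsoup

I am trying to wrap every word in a web page in a link (an href) and then save the page with the links added. I am using BeautifulSoup for this, and the following code works: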

wordToSearch = "war"
for text in soup2.find_all(text=True):
    if re.search(r'(\w*)%s\b' %wordToSearch, text):
        text.replaceWith(BeautifulSoup(re.sub(r'(\w*)%s\b' % wordToSearch, r'<a href="http://example.com/%s">%s</a>' %(wordToSearch, wordToSearch), text, re.UNICODE), 'html.parser'))

I then write the new file with the following code:

with open("output1.html", "w") as file:
    file.write(str(soup2))

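This works only when I need to add a link to a single, specific word. If I want to add links for a list of words, I don't know how to do it. This is my attempt: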

listOfWords = ["war", "love"]

for text in soup2.find_all(text=True):
    for a in listOfWords:
        if re.search(r'(\w*)%s\b' %a, text):
            text.replaceWith(BeautifulSoup(re.sub(r'(\w*)%s\b' %a, r'<a href="https://it.wiktionary.org/wiki/%s">%s</a>' %(a, a), text, re.UNICODE), 'html.parser'))

This is what I get when I run it:

Traceback (most recent call last):
  File "./test.py", line 110, in <module>
    text.replaceWith(BeautifulSoup(re.sub(r'(\w*)%s\b' % wordToSearch, r'<a href="http://example.com/%s">%s</a>' %(wordToSearch, wordToSearch), text, re.UNICODE), 'html.parser'))
  File "/Library/Python/2.7/site-packages/bs4/element.py", line 235, in replace_with
    "Cannot replace one element with another when the"
ValueError: Cannot replace one element with another when theelement to be replaced is not part of a tree

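1 Answer:

Answer 0 (score: 0)

The simplest approach is to rebuild the soup after each pass. The error occurs because replacing a text node detaches it from the tree: once a node has been replaced for one word, the inner loop still holds the old, orphaned node, and if the next word also matches it, the second replacement raises the ValueError. Re-parsing the HTML for each word gives every pass fresh text nodes that are still attached to the tree.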

from bs4 import BeautifulSoup
import re

html = """
<html>
<p>love blah blah war blah blah  love blah blah war</p>
<p>love blah blah  blah blah  love blah blah </p>
<p>blah blah love blah blah war blah blah  love blah blah war blah blah</p>
</html>

"""
listOfWords = ["war", "love"]
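for a in listOfWords:
    # Re-parse on every pass so each text node belongs to the current tree.
    soup = BeautifulSoup(html, 'html.parser')
    pattern = r'(\w*)%s\b' % a
    link = r'<a href="https://it.wiktionary.org/wiki/%s">%s</a>' % (a, a)
    for text in soup.find_all(text=True):
        if re.search(pattern, text):
            # Pass flags by keyword: the fourth positional argument of re.sub is count.
            text.replace_with(BeautifulSoup(re.sub(pattern, link, text, flags=re.UNICODE), 'html.parser'))
    # Serialize so the next word is applied on top of the links added so far.
    html = str(soup)
print(soup)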

Output:

<html>
<p><a href="https://it.wiktionary.org/wiki/love">love</a> blah blah <a href="https://it.wiktionary.org/wiki/war">war</a> blah blah  <a href="https://it.wiktionary.org/wiki/love">love</a> blah blah <a href="https://it.wiktionary.org/wiki/war">war</a></p>
<p><a href="https://it.wiktionary.org/wiki/love">love</a> blah blah  blah blah  <a href="https://it.wiktionary.org/wiki/love">love</a> blah blah </p>
<p>blah blah <a href="https://it.wiktionary.org/wiki/love">love</a> blah blah <a href="https://it.wiktionary.org/wiki/war">war</a> blah blah  <a href="https://it.wiktionary.org/wiki/love">love</a> blah blah <a href="https://it.wiktionary.org/wiki/war">war</a> blah blah</p>
</html>
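
As an alternative to re-parsing once per word, the whole list can probably be handled in a single pass by combining the words into one regular expression, so each text node is replaced only once. The following is a minimal sketch, not the method used above: the function name link_words and the base_url parameter are hypothetical, and it assumes bs4 4.4.0 or later (where find_all accepts string=True, the newer name for text=True). Unlike the (\w*)word\b pattern in the question, it deliberately matches whole words only, so "postwar" is left untouched, and it builds the link tags directly instead of re-parsing the substituted text.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def link_words(html, words, base_url="https://it.wiktionary.org/wiki/"):
    """Wrap every whole-word occurrence of the given words in a link (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # One alternation for all words; re.escape guards against regex metacharacters.
    pattern = re.compile(r"\b(%s)\b" % "|".join(re.escape(w) for w in words))

    for text in soup.find_all(string=True):
        # Only touch plain text: skip comments, doctypes and similar special strings.
        if type(text) is not NavigableString:
            continue
        # Do not nest links inside text that is already part of a link.
        if text.find_parent("a") is not None:
            continue
        # With one capturing group, split() alternates: plain text, match, plain text, ...
        parts = pattern.split(text)
        if len(parts) == 1:
            continue
        for i, part in enumerate(parts):
            if i % 2 == 1:
                tag = soup.new_tag("a", href=base_url + part)
                tag.string = part
                text.insert_before(tag)
            elif part:
                text.insert_before(NavigableString(part))
        # The original node has been fully rebuilt in front of itself; drop it.
        text.extract()
    return str(soup)


if __name__ == "__main__":
    print(link_words("<p>love blah blah war</p>", ["war", "love"]))
```

Because every text node is visited once and removed only after its replacements have been inserted, no node is replaced after it has left the tree, which is the situation that triggers the ValueError in the question.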