Question

尝试实现以下逻辑：

如果文字中的网址被段落标记包围（示例：<p>URL</p>），请将其替换为相应的链接：<a href="URL">Click Here</a>

原始文件是数据库转储（sql，UTF-8）。某些网址已经以所需的格式存在。我需要修复丢失的链接。

我正在使用一个使用Beautifulsoup的脚本。如果其他解决方案更有意义（正则表达式等），我愿意接受建议。

Answer 1

您可以搜索文本以p开头的所有http元素。然后，replace it with一个链接：

for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    elm.replace_with(soup.new_tag("a", href=elm.get_text()))

工作代码示例：

from bs4 import BeautifulSoup

data = """
<div>
    <p>http://google.com</p>
    <p>https://stackoverflow.com</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    elm.replace_with(soup.new_tag("a", href=elm.get_text()))

print(soup.prettify())

打印：

<div>
  <a href="http://google.com"></a>
  <a href="https://stackoverflow.com"></a>
</div>

我可以想象这种方法会破裂，但对你来说这应该是一个好的开始。

如果您还想在链接中添加文本，请设置.string属性：

soup = BeautifulSoup(data, "html.parser")
for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    a = soup.new_tag("a", href=elm.get_text())
    a.string = "link"
    elm.replace_with(a)

锚点（<a href="URL">URL</a>）而不是文本（<p> URL </p>）

1 个答案: