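I am trying to find and replace terms (with links) in an HTML file, but I want to preserve the rest of the HTML structure.
First I tried to find the tags using string, but because of child tags that string did not contain all of the text, so replacing it with the modified string removed all child tags. Then I tried the get_text() method, but it has the same problem when it comes to replacement. Finally, I used the __str__() method on each paragraph to get all of its HTML content, and replaced the paragraph with a new BeautifulSoup object (so that all the tags are kept):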
import os
import re

from bs4 import BeautifulSoup


def Exclude_paragraph(cls_name):
    # Keep paragraphs with no class, or with a class outside the excluded set.
    return cls_name is None or cls_name not in ("excluded1", "excluded2")


def Replace_by_ref(m, term):
    # Wrap the matched text in a link to the glossary entry ("#" is added here).
    return "<a href='#" + term["anchor"] + "'>" + m.group(0) + "</a>"


terms = [{"line": "special configurable device", "anchor": "term_1"},
         {"line": "analytical performance", "anchor": "term_2"},
         {"line": "instructions for use", "anchor": "term_4"},
         {"line": "calibrator", "anchor": "term_3"},
         {"line": "label", "anchor": "term_6"},
         {"line": "kit", "anchor": "term_5"}]
# There are almost 100 terms searched in thousands of lines

with open(os.path.join("HTML", "test2.html"), "r", encoding="utf-8") as file:
    html = file.read()

html_bs = BeautifulSoup(html, "html.parser")
for term in terms:
    # Match the term as a whole word, with an optional plural "s".
    regex = r"\b" + term["line"] + r"s?\b"
    regex = re.compile(regex, re.IGNORECASE)
    body_txts = html_bs.body.find_all("p", class_=Exclude_paragraph)
    for paragraph in body_txts:
        # Full markup of the paragraph, including its child tags.
        body_tag_html = paragraph.__str__()
        new_tag = regex.sub(lambda m: Replace_by_ref(m, term), body_tag_html)
        if new_tag != body_tag_html:
            print("\nFound:", term["line"])
            print("String:", paragraph.string)
            print("Get_text():", paragraph.get_text())
            print("Replacement:", new_tag)
            paragraph.replace_with(BeautifulSoup(new_tag, "html.parser"))
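Finally, the modified HTML file is saved (not included here). But what about terms that contain HTML tags, for example
<i>special</i> configurable device
(or others)? For a start, my regex does not find these at all, let alone replace them. Any ideas?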
Edit: added a short sample of the HTML:
<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class='excluded1'>No terms here.</p>
<h2>Glossary</h2>
<dl>
<dt id="term_2">analytical performance</dt><dd>...</dd>
<dt id="term_3">calibrator</dt><dd>...</dd>
<dt id="term_4">instructions for use</dt><dd>...</dd>
<dt id="term_5">kit</dt><dd>...</dd>
<dt id="term_6">label</dt><dd>...</dd>
<dt id="term_1">special configurable device</dt><dd>...</dd>
</dl>
</body>
</html>
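The original HTML is much longer and contains thousands of terms in the text. I have already created the IDs for the glossary, and now I am trying to cross-reference them.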
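The behaviour described in the question can be reproduced with a minimal sketch. The one-paragraph input and the variable names below are made up for illustration; only `.string`, `get_text()` and `str()` come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical one-paragraph input, modelled on the sample HTML.
html = "<p><i>special</i> configurable device, kit.</p>"
p = BeautifulSoup(html, "html.parser").p

# .string is None here because <p> has more than one child node.
string_value = p.string
# get_text() joins all descendant text but drops the tags.
text_value = p.get_text()
# str() keeps the markup, so "</i>" sits in the middle of the term.
markup_value = str(p)

print(repr(string_value))
print(repr(text_value))
print(repr(markup_value))
```

The term is present in the flattened text but not as a contiguous substring of the markup, which is why a plain regex over the markup misses it.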
Answer 0 (score: 0)
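This should give you what you need. Loop over your terms list and, for each entry, look in the HTML for the tag whose id= matches the entry's "anchor" value. Then replace that tag with the desired link.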
from bs4 import BeautifulSoup
html = """
<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class='excluded1'>No terms here.</p>
<h2>Glossary</h2>
<dl>
<dt id="term_2">analytical performance</dt><dd>...</dd>
<dt id="term_3">calibrator</dt><dd>...</dd>
<dt id="term_4">instructions for use</dt><dd>...</dd>
<dt id="term_5">kit</dt><dd>...</dd>
<dt id="term_6">label</dt><dd>...</dd>
<dt id="term_1">special configurable device</dt><dd>...</dd>
</dl>
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
terms = [{"line": "special configurable device", "anchor": "term_1"},
{"line": "analytical performance", "anchor": "term_2"},
{"line": "instructions for use", "anchor": "term_4"},
{"line": "calibrator", "anchor": "term_3"},
{"line": "label", "anchor": "term_6"},
{"line": "kit", "anchor": "term_5"}]
for t in terms:
    # Identify the <dt> tag you want to replace.
    anchor = t["anchor"]
    original_tag = soup.find("dt", id=anchor)
    # Get rid of the <dd> tag that follows it.
    original_tag.find_next("dd").decompose()
    # Generate the new tag as a BS object
    new_tag = soup.new_tag("a", href=anchor)
    new_tag.string = t["line"]
    # Do the replacement (replace_with is the current name of replaceWith)
    original_tag.replace_with(new_tag)
print(soup)
The output is:
<html><head></head>
<body><h1>Test document</h1>
<p><i>special</i> configurable device, analytical performance, calibrator, instructions for use, kit, label.</p>
<p class="excluded1">No terms here.</p>
<h2>Glossary</h2>
<dl>
<a href="term_2">analytical performance</a>
<a href="term_3">calibrator</a>
<a href="term_4">instructions for use</a>
<a href="term_5">kit</a>
<a href="term_6">label</a>
<a href="term_1">special configurable device</a>
</dl>
</body>
</html>
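For the case raised in the question, a term interrupted by inline tags such as `<i>special</i> configurable device`, one possible direction is a regex that tolerates tags between the words of a term. The following is only a sketch under assumptions, not a complete solution: `is_balanced`, `term_regex` and `link_term` are hypothetical helpers, the input is assumed to be the inner HTML of one paragraph (for example from `paragraph.decode_contents()`), and void elements such as `<br>` inside a term are not handled.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^>]*>")


def is_balanced(fragment):
    """True if every tag opened in `fragment` is closed in it, in order."""
    stack = []
    for closing, name in TAG.findall(fragment):
        if not closing:
            stack.append(name.lower())
        elif not stack or stack.pop() != name.lower():
            return False
    return not stack


def term_regex(term):
    """Regex for `term` that allows whitespace and/or tags between its words."""
    gap = r"(?:\s|<[^>]+>)+"
    core = gap.join(re.escape(word) for word in term.split())
    return re.compile(
        r"(?P<lead>(?:<[^/>][^>]*>)*)"     # opening tags right before the term
        r"(?P<core>\b" + core + r"s?\b)"   # the term, with optional plural "s"
        r"(?![^<>]*>)"                     # skip matches inside a tag's attributes
        r"(?P<trail>(?:</[^>]+>)*)",       # closing tags right after the term
        re.IGNORECASE,
    )


def link_term(inner_html, term, anchor):
    """Wrap each occurrence of `term` in a link, keeping tags well nested."""
    def wrap(m):
        lead, core, trail = m.group("lead", "core", "trail")
        # Try the smallest span first, then widen it until the tags balance.
        for before, body, after in (
            (lead, core, trail),
            ("", lead + core, trail),
            (lead, core + trail, ""),
            ("", lead + core + trail, ""),
        ):
            if is_balanced(body):
                return f'{before}<a href="#{anchor}">{body}</a>{after}'
        return m.group(0)  # no balanced span found: leave the text untouched

    return term_regex(term).sub(wrap, inner_html)


sample = "<i>special</i> configurable device, kit."
linked = link_term(sample, "special configurable device", "term_1")
print(linked)
# <a href="#term_1"><i>special</i> configurable device</a>, kit.
```

Running the terms longest first would reduce the chance that a short term such as "label" is linked again inside text that already belongs to a longer linked term. This sketch does not guard against nested links, and a literal ">" character in the text after a term would suppress that match.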