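I am trying to match a string in an HTML document and highlight it. I am using BeautifulSoup with html.parser.
What I have tried so far is calling find_all() with the string to match, but that does not help, because it returns the entire text of the element that contains the match.
I would like some guidance on how to locate a specific string in the document and highlight it.
For example, markup: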
<p>Lorem is simply dummy text of the printing and typesetting industry.</p>
<p>Lorem has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it
After highlighting, markup:
<p><mark>Lorem</mark> is simply dummy text of the printing and typesetting industry.</p>
<p><mark>Lorem</mark> has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it
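Expected output:
Lorem is simply dummy text of the printing and typesetting industry.
Lorem has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it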
If I could get an array of the matched strings, I could replace each of them with a mark tag.
What I have so far with BeautifulSoup:
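import re
import urllib.request

from bs4 import BeautifulSoup

# Download the page and parse it with the built-in parser.
sauce = urllib.request.urlopen('http://courseweb.stthomas.edu/mjodonnell/cojo258/resume/simple_code.html').read()
soup = BeautifulSoup(sauce, 'html.parser')
body = soup.find('body')
# Returns every text node that contains a match, not just the matched substring.
# (`string=` is the current name of the older `text=` argument.)
results = body.find_all(string=re.compile(r'bastyr', re.I))
print(results)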
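To make the behavior described in the question concrete, here is a minimal sketch. The inline HTML and the variable names are made up for illustration and are not taken from the page above. It shows that find_all() with a compiled pattern hands back whole text nodes, and that running the same pattern over each node is one way to get just the matched strings.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for this illustration.
html = "<body><p>Lorem is simply dummy text.</p><p>No match here.</p></body>"
soup = BeautifulSoup(html, 'html.parser')
pattern = re.compile(r'lorem', re.I)

# find_all(string=...) yields whole text nodes, not just the matched part.
nodes = soup.find_all(string=pattern)

# Running the same pattern over each node recovers only the matched substrings.
matches = [m for node in nodes for m in pattern.findall(node)]

print(nodes)
print(matches)
```

The list of matches alone does not say where in the tree each match lives, so highlighting still has to edit the text nodes themselves.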
Answer 0 (score: 0)
Maybe you could try something like this:
from bs4 import BeautifulSoup as bs

soup = bs("<p>Lorem is simply dummy text of the printing and typesetting industry.</p> ", 'lxml')
# This is the word we want to put a tag around
special_word = 'Lorem'
content_orig = soup.p.text
split_content_orig = content_orig.split(special_word)
# Empty the paragraph, then rebuild it piece by piece
soup.p.string = ''
soup.p.insert(len(soup.p), split_content_orig[0])
for i_word in split_content_orig[1:]:
    # Create a new tag on every iteration: a tag can sit in only one place in the tree,
    # so inserting the same object again would move it instead of copying it
    new_tag = soup.new_tag('mark')
    new_tag.string = special_word
    soup.p.insert(len(soup.p), new_tag)
    soup.p.insert(len(soup.p), i_word)
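I ran into a similar problem and asked about it here:
Replace text with bold version in Beautiful Soup
Maybe someone else will answer it with a better solution. In the meantime, I guess you can use this.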
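The snippet in this answer handles a single <p>. As a rough sketch of how the same split-and-rebuild idea might be applied to every text node in a document, assuming a helper named mark_word (a hypothetical name, not part of BeautifulSoup) and html.parser instead of lxml:

```python
from bs4 import BeautifulSoup, NavigableString


def mark_word(html, word):
    """Wrap every occurrence of `word` found in text nodes in a <mark> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    # Snapshot the text nodes first, because the loop below edits the tree.
    for node in list(soup.find_all(string=True)):
        if word not in node:
            continue
        pieces = node.split(word)
        for i, piece in enumerate(pieces):
            if i > 0:
                # A fresh tag per occurrence: an element can sit in only
                # one place in the tree.
                mark = soup.new_tag('mark')
                mark.string = word
                node.insert_before(mark)
            if piece:
                node.insert_before(NavigableString(piece))
        # Drop the original text node, which has now been rebuilt in front of itself.
        node.extract()
    return str(soup)


print(mark_word('<p>Lorem is simply dummy text. <a href="/Lorem">Lorem</a></p>', 'Lorem'))
```

Because only text nodes are touched, the href attribute is left alone. This sketch does not skip comments or the contents of script and style elements, and the match is case-sensitive.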
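Answer 1 (score: 0)
If you are working with more complex HTML, you probably do not want to replace the text everywhere in the HTML. That could break links, images, styles, and so on.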
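As a small illustration of that risk, here is what a plain string replacement over the raw markup does. The sample markup is invented for this example.

```python
# Hypothetical markup where the search word also appears inside an attribute.
html = '<p><a href="/docs/Lorem.html">Lorem</a> is dummy text.</p>'

# Naive replacement over the raw markup also rewrites the attribute value.
naive = html.replace('Lorem', '<mark>Lorem</mark>')
print(naive)
```

The href now contains a <mark> tag, so the link no longer points where it did.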
You can replace only the text instances:
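import re

from bs4 import BeautifulSoup


def highlight_html(html, re_highlighter):
    soup = BeautifulSoup(html, 'html.parser')
    # Copy the text nodes into a list first: replacing nodes while iterating
    # soup.strings can cut the iteration short
    for text_node in list(soup.strings):
        # \g<0> inserts the whole match, so the pattern needs no capture group
        highlighted = re_highlighter.sub(r"<mark>\g<0></mark>", text_node)
        if highlighted != text_node:
            highlighted_soup = BeautifulSoup(highlighted, 'html.parser')
            text_node.replace_with(highlighted_soup)
    return str(soup)


# create your re rule as needed...
re_highlighter = re.compile(r"Lorem...", flags=re.IGNORECASE)
# `html` is the markup string you want to process
highlighted_html = highlight_html(html, re_highlighter)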
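One possible variation, sketched here under a few assumptions, builds the replacement nodes directly instead of parsing the substituted string again. The function name highlight_text_nodes and the SKIPPED_PARENTS set are hypothetical names chosen for this example. The sketch also skips comments and the contents of script and style elements, which is a design choice and not something the answer above does.

```python
import re

from bs4 import BeautifulSoup, Comment, NavigableString

# Assumption: text inside these elements should never be highlighted.
SKIPPED_PARENTS = {'script', 'style'}


def highlight_text_nodes(html, pattern):
    soup = BeautifulSoup(html, 'html.parser')
    # Snapshot the text nodes, because the loop below edits the tree.
    for node in list(soup.find_all(string=True)):
        if isinstance(node, Comment) or node.parent.name in SKIPPED_PARENTS:
            continue
        text = str(node)
        pos = 0
        new_nodes = []
        for m in pattern.finditer(text):
            if m.start() == m.end():
                continue  # ignore empty matches
            if m.start() > pos:
                new_nodes.append(NavigableString(text[pos:m.start()]))
            mark = soup.new_tag('mark')
            mark.string = m.group(0)  # keep the casing found in the document
            new_nodes.append(mark)
            pos = m.end()
        if not new_nodes:
            continue
        if pos < len(text):
            new_nodes.append(NavigableString(text[pos:]))
        for new_node in new_nodes:
            node.insert_before(new_node)
        node.extract()
    return str(soup)


sample = '<p>lorem &lt;b&gt; Lorem</p><script>var Lorem = 1;</script>'
print(highlight_text_nodes(sample, re.compile(r'lorem', re.IGNORECASE)))
```

Since the text is never parsed a second time, characters such as an escaped < in the original text stay plain text and are not reinterpreted as markup.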