Match a string in an HTML document with BeautifulSoup and highlight it wherever it appears

Time: 2018-02-06 08:15:07

Tags: python python-3.x beautifulsoup html-parsing

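I am trying to match a string in an HTML document and highlight it specifically. I am using BeautifulSoup with html.parser.

So far I have tried calling find_all() with the string to match, but that did not help, because it returns the entire text of each element that contains the string.

I would like guidance on how to locate a specific string in the document and highlight it.

For example, the markup: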

 <p>Lorem  is simply dummy text of the printing and typesetting industry.</p> 
 <p>Lorem has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it

After highlighting, the markup:

 <p><mark>Lorem</mark> is simply dummy text of the printing and typesetting industry.</p> 
 <p><mark>Lorem</mark> has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it

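Expected output:

Lorem is simply dummy text of the printing and typesetting industry.

Lorem has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it

If I could get an array of the matched strings, I could replace each one with a mark tag.

What I have so far with BeautifulSoup: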

 import urllib.request
 import re
 from bs4 import BeautifulSoup
 sauce = urllib.request.urlopen('http://courseweb.stthomas.edu/mjodonnell/cojo258/resume/simple_code.html').read()
 soup = BeautifulSoup(sauce, 'html.parser')


 body = soup.find('body')

 results = body.find_all(text=re.compile(r'bastyr', re.I))

 print(results)
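
A note on why this attempt returns whole text rather than just the match: with a `string` (or the older `text`) filter, `find_all()` returns the complete text node that contains a match. The sketch below uses a made-up inline snippet instead of the URL above, and its variable names are illustrative only. It shows that behaviour, plus one possible way to recover the match offsets with `re.finditer`.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup (assumption), standing in for the downloaded page
html = "<p>Lorem is simply dummy text.</p><p>Nothing to match here.</p>"
soup = BeautifulSoup(html, "html.parser")
pattern = re.compile(r"lorem", re.I)

# Each result is a NavigableString: the full text node, not just the matched word
nodes = soup.find_all(string=pattern)

# The position of every match inside a node has to be computed separately
offsets = [[m.span() for m in pattern.finditer(node)] for node in nodes]

for node, spans in zip(nodes, offsets):
    print(repr(str(node)), spans)
```

So the list returned by `find_all()` tells you where to look, but the highlighting itself still has to split each text node around the matches.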

2 answers:

Answer 0 (score: 0)

Maybe you can try something like this:

from bs4 import BeautifulSoup as bs

# The 'lxml' parser needs the lxml package installed; 'html.parser' also works here
soup = bs("<p>Lorem  is simply dummy text of the printing and typesetting industry.</p> ", 'lxml')

# This is the word we want to put a tag around
special_word = 'Lorem'
content_orig = soup.p.text
split_content_orig = content_orig.split(special_word)

# Empty the paragraph, then rebuild it piece by piece
soup.p.clear()
soup.p.append(split_content_orig[0])

for i_word in split_content_orig[1:]:
    # Create a new tag on every iteration: a Tag object can sit in only one
    # place in the tree, so re-inserting the same object just moves it
    new_tag = soup.new_tag('mark')
    new_tag.string = special_word
    soup.p.append(new_tag)
    soup.p.append(i_word)

I ran into a similar problem and asked my own question about it here:

Replace text with bold version in Beautiful Soup

Maybe someone else will answer it with a better solution. In the meantime, I guess you can use this.
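
The split-and-rebuild idea in this answer handles a single `<p>` and reads `soup.p.text`, which flattens any nested tags. As a rough sketch only, it could be wrapped in a function that visits every paragraph and skips those containing child tags. The function name `mark_word_in_paragraphs` and the sample markup are hypothetical and not part of the answer.

```python
from bs4 import BeautifulSoup, NavigableString


def mark_word_in_paragraphs(html, word):
    """Wrap each occurrence of `word` in <mark> inside simple <p> elements."""
    soup = BeautifulSoup(html, "html.parser")
    for p in soup.find_all("p"):
        # Only rebuild paragraphs made of one plain text node, so nested
        # tags such as <b> or <a> are never flattened or lost
        if len(p.contents) != 1 or not isinstance(p.contents[0], NavigableString):
            continue
        text = str(p.contents[0])
        if word not in text:
            continue
        parts = text.split(word)
        p.clear()
        p.append(parts[0])
        for tail in parts[1:]:
            mark = soup.new_tag("mark")  # a fresh tag for each occurrence
            mark.string = word
            p.append(mark)
            p.append(tail)
    return str(soup)


sample = "<p>Lorem is text.</p><p>Lorem has <b>been</b>.</p>"
print(mark_word_in_paragraphs(sample, "Lorem"))
```

Paragraphs with nested markup are deliberately left untouched here. Handling them needs a per-text-node approach.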

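Answer 1 (score: 0)

If you are working with more complex HTML, you probably don't want to replace the text everywhere in the HTML. Doing so could break links, images, styles, and so on.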
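
To make the risk concrete, here is a small illustration with a made-up anchor element (an assumption, not taken from the question). A plain string replacement over the raw markup also rewrites the attribute value, whereas the text nodes BeautifulSoup exposes do not include attributes.

```python
from bs4 import BeautifulSoup

# Made-up markup (assumption) where the word also appears in an attribute
html = '<a href="/intro.html" title="Lorem">About Lorem</a>'

# Naive approach: plain string replacement over the raw markup
naive = html.replace("Lorem", "<mark>Lorem</mark>")
# The title attribute now contains a <mark> tag, which corrupts the element
print(naive)

# Text nodes only: attribute values are not among them
soup = BeautifulSoup(html, "html.parser")
text_nodes = [str(s) for s in soup.find_all(string=True)]
print(text_nodes)
```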

Instead, you can replace matches in the text nodes only:


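import re
from bs4 import BeautifulSoup

def highlight_html(html, re_highlighter):
    soup = BeautifulSoup(html, 'html.parser')
    # Snapshot the text nodes first: replacing nodes while iterating
    # soup.strings can cut the traversal short
    for text_node in list(soup.find_all(string=True)):
        # \g<0> is the whole match, so the pattern needs no capture group
        highlighted = re_highlighter.sub(r"<mark>\g<0></mark>", text_node)
        if highlighted != text_node:
            highlighted_soup = BeautifulSoup(highlighted, 'html.parser')
            text_node.replace_with(highlighted_soup)
    return str(soup)

# create your re rule as needed...
re_highlighter = re.compile(r"Lorem...", flags=re.IGNORECASE)
# html is assumed to be a string holding the document's markup
highlighted_html = highlight_html(html, re_highlighter)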
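
One caveat with re-parsing the substituted text: a text node that contains characters such as `<` is turned back into markup when it is parsed again. A possible variant, sketched below under my own assumptions, builds the `<mark>` tags with `soup.new_tag()` and never re-parses text. The name `highlight_nodes`, the skipped-parent set, and the sample input are hypothetical.

```python
import re
from bs4 import BeautifulSoup, Comment

# Assumption: text inside these elements should never be highlighted
SKIPPED_PARENTS = {"script", "style", "mark"}


def highlight_nodes(html, pattern):
    """Wrap every match of the compiled `pattern` in <mark>, text nodes only."""
    soup = BeautifulSoup(html, "html.parser")
    # Snapshot the matching text nodes before modifying the tree
    for node in list(soup.find_all(string=pattern)):
        if isinstance(node, Comment) or node.parent.name in SKIPPED_PARENTS:
            continue
        text = str(node)
        pieces, last = [], 0
        for m in pattern.finditer(text):
            if m.end() == m.start():
                continue  # ignore empty matches
            if m.start() > last:
                pieces.append(text[last:m.start()])
            mark = soup.new_tag("mark")
            mark.string = m.group(0)  # keeps the original casing
            pieces.append(mark)
            last = m.end()
        if last < len(text):
            pieces.append(text[last:])
        # Put the pieces in front of the old node, then drop the old node
        for piece in pieces:
            node.insert_before(piece)
        node.extract()
    return str(soup)


sample = '<p title="Lorem">Lorem: a &lt; b</p>'
print(highlight_nodes(sample, re.compile(r"lorem", re.I)))
```

Because the pieces are inserted as strings and tags rather than parsed, escaped characters in the text stay escaped in the output, and attribute values are left alone.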