Question

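I have multiple strings that I want to wrap in HTML tags inside an HTML document. I want to leave the text itself unchanged, but replace each string with an HTML element that contains that string.

In addition, some of the strings I want to replace contain other strings I want to replace. In those cases I want to apply the replacement for the larger string and ignore the replacement for the smaller one.

I also only want to perform the replacement when the string is fully contained within a single element.

Here is my replacement list.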

replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]

Given the following HTML:

<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>

I want to end up with this:

<html>
<body>
<p>Paragraph contains <span title="foo" class="customclass34">foo</span></p>
<p>Paragraph contains <span id="id21" class="customclass79">foo bar</span></p>
</body>
</html>

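So far I have tried using the Beautiful Soup library and looping through my replacement list in order of decreasing string length. I can find my strings and replace them with other strings, but I cannot work out how to insert HTML at those points. Perhaps there is a better approach altogether. Trying to do the string replacement with a soup.new_tag object fails, whether or not I convert it to a string.

Edit: I realized the example I gave did not even follow my own rules, so I have modified it.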

Answer 1

I think this is very close to what you are looking for. You can use soup.find_all(string=True) to get only the NavigableString elements and then do the replacement on those.

from bs4 import BeautifulSoup
html="""
<html>
<body>
<p>Paragraph contains foo</p>
<p>Paragraph contains foo bar</p>
</body>
</html>
"""
replacement_list = [
    ('foo', '<span title="foo" class="customclass34">foo</span>'),
    ('foo bar', '<span id="id21" class="customclass79">foo bar</span>')
]
soup = BeautifulSoup(html, 'html.parser')
for s in soup.find_all(string=True):
    # Reversed so that longer keys are tried first
    # (assumes the list is in ascending order of length)
    for key, val in replacement_list[::-1]:
        if key in s:
            new_s = s.replace(key, val)
            # Re-parse the substituted text; restrict yourself to this built-in parser
            s.replace_with(BeautifulSoup(new_s, 'html.parser'))
            break  # stop at the first (longest) match
print(soup)

# Generate a new valid soup that treats span as a separate tag, if you want
soup = BeautifulSoup(str(soup), 'html.parser')
print(soup.find_all('span'))

Output:

<html>
<body>
<p>Paragraph contains <span class="customclass34" title="foo">foo</span></p>
<p>Paragraph contains <span class="customclass79" id="id21">foo bar</span></p>
</body>
</html>

[<span class="customclass34" title="foo">foo</span>, <span class="customclass79" id="id21">foo bar</span>]
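The "same element" requirement from the question is met here because find_all(string=True) yields one text node per run of text, so a phrase split across tags never appears in a single node. A small check, using a made-up snippet that is not part of the original question:

```python
from bs4 import BeautifulSoup

# Hypothetical input: the first "foo bar" is split across <p> and <b>,
# the second sits entirely inside one <p>.
html = '<p>foo <b>bar</b></p><p>foo bar</p>'
soup = BeautifulSoup(html, 'html.parser')

# Each NavigableString is a separate text node
strings = [str(s) for s in soup.find_all(string=True)]
print(strings)   # ['foo ', 'bar', 'foo bar']

# Only the node that fully contains the phrase can match
matches = [s for s in strings if 'foo bar' in s]
print(matches)   # ['foo bar']
```

Note that find_all(string=True) also returns comments and the contents of script or style elements, so a real document may need an extra filter on the node's parent.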

Answer 2

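I have found a solution to this.

I have to traverse the HTML once for each different string I want to wrap in HTML tags. This seems inefficient, but I could not find a better way.

I have added a class to every tag I insert, and I use it to check whether the string I am about to replace is part of a larger string that has already been replaced.

This solution is also case-insensitive (it will wrap tags around the string 'fOo') while preserving the case of the original text.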

def html_update(input_html):
    from bs4 import BeautifulSoup
    import re

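    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(input_html, 'html.parser')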

    replacement_list = [
        ('foo', '<span title="foo" class="customclass34 replace">', '</span>'),
        ('foo bar', '<span id="id21" class="customclass79 replace">', '</span>')
    ]
    # Go through list in order of decreasing length
    replacement_list = sorted(replacement_list, key = lambda k: -len(k[0]))

    for item in replacement_list:
        # Escape the search string so regex metacharacters in it are matched literally
        replace_regex = re.compile(re.escape(item[0]), re.IGNORECASE)
        target = soup.find_all(string=replace_regex)
        for v in target:
            # You can use other conditions here, like (v.parent.name == 'a')
            # to not wrap the tags around strings within links
            if v.parent.has_attr('class') and 'replace' in v.parent['class']:
                # The match must be part of a large string that was already replaced, so do nothing
                continue 

            def replace(match):
                return '{0}{1}{2}'.format(item[1], match.group(0), item[2])

            new_v = replace_regex.sub(replace, v)
            v.replace_with(BeautifulSoup(new_v, 'html.parser'))
    return str(soup)
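The question mentions that substituting soup.new_tag objects directly did not work. One possible way around that, sketched below, is to split each text node at the matches and insert the pieces (plain strings and new tags) in its place, rather than re-parsing a substituted string. The span_attrs dictionary and the function name wrap_with_new_tag are illustrative assumptions, not part of the original answer. A single alternation with the longest key first also avoids one pass per search string.

```python
import re
from bs4 import BeautifulSoup

# Assumed spec format for this sketch: search string -> attributes of the wrapping <span>
span_attrs = {
    'foo': {'title': 'foo', 'class': 'customclass34'},
    'foo bar': {'id': 'id21', 'class': 'customclass79'},
}

def wrap_with_new_tag(html, specs):
    # Longest keys first, one capture group per key, so "foo bar" wins over "foo"
    keys = sorted(specs, key=len, reverse=True)
    pattern = re.compile('|'.join('(%s)' % re.escape(k) for k in keys),
                         re.IGNORECASE)
    soup = BeautifulSoup(html, 'html.parser')

    # find_all returns a list up front, so editing the tree while looping is safe
    for node in soup.find_all(string=pattern):
        text = str(node)
        pieces, pos = [], 0
        for m in pattern.finditer(text):
            if m.start() > pos:
                pieces.append(text[pos:m.start()])   # text before the match
            key = keys[m.lastindex - 1]              # which key matched
            tag = soup.new_tag('span', attrs=specs[key])
            tag.string = m.group(0)                  # keep the original casing
            pieces.append(tag)
            pos = m.end()
        if pos < len(text):
            pieces.append(text[pos:])                # trailing text
        for piece in pieces:
            node.insert_before(piece)
        node.extract()                               # drop the original text node
    return str(soup)

result = wrap_with_new_tag('<p>Say fOo, then foo bar.</p>', span_attrs)
print(result)
```

Because the text is never re-parsed, characters such as < or & in the original text nodes stay ordinary text. As with the answer above, strings inside script, style, or comment nodes would also be matched unless they are filtered out.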

Answer 3

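Since you are dealing with a small file, the simplest approach is to read the file line by line, replace what needs replacing in each line, and write everything to a new file.

Assuming your input file is named test.html and the result is written to output.html: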

replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}

with open('output.html', 'w') as dest:
    with open('test.html', 'r') as src:
        for line in src:  # read the source file line by line
            str_possible = []
            for string in replacement_list:  # loop over all the strings you are looking for
                if string in line:  # check whether this string is in the line
                    str_possible.append(string)
            if str_possible:
                # take the appropriate one, which is the longest
                str_final = max(str_possible, key=len)
                line = line.replace(str_final, replacement_list[str_final])

            dest.write(line)

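I also suggest you look into dictionaries in Python, which is the type of object I used for replacement_list.

Finally, this code works as long as at most one of the strings appears on a line. If two appear, it needs some adjustment, but this should give you the general idea.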
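One possible adjustment for lines that contain more than one of the strings is a single regex alternation with the longest key listed first, so each position in the line is replaced at most once. This is only a sketch: the helper name replace_longest_first is made up for illustration, and it operates on the same replacement_list dictionary as above.

```python
import re

replacement_list = {
    'foo': '<span title="foo" class="customclass34">foo</span>',
    'foo bar': '<span id="id21" class="customclass79">foo bar</span>',
}

def replace_longest_first(line, replacements):
    # Longest keys first so "foo bar" is preferred over "foo" at the same position
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # re.sub scans the input once, so inserted markup is never rescanned
    return pattern.sub(lambda m: replacements[m.group(0)], line)

print(replace_longest_first('foo and foo bar', replacement_list))
```

In the loop above, this would stand in for the str_possible bookkeeping and the line.replace call. It is still a plain-text substitution, so a search string that happens to occur inside a tag name or attribute value in the source file would be replaced as well.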

Replace multiple strings with elements in an HTML document