Question

I am using Beautiful Soup to replace text.

Here is a sample of my code:

for str in soup.find('body').find_all(string=True):
   fix_str = re.sub(...)
   str.replace_with(fix_str)

How can I skip the 'script' tag and the comment tag '<!-- -->'?

How can I determine which element or tag each str is in?

Thanks in advance.

Answer 1

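If you look at the parent of each text item, you can tell whether it sits inside a <script> tag, and checking the item's type tells you whether it is an HTML comment. If it is neither, you can pass the text through your re.sub() call and hand the result to replace_with():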

from bs4 import BeautifulSoup, Comment

html = """<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>

<body>
Some text 1
<!-- a comment -->
<!-- a comment -->
Some text 2
<!-- a comment -->
<script>a script</script>
Some text 2
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

for text in soup.body.find_all(string=True):
    if text.parent.name != 'script' and not isinstance(text, Comment):
        text.replace_with('new text')   # add re.sub() logic here

print(soup)

This gives you the following new HTML:

<html>
<head>
<!-- a comment -->
<title>A title</title>
<script>a script</script>
</head>
<body>new text<!-- a comment -->new text<!-- a comment -->new text<!-- a comment -->new text<script>a script</script>new text</body>
</html>
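
To connect this back to the original question, here is a minimal sketch of the same loop with an actual re.sub() call in place of the 'new text' placeholder. The pattern, the replacement string, the sample HTML, the helper name replace_text and the SKIP_TAGS set (which also skips <style>) are all assumptions made for illustration, not part of the question. It also records text.parent.name, which is how you can tell which tag each string belongs to.

```python
import re
from bs4 import BeautifulSoup, Comment

# Hypothetical sample document, only for illustration.
html = """<body>
<p>Hello world</p>
<!-- world in a comment -->
<script>var world = 1;</script>
<div><b>world</b> again</div>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: <style> is skipped as well as <script>.
SKIP_TAGS = {"script", "style"}

def replace_text(soup, pattern, repl):
    """Apply re.sub() to every text node in <body>, skipping comments
    and the contents of SKIP_TAGS. Returns the names of the parent tags
    whose text was actually changed."""
    changed = []
    # find_all() returns a list, so replacing nodes while looping is safe.
    for text in soup.body.find_all(string=True):
        if isinstance(text, Comment):
            continue  # HTML comment
        if text.parent.name in SKIP_TAGS:
            continue  # text inside <script> or <style>
        new = re.sub(pattern, repl, text)
        if new != str(text):
            changed.append(text.parent.name)  # the tag this string lives in
            text.replace_with(new)
    return changed

changed_in = replace_text(soup, r"world", "there")
print(changed_in)  # ['p', 'b']
print(soup)
```

The comment and the script body still contain "world" afterwards, while the text inside <p> and <b> has been rewritten.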
