Question

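I'm trying to write some BeautifulSoup code that takes every piece of text surrounded by <strong> tags and changes the tag to <h2>, but only when it sits on a line of its own, with no other written/output text around it.

Is this possible?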

Before

to this:

After

But this would stay unchanged:

Unchanged

I know the following will convert all the strongs. How can I get just the ones that matter?

import BeautifulSoup

if __name__ == "__main__":
    data = """
<html>
<h2 class='someclass'>some title</h2>
<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>

    """
    soup = BeautifulSoup.BeautifulSoup(data)
    h2 = soup.find('strong')
    h2.name = 'h1'
    print soup

Answer 1

You can find all the strong elements and check the length of .parent:

from bs4 import BeautifulSoup

data = """
<html>
<p><strong>Like this</strong></p>
<p>Hello, <strong>world</strong>
</html>
"""

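# No parser is named here, so bs4 uses the best one it finds installed
soup = BeautifulSoup(data)
for strong in soup.find_all('strong'):
    # len(tag) counts the tag's direct children; 1 means <strong> is alone in its parent
    if len(strong.parent) == 1:
        strong.name = 'h1'
print(soup)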

Prints (note that the first strong tag was replaced, the second was not):

<html>
<body>
    <p><h1>Like this</h1></p>
    <p>Hello, <strong>world</strong></p>
</body>
</html>

Or, in a more concise form:

# A function passed as the first argument is called with each Tag in the tree
for strong in soup.find_all(lambda tag: tag.name == 'strong' and len(tag.parent) == 1):
    strong.name = 'h1'
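
Note that the loops above rename the tag but leave the surrounding <p> in place, and the len(strong.parent) == 1 check stops matching as soon as there is whitespace between the tags. If the goal is the bare <h2> from the question, a rough sketch along these lines might work. The helper name promote_lone_strong, the choice of html.parser, and the sample input are assumptions made for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup, NavigableString


def promote_lone_strong(html, new_name="h2"):
    """Hypothetical helper: <p><strong>x</strong></p> becomes <h2>x</h2>."""
    soup = BeautifulSoup(html, "html.parser")
    for strong in soup.find_all("strong"):
        parent = strong.parent
        if parent is None or parent.name != "p":
            continue
        # Ignore whitespace-only text nodes so indentation does not count as content.
        meaningful = [
            child for child in parent.contents
            if not (isinstance(child, NavigableString) and not child.strip())
        ]
        if len(meaningful) == 1:
            strong.name = new_name
            parent.unwrap()  # replace the <p> with its own contents
    return str(soup)


html = "<p><strong>Like this</strong></p><p>Hello, <strong>world</strong></p>"
print(promote_lone_strong(html))
# prints: <h2>Like this</h2><p>Hello, <strong>world</strong></p>
```

Any whitespace that was inside the removed <p> stays in the output, which is usually harmless in HTML.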

As a side note, you are using BeautifulSoup3, which is no longer maintained; consider upgrading to BeautifulSoup4:

pip install beautifulsoup4
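
After the upgrade, the snippet from the question might look roughly like this. It is only a sketch: the <p><strong> line in the sample data is added here as an assumption (the data in the question contains no strong tag), and html.parser is just one possible parser choice:

```python
from bs4 import BeautifulSoup  # bs4 replaces the old top-level BeautifulSoup module

data = """
<html>
<h2 class='someclass'>some title</h2>
<p><strong>Like this</strong></p>
</html>
"""

soup = BeautifulSoup(data, "html.parser")
strong = soup.find("strong")
if strong is not None:  # find() returns None when nothing matches
    strong.name = "h1"
print(soup)
```

The None check matters because assigning .name on a missing result raises an AttributeError.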

Answer 2

Hmm... this may not be efficient, but it is certainly simpler to write:

data = data.replace('<p><strong>', '<p><h2>')
data = data.replace('</strong></p>', '</h2></p>')

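Or am I misunderstanding the basics of str.replace()?

It isn't sophisticated, but it gets the job done as long as the html is consistent.

Edit: a more sophisticated solution using regular expressions: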

import re

data = re.sub(r'<p>\s*<strong>', '<p><h2>', data, flags=re.IGNORECASE)
data = re.sub(r'</strong>\s*</p>', '</h2></p>', data, flags=re.IGNORECASE)
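
One caveat: because the opening and closing tags are replaced independently, the second substitution also hits the end of a mixed paragraph such as <p>Hello, <strong>world</strong></p> and leaves a stray </h2> behind. A possible way around that is to match the whole paragraph with one pattern. The sketch below assumes the strong element contains plain text only (no nested tags) and, unlike the replacements above, drops the <p> wrapper to match what the question asked for. The names LONE_STRONG and promote_with_regex are made up for illustration:

```python
import re

# One pattern for the whole paragraph, so both tags are replaced together or not at all.
LONE_STRONG = re.compile(
    r"<p>\s*<strong>([^<]*)</strong>\s*</p>",
    flags=re.IGNORECASE,
)


def promote_with_regex(html):
    """Hypothetical helper: rewrite only paragraphs that hold a single <strong>."""
    return LONE_STRONG.sub(r"<h2>\1</h2>", html)


print(promote_with_regex("<p><strong>Like this</strong></p>"))
# prints: <h2>Like this</h2>
print(promote_with_regex("<p>Hello, <strong>world</strong></p>"))
# prints: <p>Hello, <strong>world</strong></p>
```

This still relies on the markup being regular; for anything messier, a parser-based approach is probably the safer choice.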

Replace <strong> tags with h2 tags
