Question

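I'm new to Python and BeautifulSoup. I'm trying to figure out how to match only the <div> elements that have an attribute containing some matching text pattern. For example, every case where 'id' : 'testid', or wherever 'class' : 'title'.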

Here is what I have so far:

def cleanup(filename):
    fh = open(filename, "r")

    soup = BeautifulSoup(fh, 'html.parser')

    for div_tag in soup.find('div', {'class':'title'}):
        h2_tag = soup.h2_tag("h2")
        div_tag.div.replace_with(h2_tag)
        del div_tag['class']

    f = open("/tmp/filename.modified", "w")
    f.write(soup.prettify(formatter="html5"))
    f.close()

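Once I can match all of those specific elements, I can figure out how to manipulate the attributes (delete the class, rename the tag itself from <div> to <h1>, and so on). So I know the actual cleanup part probably doesn't work as written.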

Answer 1

This seems to work well enough, but please let me know if there is a "better" or "more standard" way to do it.

# find_all is the current name; findAll is the legacy alias for the same method
for tag in soup.find_all(attrs={'class':'title'}):
    del tag['class']
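
For comparison, here is a minimal sketch of the same idea restricted to <div> tags, plus a pattern match on the id attribute. The sample HTML and the ^testid pattern are made up for illustration. As far as I know, bs4 applies a compiled regex passed as an attribute filter with a search against the attribute value.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = """
<div class="title">First</div>
<p class="title">Not a div</p>
<div id="testid-1">Second</div>
<div id="other">Third</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Only <div> tags whose class list includes "title".
title_divs = soup.find_all("div", class_="title")

# Only <div> tags whose id matches a regular expression.
pattern_divs = soup.find_all("div", id=re.compile(r"^testid"))

# Drop the class attribute from the matched divs only.
for tag in title_divs:
    del tag["class"]
```

The <p class="title"> is left alone because the tag name is part of the filter.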

Answer 2

.find(tagName, attributes) returns a single element (or None if nothing matches).

.find_all(tagName, attributes) returns multiple elements (a list).
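
A small sketch of that difference, using made-up HTML. It also shows why looping over the result of find() does not do what the question's code expects: iterating over a single tag walks its children, not the matching tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<div class="title">A</div><div class="title">B</div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", {"class": "title"})      # one Tag, or None
every = soup.find_all("div", {"class": "title"})  # list-like ResultSet of Tags

# Iterating over a single Tag yields its child nodes (here, the text "A").
children = list(first)

# No match: find() gives None, find_all() gives an empty result.
missing = soup.find("div", {"class": "missing"})
```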

You can find more information in the BeautifulSoup documentation.

To do the replacement, you need to create the new element with .new_tag(tagName) and remove the attribute with del element.attrs[attributeName]. See the example below:

from bs4 import BeautifulSoup

html = '''
<div id="title" class="testTitle">
  heading h1
</div>
'''
# name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')

print('html before')
print(soup)

div = soup.find('div', id="title")

# delete class attribute
del div.attrs['class']

print('html after remove attribute')
print(soup)

# to replace, create h1 element
h1 = soup.new_tag("h1")
# set text from previous element
h1.string = div.text
# uncomment to set ID
# h1['id'] = div['id']
div.replace_with(h1)

print('html after replace')
print(soup)
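
Putting the pieces together, here is a rough sketch of the cleanup the question describes, applied to every matching <div> rather than just one. The function name retitle and the sample input are hypothetical. Instead of building a new tag, it renames each tag in place by assigning to .name, which keeps the children and any other attributes.

```python
from bs4 import BeautifulSoup

def retitle(html):
    """Rename every <div class="title"> to <h2> and drop its class attribute."""
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", class_="title"):
        div.name = "h2"   # rename in place; contents are untouched
        del div["class"]  # remove the class attribute
    return str(soup)

# Hypothetical input: only the first div carries class="title".
print(retitle('<div class="title">One</div><div>Two</div>'))
```

Use new_tag() plus replace_with() as in the example above when you want to rebuild the element from scratch, and the rename when you just want a different tag name around the same content.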

Python + BeautifulSoup: how to find an HTML tag where one of the attributes contains a matching text pattern?
