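I am trying to replace some elements (class: method) in a long HTML page using .replaceWith. To do this I iterate over .descendants and check whether each dl element is one I am looking for. But this only works for 0 <= X <= 2 elements that sit next to each other. Every 3rd to nth element in a row is "ignored". Running the same code twice results in 4 consecutive replaced dl elements, and so on.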
for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))
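My (very stupid) workaround is to find the largest number of dl elements in a row, divide it by 2 and run the code that many times. I would like a smart (fast) solution for this and, more importantly, to understand what is going wrong here.
EDIT: The HTML page used for testing is this one: https://docs.python.org/3/library/stdtypes.html. The error can be seen in section 4.7.1 String Methods (there are a lot of methods there).
EDIT_2: I am not using that HTML page as a whole, only parts of it. The HTML parts are stored in a list, and I only want the dl elements to be "removed" if they are not the first HTML element of a part (so I want to keep the element if it is the head).
This is what my code actually looks like: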
from bs4 import BeautifulSoup, NavigableString
soup = BeautifulSoup(open("/home/sven/Bachelorarbeit/python-doc-extractor-for-cado/extractor-application/index.html"))
f = open('test.html','w') #needs to exist
f.truncate()
matches=[]
dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']}) # grab all possible dl-elements
sections = soup.find_all(['div'], attrs = {'class':'section'}) #grab all section-elements
matches = dl_elems + sections #merge the lists to get all results
for elem in matches:
    for child in elem.descendants:
        if not isinstance(child, NavigableString) and child.dl is not None and 'method' in child.dl.get('class'):
            text = "<p>***removed something here***</p>"
            child.dl.replaceWith(BeautifulSoup(text))
print(matches,file=f)
f.close()
Answer 0 (score: 1)
The idea is to find all dl elements with class="method" and replace each of them with a p tag:
import urllib2
from bs4 import BeautifulSoup, Tag
# get the html
url = "https://docs.python.org/3/library/stdtypes.html"
soup = BeautifulSoup(urllib2.urlopen(url))
# replace all `dl` elements with `method` class
for elem in soup('dl', class_='method'):
    tag = Tag(name='p')
    tag.string = '***removed something here***'
    elem.replace_with(tag)
print soup.prettify()
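A likely reason the loop in the question misses elements: child.dl only ever returns the first dl below child, and the tree is being modified while .descendants is still walking it. find_all (which is what calling soup(...) does) collects its matches into a list before anything is replaced, so every match is visited exactly once. Below is a minimal Python 3 sketch of the same idea. The inline HTML is a made-up miniature of the docs page, and the use of "html.parser" and soup.new_tag (instead of Tag(name='p')) are my own choices for the illustration, not something taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature document: three adjacent <dl class="method"> siblings,
# the situation in which the original loop skipped elements.
html = """
<div class="section">
  <dl class="method"><dt>str.lower()</dt></dl>
  <dl class="method"><dt>str.upper()</dt></dl>
  <dl class="method"><dt>str.title()</dt></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of matches that is built before any replacement.
targets = soup.find_all("dl", class_="method")

for dl in targets:
    placeholder = soup.new_tag("p")
    placeholder.string = "***removed something here***"
    dl.replace_with(placeholder)

# Count what is left in the tree after the replacements.
remaining = soup.find_all("dl", class_="method")
placeholders = soup.find_all("p")
print(len(remaining), len(placeholders))
```

Because the list of targets is fixed before the loop starts, one pass is enough and there is no need to run the code repeatedly.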
UPD (adapted to the edits in the question):
dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']}) # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'}) #grab all section-elements
for parent in dl_elems + sections:
    for elem in parent.find_all('dl', {'class': 'method'}):
        tag = Tag(name='p')
        tag.string = '***removed something here***'
        elem.replace_with(tag)
print dl_elems + sections
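To illustrate why this keeps the head element: find_all only searches below the tag it is called on, so a part whose first element is itself a dl with class method is left alone while the method blocks nested inside it are replaced. The following is a small Python 3 sketch under stated assumptions. The HTML snippet, the id="head" attribute and the helper name replace_nested_methods are hypothetical and only exist for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical part whose head is itself a <dl class="method">
# and which contains three nested method blocks.
html = """
<dl class="method" id="head">
  <dt>outer()</dt>
  <dd>
    <dl class="method"><dt>inner_a()</dt></dl>
    <dl class="method"><dt>inner_b()</dt></dl>
    <dl class="method"><dt>inner_c()</dt></dl>
  </dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")


def replace_nested_methods(part):
    """Replace every <dl class="method"> below `part`, never `part` itself."""
    count = 0
    # find_all() looks at descendants only, so `part` is not in the results.
    for dl in part.find_all("dl", class_="method"):
        placeholder = soup.new_tag("p")
        placeholder.string = "***removed something here***"
        dl.replace_with(placeholder)
        count += 1
    return count


head = soup.find("dl", id="head")
replaced = replace_nested_methods(head)

# The head is still part of the tree; only the nested blocks are gone.
head_still_present = soup.find("dl", id="head") is head
print(replaced, head_still_present)
```

The same helper could be called once per entry of dl_elems + sections, which is what the loop above does inline.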