Question

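I'm trying to use .replaceWith to replace some elements (class: method) in a long HTML page. To do this I use .descendants and iterate over them to check whether the dl element is the one I'm looking for. But this only works for 0 <= X <= 2 elements that are next to each other. Every 3rd to n-th element in a row is "ignored". Executing the same code twice results in 4 replaced dl elements in a row, and so on.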

for elem in matches:
 for child in elem.descendants:
    if not isinstance(child, NavigableString) and child.dl is not None  and 'method' in child.dl.get('class'):
         text = "<p>***removed something here***</p>"
         child.dl.replaceWith(BeautifulSoup(text))

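The (really stupid) solution would be to find the maximum number of dl elements in a row, divide it by 2, and execute the code that many times. I'd like a smart (fast) solution for this and (even more important) to understand what is going wrong here.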

EDIT: The HTML page used for testing is this one: https://docs.python.org/3/library/stdtypes.html, and the error can be seen in section 4.7.1 String Methods (there are a lot of methods there).

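EDIT_2: But I'm not working with the whole HTML page, only with parts of it. The HTML parts are stored in a list, and I only want the dl elements to be "removed" if they are not the first HTML element of a part (so I want to keep the element if it is the head).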

This is what my code actually looks like:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(open("/home/sven/Bachelorarbeit/python-doc-extractor-for-cado/extractor-application/index.html"))
f = open('test.html','w')    #needs to exist
f.truncate
matches=[]

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements

sections = soup.find_all(['div'], attrs = {'class':'section'})   #grab all section-elements

matches = dl_elems + sections   #merge the lists to get all results

for elem in matches:
  for child in elem.descendants:
      if not isinstance(child, NavigableString) and child.dl is not None  and 'method' in child.dl.get('class'):
           text = "<p>***removed something here***</p>"
           child.dl.replaceWith(BeautifulSoup(text))


print(matches,file=f)
f.close()

Answer 1

The idea is to find all dl elements with class="method" and replace each of them with a p tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Tag

# get the html (Python 3, matching the print(..., file=f) call in the question)
url = "https://docs.python.org/3/library/stdtypes.html"
soup = BeautifulSoup(urlopen(url), "html.parser")

# replace all `dl` elements with `method` class
for elem in soup('dl', class_='method'):
    tag = Tag(name='p')
    tag.string = '***removed something here***'
    elem.replace_with(tag)

print(soup.prettify())
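
For trying the loop above without a network request, here is a minimal sketch on a small inline fragment. The HTML snippet and the variable names are made up for illustration, and `soup.new_tag()` is used here as an alternative way to create the placeholder. It also shows that attribute access such as `section.dl` is shorthand for `section.find('dl')` and returns only the first match, which may be part of why the loop in the question replaced only some of the consecutive elements.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: three consecutive method entries.
html = """
<div class="section">
  <dl class="method"><dt>str.lower()</dt><dd>...</dd></dl>
  <dl class="method"><dt>str.upper()</dt><dd>...</dd></dl>
  <dl class="method"><dt>str.strip()</dt><dd>...</dd></dl>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", class_="section")

# Attribute access is shorthand for find(): it yields only the first <dl>.
first_name = section.dl.dt.string

# find_all() returns a finished list, so replacing nodes inside the loop
# does not disturb the iteration.
for elem in section.find_all("dl", class_="method"):
    placeholder = soup.new_tag("p")
    placeholder.string = "***removed something here***"
    elem.replace_with(placeholder)

remaining = soup.find_all("dl", class_="method")
placeholders = soup.find_all("p")
print(len(remaining), len(placeholders))
```

On this fragment all three entries are replaced in a single pass, so the script prints `0 3`.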

UPD (adapted to the question edit):

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'})   #grab all section-elements

for parent in dl_elems + sections:
    for elem in parent.find_all('dl', {'class': 'method'}):
        tag = Tag(name='p')
        tag.string = '***removed something here***'
        elem.replace_with(tag)

print(dl_elems + sections)
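
As a rough check of the "keep the head element" requirement from EDIT_2, the sketch below runs the same nested loop on a made-up fragment: a `dl` with class `class` that contains two `method` entries. The fragment and the variable names are assumptions for illustration only. Because `find_all()` searches only below the tag it is called on, the outer element of each part is never a candidate for replacement.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: a class entry (to keep) with two nested methods (to replace).
html = """
<dl class="class">
  <dt>class Example</dt>
  <dd>
    <dl class="method"><dt>first()</dt></dl>
    <dl class="method"><dt>second()</dt></dl>
  </dd>
</dl>
"""
soup = BeautifulSoup(html, "html.parser")
parts = soup.find_all("dl", class_="class")

for parent in parts:
    # find_all() looks only below `parent`, so `parent` itself is never replaced.
    for elem in parent.find_all("dl", class_="method"):
        placeholder = soup.new_tag("p")
        placeholder.string = "***removed something here***"
        elem.replace_with(placeholder)

head = parts[0]
kept_head = head.name == "dl" and "class" in head["class"]
nested_left = len(head.find_all("dl"))   # nested <dl> elements still present
replaced = len(head.find_all("p"))       # placeholders inserted
```

Here the outer `dl` stays in place while both nested `method` entries become placeholder paragraphs.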

.descendants in BeautifulSoup doesn't seem to work as expected
