.descendants in BeautifulSoup does not seem to work as expected

Asked: 2014-04-22 21:02:02

Tags: python html html-parsing beautifulsoup

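I am trying to replace some elements (class: method) in a long HTML page using .replaceWith. To do so, I iterate over .descendants and check whether each dl element is one I am looking for. But this only works for 0 <= X <= 2 elements that sit next to each other. Every 3rd to nth element in a row is "ignored". Running the same code twice results in 4 replaced dl elements in a row, and so on.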

for elem in matches:
 for child in elem.descendants:
    if not isinstance(child, NavigableString) and child.dl is not None  and 'method' in child.dl.get('class'):
         text = "<p>***removed something here***</p>"
         child.dl.replaceWith(BeautifulSoup(text))

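A (very stupid) solution would be to find the largest number of dl elements in a row, divide it by 2, and run the loop that many times. I would like a smart (fast) solution for this and, more importantly, to understand what is going wrong here.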

EDIT: The HTML page I use for testing is this one: https://docs.python.org/3/library/stdtypes.html. The error can be seen in section 4.7.1, String Methods (there are a lot of methods there).

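EDIT_2: I do not work on that whole HTML page, though, only on parts of it. The HTML parts are stored in a list, and I only want dl elements to be "removed" if they are not the first HTML element of a part (so I want to keep a dl element if it is the head of its part).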

This is what my code actually looks like:

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup(open("/home/sven/Bachelorarbeit/python-doc-extractor-for-cado/extractor-application/index.html"))
f = open('test.html','w')    # mode 'w' creates the file if missing and truncates it
matches=[]

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements

sections = soup.find_all(['div'], attrs = {'class':'section'})   #grab all section-elements

matches = dl_elems + sections   #merge the lists to get all results

for elem in matches:
  for child in elem.descendants:
      if not isinstance(child, NavigableString) and child.dl is not None  and 'method' in child.dl.get('class'):
           text = "<p>***removed something here***</p>"
           child.dl.replaceWith(BeautifulSoup(text))


print(matches,file=f)
f.close()

1 Answer:

Answer 0 (score: 1)

The idea is to find all dl elements with class="method" and replace each of them with a p tag:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Tag

# get the html (Python 3, matching the print(..., file=f) call in the question)
url = "https://docs.python.org/3/library/stdtypes.html"
soup = BeautifulSoup(urlopen(url), "html.parser")

# replace all `dl` elements with `method` class
for elem in soup('dl', class_='method'):
    tag = Tag(name='p')
    tag.string = '***removed something here***'
    elem.replace_with(tag)

print(soup.prettify())
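
As for what probably goes wrong in the original loop (a sketch, not a confirmed diagnosis): `child.dl` is shorthand for `child.find('dl')`, so it only ever returns the first dl below `child`. Each tag the loop visits can therefore replace at most one dl, which would explain why only the first few elements in a run of siblings disappear per pass. The HTML snippet below is made up for illustration; it contrasts the shorthand with `find_all`, which returns every match as a list that is safe to loop over while replacing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one section of the docs page
html = """
<div class="section">
  <dl class="method"><dt>a</dt></dl>
  <dl class="method"><dt>b</dt></dl>
  <dl class="method"><dt>c</dt></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", class_="section")

# Attribute access finds only the FIRST matching descendant
first_only = section.dl

# find_all collects every matching descendant up front
all_methods = section.find_all("dl", class_="method")

for dl in all_methods:
    placeholder = soup.new_tag("p")
    placeholder.string = "***removed something here***"
    dl.replace_with(placeholder)

remaining = len(soup.find_all("dl"))
replaced = len(soup.find_all("p"))
```

With three sibling dl elements, `section.dl` points at the first one only, while the `find_all` loop replaces all three in a single pass.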

UPD (adapted to the question edits):

dl_elems = soup.find_all(['dl'], attrs={'class': ['class', 'method','function','describe', 'classmethod', 'staticmethod']})   # grab all possible dl-elements
sections = soup.find_all(['div'], attrs={'class': 'section'})   #grab all section-elements

for parent in dl_elems + sections:
    for elem in parent.find_all('dl', {'class': 'method'}):
        tag = Tag(name='p')
        tag.string = '***removed something here***'
        elem.replace_with(tag)

print(dl_elems + sections)
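
The reason this keeps a dl that is itself the head of a part is that `find_all` searches only the descendants of the tag it is called on, never the tag itself. A minimal sketch on made-up HTML (the `id` values are hypothetical and only there to make the result easy to check):

```python
from bs4 import BeautifulSoup

# Hypothetical part whose head element is itself a dl.method
html = """
<dl class="method" id="head">
  <dt>outer</dt>
  <dd>
    <dl class="method" id="nested"><dt>inner</dt></dl>
  </dd>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")
head = soup.find("dl", id="head")

# Only dl elements BELOW head are matched; head itself is not
for nested in head.find_all("dl", class_="method"):
    placeholder = soup.new_tag("p")
    placeholder.string = "***removed something here***"
    nested.replace_with(placeholder)

kept_ids = [dl.get("id") for dl in soup.find_all("dl")]
```

After the loop, only the head dl is left in the tree and the nested one has been swapped for the placeholder paragraph.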