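BeautifulSoup - Combine consecutive tags

Date: 2018-04-25 15:38:01

Tags: python html beautifulsoup

I have to work with the dirtiest HTML, where individual words are split into separate tags, as in the following example: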

<b style="mso-bidi-font-weight:normal"><span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span></b><b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span></b>

It's a bit hard to read, but basically the word "INTRODUCTION" is split into

<b><span>I</span></b> 

<b><span>NTRODUCTION</span></b>

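with the same inline attributes on both the span and b tags.

What would be a good way to combine these? I figured I would loop through to find consecutive b tags like this, but I'm stuck on how to actually merge them.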

for b in soup.find_all('b'):
    nxt = b.next_sibling  # None when b is the last child
    if getattr(nxt, 'name', None) == 'b':
        pass  # combine b and nxt here??

Any ideas?

Edit: the expected output is as follows

<b style="mso-bidi-font-weight:normal"><span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>INTRODUCTION</span></b>

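2 Answers:

Answer 0: (score: 3)

Maybe you could check whether b.previousSibling is a b tag, and if so append the inner text of the current node to that tag. After doing that, you should be able to remove the current node from the tree with b.decompose.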
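A minimal sketch of that idea, assuming the two b tags are directly adjacent (no whitespace between them, as in the question's HTML) and that each wraps a single span. The helper name `merge_adjacent_b` is made up for illustration. Note that this keeps the first tag's attributes, whereas the expected output in the question keeps the second tag's, so adjust to taste.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def merge_adjacent_b(soup):
    """Fold each <b> into the <b> immediately before it (hypothetical helper)."""
    # find_all() returns a list built up front, so removing tags while looping is safe.
    for b in soup.find_all('b'):
        prev = b.previous_sibling
        # Merge only when the previous node is itself a <b> tag that wraps a <span>.
        if isinstance(prev, Tag) and prev.name == 'b' and prev.span is not None:
            prev.span.append(b.get_text())  # append the current text to the earlier tag
            b.decompose()                   # remove the now-redundant tag from the tree
    return soup

html = '<p><b><span>I</span></b><b><span>NTRODUCTION</span></b></p>'
soup = merge_adjacent_b(BeautifulSoup(html, 'html.parser'))
print(soup)
```

Because a decomposed tag leaves the tree, a third consecutive b tag would see the first one as its previous sibling and be folded into it as well.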

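Answer 1: (score: 2)

The following solution combines the text from all the selected <b> tags into one <b> tag of your choice and decomposes the others.

If you only want to merge the text of consecutive tags, follow Danny's approach.

Code: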

from bs4 import BeautifulSoup

html = '''
<div id="wrapper">
  <b style="mso-bidi-font-weight:normal">
    <span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
  </b>
  <b style="mso-bidi-font-weight:normal">
    <span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
  </b>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper')  # it contains b tags to combine
b_tags = container.find_all('b')

# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)

# here you choose a tag you want to preserve and update its text
b_main = b_tags[0]  # you can target it however you want, I just take the first one from the list
b_main.span.string = text  # replace the text

for tag in b_tags:
    if tag is not b_main:
        tag.decompose()

print(soup)
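One caveat when restricting the merge to consecutive tags: in indented HTML like the sample above, each b tag's direct sibling is a whitespace-only text node, not the neighbouring b tag. The sketch below is one possible way to handle that, not part of the original answer. The helper `previous_b` and the extra `<p>` and second heading in the sample are assumptions added for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

def previous_b(tag):
    """Return the <b> just before `tag`, skipping whitespace-only text (hypothetical helper)."""
    node = tag.previous_sibling
    while isinstance(node, NavigableString) and not node.strip():
        node = node.previous_sibling
    if isinstance(node, Tag) and node.name == 'b':
        return node
    return None

html = '''<div>
  <b><span>I</span></b>
  <b><span>NTRODUCTION</span></b>
  <p>Body text</p>
  <b><span>S</span></b>
  <b><span>UMMARY</span></b>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

for b in soup.find_all('b'):
    target = previous_b(b)
    # Merge only into a preceding <b> that wraps a <span>; the <p> breaks the run.
    if target is not None and target.span is not None:
        target.span.append(b.get_text(strip=True))
        b.decompose()

headings = [b.get_text(strip=True) for b in soup.find_all('b')]
print(headings)
```

Unlike the version above that joins every b tag in the container, this keeps separate runs apart, so the two headings stay distinct.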

Any comments are appreciated.