Question

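I have an HTML document containing a bunch of <div>s, each with a child <p> and <a href> elements. The goal is to:

Remove the <div> and <p> tags
Add a </br> at the end of each removed <div>

Example

This: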

<div> 
  <p>
    <a href="" id="tnt1">[1]</a>"RFC 4456 - BGP Route Reflection: An Alternative to Full ... - IETF Tools.">ref="https://example.com">https://https://example.com"</a></span><span>. Accessed 15 Nov. 2017.
  </p>
</div>

becomes this:

<a href="" id="tnt1">[1]</a>"RFC 4456 - BGP Route Reflection: An Alternative to Full ... - IETF Tools.">ref="https://example.com">https://https://example.com"</a></span><span>. Accessed 15 Nov. 2017.
</br>

Current

So far, my code is:

import re
from bs4 import BeautifulSoup

# `soup` is assumed to be a BeautifulSoup object built from the document
for div in soup.find_all(name=re.compile(r'div')):
    print(div)

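However, all the examples I have found deal with replacing the inner text rather than the actual tags. Also, if there is a way to do this in bs3, that would be ideal, since all my other code currently uses v3.

Can anyone point me in the right direction? Thanks.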

Answer 1

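In bs4, ''.join(str(x) for x in div.p.contents) gives the inner HTML as a string.

I keep parent = div.parent to use later.

With div.extract() I remove the div together with all its child tags.

With parent.append() I put the inner HTML back.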

from bs4 import BeautifulSoup

data = '''<strong>
<div> 
  <p>
    <a href="" id="tnt1">[1]</a>"RFC 4456 - BGP Route Reflection: An Alternative to Full ... - IETF Tools.">ref="https://example.com">https://https://example.com"</a></span><span>. Accessed 15 Nov. 2017.
  </p>
</div>
</strong>'''

soup = BeautifulSoup(data, 'html.parser')

for div in soup.find_all('div'):
    parent = div.parent

    inner = ''.join(str(x) for x in div.p.contents) + "<br/>"
    print('--- inner ---')
    print(inner)

    # remove div with all subtags
    div.extract()

    parent.append(BeautifulSoup(inner, 'html.parser'))
    print('--- after ---')
    print(parent)

Result:

--- inner ---

<a href="" id="tnt1">[1]</a>"RFC 4456 - BGP Route Reflection: An Alternative to Full ... - IETF Tools.">ref="https://example.com">https://https://example.com"<br/>
--- after ---
<strong>

<a href="" id="tnt1">[1]</a>"RFC 4456 - BGP Route Reflection: An Alternative to Full ... - IETF Tools."&gt;ref="https://example.com"&gt;https://https://example.com"<br/></strong>
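
One thing to watch with the extract-and-append approach: parent.append() puts the content at the end of the parent, so if the div has siblings after it, the content moves behind them. A minimal sketch of an alternative that keeps everything in place, assuming bs4 (where Tag.unwrap(), Tag.insert_after() and soup.new_tag() are available); the helper name strip_div_wrappers and the sample string are made up for illustration:

```python
from bs4 import BeautifulSoup


def strip_div_wrappers(html):
    """Drop every <div> and its direct <p> children, keeping their contents
    in place and adding a <br/> where each <div> ended.

    Hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, 'html.parser')
    for div in soup.find_all('div'):
        # Put the <br/> right after the div so it marks where the div ended.
        div.insert_after(soup.new_tag('br'))
        # Replace each direct <p> child with its own contents.
        for p in div.find_all('p', recursive=False):
            p.unwrap()
        # Replace the div itself with its contents.
        div.unwrap()
    return str(soup)


# Illustrative input: a div followed by a sibling that must stay after it.
sample = '<strong><div><p><a href="" id="tnt1">[1]</a> text</p></div><em>after</em></strong>'
print(strip_div_wrappers(sample))
```

This only unwraps <p> tags that are direct children of a <div>; it does not cover bs3, which as far as I know has no unwrap().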
