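I have to parse some nasty government-created HTML (http://www.spokanecounty.org/detentionservices/inmateroster/detail2.aspx?sysid=84060), and to ease my pain I'd like to insert some HTML fragments into the document to wrap parts of the content into more digestible chunks.
However, BS4 escapes the HTML string fragment I try to insert (<div class="case">) and turns it into:
&lt;div class="case"&gt;
The relevant HTML I'm parsing is: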
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/22/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121018261' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table>
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
</div>
<div style='width:45%; float:left;'>
<h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
</div>
<div style='width:45%;float:right; text-align:right;'>
<div>Added: 10/21/2012</div>
</div>
<div style='width:100%;clear:both;'>
<b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
</div>
<table class='bookinfo 121037010' style='width:100%;'>
<tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table>
The Python code looks like this:
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for c in case_top:
    c.insert_before(soup.new_string('<div class="case">'))
case_bottom = soup.find_all("table", class_="bookinfo")
for c in case_bottom:
    c.insert_after(soup.new_string('</div>'))
The result is:
&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/22/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121018261" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr><tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr></table>&lt;/div&gt;&lt;div class="case"&gt;<div style="float:left; width:100%;border-top:solid 1px #666;height:5px;"> </div><div style="width:45%; float:left;"><h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2></div><div style="width:45%;float:right; text-align:right;"><div>Added: 10/21/2012</div></div><div style="width:100%;clear:both;"><b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court</div><table class="bookinfo 121037010" style="width:100%;"><tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE)<br/><b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr></table>&lt;/div&gt;
So the question is: how do I insert unescaped HTML fragments into the document?
Answer 0 (score: 1)
You are telling BeautifulSoup to insert string data:
c.insert_before(soup.new_string('<div class="case">'))
Anything that is not safe as HTML string data is then escaped. You want to insert a tag object instead:
c.insert_before(soup.new_tag('div', **{'class': 'case'}))
This creates a new, empty element in front of c; it does not actually wrap anything.
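To make the difference concrete, here is a minimal sketch on a made-up one-paragraph document (the sample markup and the variable names as_string and as_tag are illustrative assumptions, not part of the original page). It inserts the same fragment once as a string and once as a tag, and serializes the body after each step:

```python
from bs4 import BeautifulSoup

# Tiny stand-in document; 'html.parser' is chosen here only to avoid
# depending on an optional third-party parser.
soup = BeautifulSoup('<body><p>hello</p></body>', 'html.parser')
p = soup.p

# A string node is text, so its markup characters are escaped on output.
p.insert_before(soup.new_string('<div class="case">'))
as_string = str(soup.body)

# A tag node is real markup, but it is an empty sibling of <p>, not a wrapper.
p.insert_before(soup.new_tag('div', **{'class': 'case'}))
as_tag = str(soup.body)

print(as_string)
print(as_tag)
```

The first print shows the fragment as escaped text; the second shows an empty div sitting next to the paragraph rather than around it.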
If you want to wrap an element in it, you can use the Element.wrap() method instead:
c.wrap(soup.new_tag('div', **{'class': 'case'}))
But this only works on one tag at a time.
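As a small illustration of that limitation, the sketch below wraps a single table in a made-up document (the markup is an assumption for demonstration). Only the table ends up inside the new div; its sibling stays outside:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><table class="bookinfo"></table><p>after</p></body>',
    'html.parser',
)
table = soup.find('table', class_='bookinfo')

# wrap() replaces the table with the new div and moves the table inside it.
wrapper = table.wrap(soup.new_tag('div', **{'class': 'case'}))
result = str(soup.body)
print(result)
```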
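To wrap a series of tags, the only option is to move them; inserting a tag that already lives in one place into another place effectively moves it there: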
case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
for case in case_top:
    wrapper = soup.new_tag('div', **{'class': 'case'})
    case.insert_before(wrapper)
    while wrapper.next_sibling:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            # moved over the bookinfo table, time to stop
            break
This moves everything from each case_top element up to and including the following <table class="bookinfo"> element into a new <div class="case"> element.
Demo:
>>> from bs4 import BeautifulSoup
>>> import re
>>> sample = '''\
... <body>
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121018261>Case Number: 121018261</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/22/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> $2,000,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121018261' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br /> <b>Report Number:</b> 120160423 <b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... <tr><td><b>Charge 2 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=' target='_blank'>RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br /><b>Report Number:</b> 120160423<b style='margin-left:10px;'>Report Agency:</b> 002 - SPOKANE CITY</td></tr>
... </table>
...
... <div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'>
...
... </div>
... <div style='width:45%; float:left;'>
... <h2 style='margin-top:0px;' rel=121037010>Case Number: 121037010</h2>
... </div>
... <div style='width:45%;float:right; text-align:right;'>
... <div>Added: 10/21/2012</div>
... </div>
... <div style='width:100%;clear:both;'>
... <b>Case Bond:</b> $150,000.00 <b style='margin-left:10px;'>Set By:</b> Spokane County Superior Court
... </div>
... <table class='bookinfo 121037010' style='width:100%;'>
... <tr><td><b>Charge 1 <a href='http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050' target='_blank'>RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br /> <b>Report Number:</b> 120345597 <b style='margin-left:10px;'>Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
... </table>
... </body>
... '''
>>> soup = BeautifulSoup(sample)
>>> case_top = soup.find_all(style=re.compile("border-top:solid 1px #666"))
>>> for case in case_top:
...     wrapper = soup.new_tag('div', **{'class': 'case'})
...     case.insert_before(wrapper)
...     while wrapper.next_sibling:
...         wrapper.append(wrapper.next_sibling)
...         if wrapper.find('table', class_='bookinfo'):
...             # moved over the bookinfo table, time to stop
...             break
...
>>> soup.body
<body><div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121018261" style="margin-top:0px;">Case Number: 121018261</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/22/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $2,000,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121018261" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG <br/> <b>Report Number:</b> 120160423 <b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
<tr><td><b>Charge 2 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=" target="_blank">RCW: 9A.56.210</a>:</b> ROBBERY-2ND DEG<br/><b>Report Number:</b> 120160423<b style="margin-left:10px;">Report Agency:</b> 002 - SPOKANE CITY</td></tr>
</table></div>
<div class="case"><div style="float:left; width:100%;border-top:solid 1px #666;height:5px;">
</div>
<div style="width:45%; float:left;">
<h2 rel="121037010" style="margin-top:0px;">Case Number: 121037010</h2>
</div>
<div style="width:45%;float:right; text-align:right;">
<div>Added: 10/21/2012</div>
</div>
<div style="width:100%;clear:both;">
<b>Case Bond:</b> $150,000.00 <b style="margin-left:10px;">Set By:</b> Spokane County Superior Court
</div>
<table class="bookinfo 121037010" style="width:100%;">
<tr><td><b>Charge 1 <a href="http://apps.leg.wa.gov/rcw/default.aspx?cite=9A.44.050" target="_blank">RCW: 9A.44.050(1)(A)</a>:</b> RAPE-2ND(FORCIBLE) <br/> <b>Report Number:</b> 120345597 <b style="margin-left:10px;">Report Agency:</b> 001 - SPOKANE COUNTY</td></tr>
</table></div>
</body>
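Once the cases are wrapped, each <div class="case"> can be processed on its own. The following is only a sketch of how that might look: the sample is a trimmed-down version of the markup above, and the cases list with its 'number' and 'charges' keys is a hypothetical structure chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Trimmed-down sample; only the parts needed for the illustration are kept.
sample = """<body>
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'></div>
<div><h2 rel=121018261>Case Number: 121018261</h2></div>
<table class='bookinfo 121018261'><tr><td>ROBBERY-2ND DEG</td></tr><tr><td>ROBBERY-2ND DEG</td></tr></table>
<div style='float:left; width:100%;border-top:solid 1px #666;height:5px;'></div>
<div><h2 rel=121037010>Case Number: 121037010</h2></div>
<table class='bookinfo 121037010'><tr><td>RAPE-2ND(FORCIBLE)</td></tr></table>
</body>"""

soup = BeautifulSoup(sample, 'html.parser')

# Same wrapping loop as above: move siblings into the wrapper until the
# bookinfo table has been moved.
for top in soup.find_all(style=re.compile("border-top:solid 1px #666")):
    wrapper = soup.new_tag('div', **{'class': 'case'})
    top.insert_before(wrapper)
    while wrapper.next_sibling is not None:
        wrapper.append(wrapper.next_sibling)
        if wrapper.find('table', class_='bookinfo'):
            break

# Each wrapper now holds exactly one case, so lookups can be scoped to it.
cases = []
for case in soup.find_all('div', class_='case'):
    heading = case.h2.get_text(strip=True)  # e.g. "Case Number: 121018261"
    cases.append({
        'number': heading.split(':', 1)[1].strip(),
        'charges': [td.get_text(strip=True) for td in case.find_all('td')],
    })

print(cases)
```

Scoping the searches to one wrapper at a time avoids having to pair headings and tables by position in the flat document.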