Here is a sample of the HTML I want to clean up:
<figure class="floatRight" style="margin-left: 30px">
<a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a>
<figcaption></figcaption>
</figure>
<p>
<a name="N65743"></a>
</p><h3>Abstract</h3>
<p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p>
<div class="articleKeywords">
<a name="N65760"></a>
<h3>Key words</h3>
5-iodoindolizines -
Sonogashira reaction -
5-ethynylindolizine -
X-ray
</div>
<a name="N67312"></a>
<h3>Supporting Information</h3>
<ul class="linkList">Supporting information for this article is available online at http://dx.doi.org/10.1055/s-0034-1378861.<li>
<a class="gotolink" href="https://www.thieme-connect.de/media/synthesis/EFirst/supmat/sup_ss-2015-c0259-st_10-1055_s-0034-1378861.pdf">Supporting Information</a>
</li>
</ul>
What I am doing is basically this:
from bs4 import BeautifulSoup

with open("test.xml", 'r') as file:
    soup = BeautifulSoup(file.read(), "lxml")

abstract = soup
# Each line is meant to remove every tag matching the given filter
[tag.extract() for tag in abstract("a", attrs={"name": True})]
[tag.extract() for tag in abstract("h3")]
[tag.extract() for tag in abstract("ul", attrs={"class": "linkList"})]
[tag.extract() for tag in abstract("a", attrs={"class": "gotolink"})]
print(abstract)
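I expected each of the extract() lines to remove every matching tag. However, only the first one has any effect! I can get rid of the "a" tags, but not the "h3" tags. If I comment out the first extract line (the one for the "a" tags), I can get rid of the "h3" tags, but not the others.
This is a bit strange. Do you have any idea why it happens?
I am using BeautifulSoup4 4.4.0, freshly installed from pip.

Answer 0 (score: 0)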
The trick is to create a new Beautiful Soup object after each extraction and run the next extraction on that new object.
It may look a little ugly, but it does work:
clean.py
from bs4 import BeautifulSoup

with open("test.xml", 'r') as file:
    soup = BeautifulSoup(file.read(), "lxml")

abstract = soup
[tag.extract() for tag in abstract("a", attrs={"name": True})]
# Re-parse the serialized tree so the next extraction runs on a fresh object
abstract = BeautifulSoup(str(abstract), "lxml")
[tag.extract() for tag in abstract("h3")]
abstract = BeautifulSoup(str(abstract), "lxml")
[tag.extract() for tag in abstract("ul", attrs={"class": "linkList"})]
abstract = BeautifulSoup(str(abstract), "lxml")
[tag.extract() for tag in abstract("a", attrs={"class": "gotolink"})]
print(abstract)
Output
Before cleaning
(bs4extract)macbook:bs4extract joeyoung$ cat test.xml
<figure class="floatRight" style="margin-left: 30px">
<a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a>
<figcaption></figcaption>
</figure>
<p>
<a name="N65743"></a>
</p><h3>Abstract</h3>
<p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p>
<div class="articleKeywords">
<a name="N65760"></a>
<h3>Key words</h3>
5-iodoindolizines -
Sonogashira reaction -
5-ethynylindolizine -
X-ray
</div>
<a name="N67312"></a>
<h3>Supporting Information</h3>
<ul class="linkList">Supporting information for this article is available online at http://dx.doi.org/10.1055/s-0034-1378861.<li>
<a class="gotolink" href="https://www.thieme-connect.de/media/synthesis/EFirst/supmat/sup_ss-2015-c0259-st_10-1055_s-0034-1378861.pdf">Supporting Information</a>
</li>
</ul>
After cleaning
(bs4extract)macbook:bs4extract joeyoung$ python clean.py
<html><body><figure class="floatRight" style="margin-left: 30px">
<a class="zoomFunction alignLeft" href="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"><img src="https://www.thieme-connect.de/media/synthesis/EFirst/lookinside/thumbnails/ss-2015-c0259-st_10-1055_s-0034-1378861-1.jpg"/></a>
<figcaption></figcaption>
</figure>
<p>
</p>
<p>2-<span class="i">tert</span>-Butyl-5-iodoindolizine underwent Sonogashira reaction with acetylenes in the presence of dichlorobis(triphenylphosphine)palladium, copper(I) iodide, and triethylamine in acetonitrile to give to the corresponding 5-ethynylindolizines in high yields; 5-iodo-2-phenylindolizine and 5-bromo-2-<span class="i">tert</span>-butylindolizine did not undergo the reaction. Several structures were characterized by X-ray. The 5-ethynylindolizines did not undergo cyclization to give cycl[3.2.2]azines.</p>
<div class="articleKeywords">
5-iodoindolizines -
Sonogashira reaction -
5-ethynylindolizine -
X-ray
</div>
</body></html>
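The re-parse-after-each-step workaround above can also be written as a loop, which avoids repeating the same two lines for every filter. This is only a sketch: the names `SELECTORS` and `clean` are made up for illustration, and it defaults to the built-in "html.parser" so that it runs without lxml installed (pass "lxml" to match clean.py more closely).

```python
from bs4 import BeautifulSoup

# Hypothetical table of (tag name, attribute filter) pairs,
# mirroring the four extract() calls from the question.
SELECTORS = [
    ("a", {"name": True}),
    ("h3", {}),
    ("ul", {"class": "linkList"}),
    ("a", {"class": "gotolink"}),
]


def clean(html, parser="html.parser"):
    """Remove every tag matched by SELECTORS, re-parsing after each pass."""
    soup = BeautifulSoup(html, parser)
    for name, attrs in SELECTORS:
        for tag in soup.find_all(name, attrs=attrs):
            tag.extract()
        # Serialize and re-parse so the next pass starts from a fresh object
        soup = BeautifulSoup(str(soup), parser)
    return soup


if __name__ == "__main__":
    with open("test.xml", "r") as file:
        print(clean(file.read()))
```

The re-parse costs one extra serialization per filter, which should be negligible for a page of this size.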
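
Answer 1 (score: 0)
Sorry, folks: this bug actually comes from BeautifulSoup itself. After downgrading to 4.3.2-3, the exact same code works fine. I will report it. Sorry for not checking before posting.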
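To check whether a given installation is affected, something like the following could be run. It is a rough sketch, not an official test: `SAMPLE` is a made-up miniature of the page from the question, `leftover_tags` is a hypothetical helper, and the built-in "html.parser" is assumed instead of lxml so the script has no extra dependency. On a working version the list of leftover tags should be empty.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical miniature of the page from the question
SAMPLE = (
    '<p><a name="N1"></a></p><h3>Abstract</h3><p>Body text</p>'
    '<a name="N2"></a><h3>Supporting Information</h3>'
    '<ul class="linkList"><li><a class="gotolink" href="#">PDF</a></li></ul>'
)


def leftover_tags(html):
    """Run the question's extract() sequence on one soup object and
    return the names of the tags that should be gone but are still there."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("a", attrs={"name": True}):
        tag.extract()
    for tag in soup.find_all("h3"):
        tag.extract()
    for tag in soup.find_all("ul", attrs={"class": "linkList"}):
        tag.extract()
    remaining = list(soup.find_all(["h3", "ul"]))
    remaining += list(soup.find_all("a", attrs={"name": True}))
    return [tag.name for tag in remaining]


print("bs4 version:", bs4.__version__)
print("tags left behind:", leftover_tags(SAMPLE))
```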