How do I separate tags that contain child tags from empty tags in Beautiful Soup 4?

Time: 2018-12-25 18:58:31

Tags: python html beautifulsoup

<a id="filepos10190"></a>
<a id="filepos10190">

<font size="6" color="#002984"><b>abashed </b></font> <div width="9"><i> 
<font color="green"> adj.</font></i></div> <div width="18"><font 
color="chocolate"><b>VERBS </b></font></div> <div width="27"><font 
color="gray">▪</font> <font color="darkslateblue"><b>be</b></font>, <font 
color="darkslateblue"><b>look</b></font></div> <div width="18"><font 
color="chocolate"><b>ADVERB </b></font></div> <div width="27"><font 
color="gray">▪</font> <font color="darkslateblue"><b>a little</b></font>, 
<font color="darkslateblue"><b>slightly</b></font>, <font 
color="darkslateblue"><b>etc.</b></font></div> <div width="27"><font 
color="gray">▪</font> <font color="darkslateblue"><b>suitably</b></font> 
</div> <div width="36"><font color="lightgray">▪</font> <span><font 
color="#595959">He glanced at Juliet accusingly and she looked suitably 
<u>~</u>.</font></span></div> 

</a>

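There are two anchor tags here: one has no inner tags, while the other contains many child tags. If I only want the one that contains child tags, how can I tell the two apart when scraping?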

2 answers:

Answer 0 (score: 1)

from bs4 import BeautifulSoup

content="""
<a id="filepos10190"></a>
<a id="filepos10190">

<font size="6" color="#002984"><b>abashed </b></font> <div width="9"><i>
<font color="green"> adj.</font></i></div> <div width="18"><font
color="chocolate"><b>VERBS </b></font></div> <div width="27"><font
color="gray">▪</font> <font color="darkslateblue"><b>be</b></font>, <font
color="darkslateblue"><b>look</b></font></div> <div width="18"><font
color="chocolate"><b>ADVERB </b></font></div> <div width="27"><font
color="gray">▪</font> <font color="darkslateblue"><b>a little</b></font>,
<font color="darkslateblue"><b>slightly</b></font>, <font
color="darkslateblue"><b>etc.</b></font></div> <div width="27"><font
color="gray">▪</font> <font color="darkslateblue"><b>suitably</b></font>
</div> <div width="36"><font color="lightgray">▪</font> <span><font
color="#595959">He glanced at Juliet accusingly and she looked suitably
<u>~</u>.</font></span></div>

</a>"""

soup = BeautifulSoup(content, 'html.parser')
tags = soup.find_all('a')  # collect every anchor tag in the document
filtered_tag = [i for i in tags if list(i.children)]  # keep anchors with at least one child node; empty anchors are dropped
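
One caveat with the list comprehension above: `.children` yields text nodes as well as tags, so an anchor that holds only text would also pass the check. Below is a minimal sketch of the difference. It uses a made-up three-anchor snippet (not the markup from the question) and swaps in `find(True)`, which matches only tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not the asker's data.
html = '<a id="empty"></a><a id="text">plain text</a><a id="nested"><b>bold</b></a>'
soup = BeautifulSoup(html, "html.parser")
anchors = soup.find_all("a")

# .children includes text nodes, so the text-only anchor is kept too.
with_any_child = [a["id"] for a in anchors if list(a.children)]

# find(True) matches only Tag objects, so the text-only anchor is dropped.
with_child_tag = [a["id"] for a in anchors if a.find(True) is not None]

print(with_any_child)
print(with_child_tag)
```

For the markup in the question both checks give the same result, since the second anchor contains real child tags.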

Answer 1 (score: 1)

You can actually do this in a single call:

soup.find_all(lambda tag: tag.name == 'a' and tag.find())

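tag.find() tries to find any element inside tag and returns None when there is none. Only one of the two anchors contains an element, so only that anchor is returned.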
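
The same predicate can be inverted to pick out the empty anchors instead. This is a rough sketch: the helper name `split_anchors` and the shortened sample markup are illustrative assumptions, and `find(True)` stands in for the bare `find()` to make "any tag" explicit.

```python
from bs4 import BeautifulSoup

def split_anchors(soup):
    """Return (anchors with child tags, anchors without child tags)."""
    def has_child_tag(tag):
        return tag.name == "a" and tag.find(True) is not None

    def no_child_tag(tag):
        return tag.name == "a" and tag.find(True) is None

    # find_all accepts a function and calls it once for every tag.
    return soup.find_all(has_child_tag), soup.find_all(no_child_tag)

# Shortened, hypothetical version of the markup in the question.
html = '<a id="filepos10190"></a><a id="filepos10190"><font><b>abashed </b></font></a>'
soup = BeautifulSoup(html, "html.parser")
nested, empty = split_anchors(soup)
print(len(nested), len(empty))
```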