I am using BeautifulSoup on an HTML fragment, as shown below:
s = """<div class="views-row views-row-1 views-row-odd views-row- first">
<span class="views-field views-field-title">
<span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
</span>
</span>
<span class="views-field views-field-created">
<span class="field-content">Friday, March 20, 2015
</span>
</span>
</div>"""
soup = BeautifulSoup(s)
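Why does s.span return only the first span tag?
Also, s.contents returns a list of length 4. Both span tags are in the list, but the items at index 0 and index 2 are "\n" newline characters. The newline characters are useless. Is there a reason it works this way?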
Answer 0 (score: 3)
Why does s.span return only the first span tag?
s.span is a shortcut for s.find('span'), which finds only the first occurrence of the span tag.
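The equivalence can be checked with a small sketch. The two-span fragment and the variable names below are made up for illustration, and the "html.parser" backend is an assumption (the question does not say which parser was used):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two sibling spans (not the question's HTML).
html = """<div>
<span class="a">first</span>
<span class="b">second</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first_by_attribute = soup.span      # attribute-style access
first_by_find = soup.find("span")   # explicit search for the first match
all_spans = soup.find_all("span")   # every match, in document order

# Both lookups return the same Tag object from the parse tree.
print(first_by_attribute is first_by_find)
print(first_by_attribute.get_text())
print(len(all_spans))
```

Attribute access is convenient for the first match only; anything beyond that needs find_all() or a selector.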
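Also, s.contents returns a list of length 4. Both span tags are in the list, but the items at index 0 and index 2 are "\n" newline characters. The newline characters are useless. Is there a reason it works this way?
By definition, .contents returns a list of all of an element's children, including text nodes, which are instances of the NavigableString class. The newlines between the tags in the source are text nodes, so they appear in the list.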
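As a rough illustration of the mix of node types, the sketch below walks .contents and then keeps only the Tag children. The fragment is a simplified stand-in for the question's HTML, and "html.parser" is again an assumed backend:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Simplified stand-in for the question's fragment.
html = """<div>
<span>one</span>
<span>two</span>
</div>"""
div = BeautifulSoup(html, "html.parser").div

# Each child is either a Tag or a NavigableString (here, the newlines).
for child in div.contents:
    print(type(child).__name__, repr(child))

# Option 1: filter the children by type.
tags_only = [child for child in div.contents if isinstance(child, Tag)]

# Option 2: ask for direct child tags only; True matches any tag name.
direct_children = div.find_all(True, recursive=False)

print([t.get_text() for t in tags_only])
```

Both options skip the whitespace-only text nodes without modifying the tree.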
If you want only the tags, you can use find_all():
soup.find_all()
And, if you want only the span tags:
soup.find_all('span')
Example:
>>> from bs4 import BeautifulSoup
>>> s = """<div class="views-row views-row-1 views-row-odd views-row- first">
... <span class="views-field views-field-title">
... <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
... </span>
... </span>
... <span class="views-field views-field-created">
... <span class="field-content">Friday, March 20, 2015
... </span>
... </span>
... </div>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> for span in soup.find_all('span'):
...     print(span.text.strip())
...
Love Heals
Love Heals
Friday, March 20, 2015
Friday, March 20, 2015
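The duplicates appear because the span elements are nested: each outer span's text includes the text of the span inside it. You can fix this in different ways. For example, you can search only the direct children of the div by passing recursive=False: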
>>> for span in soup.find('div', class_='views-row-1').find_all('span', recursive=False):
...     print(span.text.strip())
...
Love Heals
Friday, March 20, 2015
Alternatively, you can use CSS selectors:
>>> for span in soup.select('div.views-row-1 > span'):
...     print(span.text.strip())
...
Love Heals
Friday, March 20, 2015
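A further variation, offered only as a sketch: if the innermost spans can be assumed to always carry the field-content class (as they do in the question's fragment), they can be selected directly, which also avoids the duplicates. The shortened div class and the "html.parser" backend are assumptions made for this example:

```python
from bs4 import BeautifulSoup

# The question's fragment, with the div's class list shortened for brevity.
html = """<div class="views-row">
<span class="views-field views-field-title">
<span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
</span>
</span>
<span class="views-field views-field-created">
<span class="field-content">Friday, March 20, 2015
</span>
</span>
</div>"""
soup = BeautifulSoup(html, "html.parser")

# Match only the inner spans; strip=True trims the whitespace around the text.
texts = [span.get_text(strip=True) for span in soup.select("span.field-content")]
print(texts)
```

This depends on the class names staying stable, whereas the child-combinator selector above depends on the nesting depth; which is more robust depends on how the page is generated.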