Extra newline items in BeautifulSoup children

Time: 2015-03-24 23:52:37

Tags: python html beautifulsoup html-parsing

I am using BeautifulSoup on an HTML fragment like the following:

 s = """<div class="views-row views-row-1 views-row-odd views-row-  first">
            <span class="views-field views-field-title"> 
                <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
                </span> 
            </span>
            <span class="views-field views-field-created"> 
                <span class="field-content">Friday, March 20, 2015
                </span> 
           </span> 
</div>""" 

soup = BeautifulSoup(s)

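Why does s.span return only the first span tag?

Also, s.contents returns a list of length 4. Both span tags are in the list, but indexes 0 and 2 are "\n" newline characters. The newline characters are useless. Is there a reason it does this?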

1 answer:

Answer 0 (score: 3)

  

Why does s.span return only the first span tag?

s.span is a shortcut for s.find('span'), which finds only the first occurrence of a span tag.
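
The equivalence described above can be checked with a minimal sketch. The HTML fragment, the id attributes, and the variable names here are made up for illustration, and the "html.parser" backend is an assumption (any installed parser should behave the same for this input).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two sibling span tags.
html = '<div><span id="a">one</span><span id="b">two</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Attribute access and find() both return the first matching tag.
first_by_attribute = soup.div.span
first_by_find = soup.div.find("span")
same_tag = first_by_attribute is first_by_find

# find_all() returns every match instead of only the first one.
all_ids = [span["id"] for span in soup.div.find_all("span")]

print(same_tag, first_by_attribute["id"], all_ids)
```

Attribute access is convenient for quick navigation, but find_all() is the call to reach for when more than one matching tag is expected.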

  

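Also, s.contents returns a list of length 4. Both span tags are in the list, but indexes 0 and 2 are "\n" newline characters. The newline characters are useless. Is there a reason it does this?

By definition, .contents outputs a list of all of the element's children, including text nodes, which are instances of the NavigableString class.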
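
As a rough illustration of why the newlines show up, the sketch below lists the types of an element's children and then keeps only the Tag instances. The fragment and variable names are hypothetical, and "html.parser" is assumed as the backend.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment: the newlines between the tags become text nodes.
html = "<div>\n<span>one</span>\n<span>two</span>\n</div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents mixes whitespace text nodes (NavigableString) with tags.
child_types = [type(child).__name__ for child in div.contents]

# Keeping only Tag instances drops the whitespace-only text nodes.
tag_children = [child for child in div.contents if isinstance(child, Tag)]
tag_texts = [tag.get_text() for tag in tag_children]

print(child_types)
print(tag_texts)
```

The whitespace is kept because it is part of the document: the parser cannot know in general whether text between tags is significant, so it leaves the filtering to the caller.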

If you want only the tags, you can use find_all():

soup.find_all()

And, if you want only the span tags:

soup.find_all('span')

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<div class="views-row views-row-1 views-row-odd views-row-  first">
...             <span class="views-field views-field-title"> 
...                 <span class="field-content"><a href="/party-pictures/2015/love-heals">Love Heals</a>
...                 </span> 
...             </span>
...             <span class="views-field views-field-created"> 
...                 <span class="field-content">Friday, March 20, 2015
...                 </span> 
...            </span> 
... </div>""" 
>>> soup = BeautifulSoup(s, "html.parser")
>>> for span in soup.find_all('span'):
...     print(span.text.strip())
... 
Love Heals
Love Heals
Friday, March 20, 2015
Friday, March 20, 2015

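The duplicates appear because there are nested span elements. You can fix this in different ways. For example, you can restrict the search to the direct children of the div by passing recursive=False:
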
>>> for span in soup.find('div', class_='views-row-1').find_all('span', recursive=False):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015

Alternatively, you can use CSS selectors:

>>> for span in soup.select('div.views-row-1 > span'):
...     print(span.text.strip())
... 
Love Heals
Friday, March 20, 2015