BeautifulSoup: nested elements

Asked: 2017-03-14 07:46:11

Tags: python python-3.x web-scraping beautifulsoup

I want to use BeautifulSoup to extract the impact factor (0.806) from this HTML (a Springer journal description):

<div id="quick-facts-container" class="SideBox">
    <ul class="ListStack ListStack--float">
        <li>
            <span>Impact Factor</span>
            <span>0.806</span>
        </li>
        <li>
            <span>Available</span>
            <span>1996 - 2017</span>
        </li>
        <li>
            <span>Volumes</span>
            <span>22</span>
        </li>
        <li>
            <span>Issues</span>
            <span>265</span>
        </li>
    </ul>
</div>

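Because the markup is nested, I need the content of the second <span> inside the first <li>, and I don't know how to get it.

My Python script is simple: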

from bs4 import BeautifulSoup
import urllib.request
r = urllib.request.urlopen('file:///197.html').read()
soup = BeautifulSoup(r, 'html.parser')

2 answers:

Answer 0 (score: 0)

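If you only want the text of a document or tag, you can use the get_text() method. It returns all the text in the document or beneath the tag as a single Unicode string: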

In [6]: for li in soup.find('div', id='quick-facts-container').find_all('li'):
   ...:     print(li.get_text(strip=True))
   ...:     
Impact Factor0.806
Available1996 - 2017
Volumes22
Issues265
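Note that get_text(strip=True) on the whole <li> joins the label and the value with no separator. A minimal sketch of one way to keep them apart, assuming every <li> holds exactly two <span> elements as in the question's HTML (the inline html string and the facts dictionary are illustrative names, not part of the original script):

```python
from bs4 import BeautifulSoup

# Inline copy of the question's markup so the sketch runs without 197.html.
html = """
<div id="quick-facts-container" class="SideBox">
    <ul class="ListStack ListStack--float">
        <li><span>Impact Factor</span><span>0.806</span></li>
        <li><span>Available</span><span>1996 - 2017</span></li>
        <li><span>Volumes</span><span>22</span></li>
        <li><span>Issues</span><span>265</span></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

facts = {}
for li in soup.find('div', id='quick-facts-container').find_all('li'):
    # Assumption: each <li> contains exactly one label span and one value span.
    label, value = [span.get_text(strip=True) for span in li.find_all('span')]
    facts[label] = value

print(facts['Impact Factor'])
```

If an <li> ever contains a different number of spans, the unpacking raises a ValueError, which at least makes the layout change visible instead of silently returning the wrong field.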

Answer 1 (score: 0)

The following should work:

from bs4 import BeautifulSoup
import urllib.request

r = urllib.request.urlopen('file:///197.html').read()
soup = BeautifulSoup(r, 'html.parser')

# .li is the first <li> in the container; its two spans are the label and the value.
data = [i.text for i in soup.find(id='quick-facts-container').li.find_all('span')]
print("{} ({})".format(data[0], data[1]))

This will display:

Impact Factor (0.806)
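The approach above relies on "Impact Factor" being the first <li>. If the order of the entries might change, one alternative is to locate the label span by its text and then take the next sibling <span>. This is only a sketch: quick_fact is a hypothetical helper name, and the inline html string stands in for the contents of 197.html.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup, with the entries reordered.
html = """
<div id="quick-facts-container" class="SideBox">
    <ul class="ListStack ListStack--float">
        <li>
            <span>Volumes</span>
            <span>22</span>
        </li>
        <li>
            <span>Impact Factor</span>
            <span>0.806</span>
        </li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')


def quick_fact(soup, label):
    """Return the text of the <span> following the label span, or None."""
    container = soup.find(id='quick-facts-container')
    if container is None:
        return None
    # Match the span whose only text is exactly the label.
    label_span = container.find('span', string=label)
    if label_span is None:
        return None
    # Filtering by tag name skips the whitespace text nodes between the spans.
    value_span = label_span.find_next_sibling('span')
    return value_span.get_text(strip=True) if value_span else None


impact_factor = quick_fact(soup, 'Impact Factor')
print(impact_factor)
```

The helper returns a string; wrap the result in float() if a number is needed, after checking that it is not None.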