Question

I want to parse the source of the following page with BeautifulSoup: https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination. Specifically, I want to extract the .htm link held in the <span class='print-only'> element.

My Python code is as follows:

import urllib.request

try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination"
with urllib.request.urlopen(url) as page:
    html_source = page.read()
soup = BeautifulSoup(html_source, 'html5lib')
link = soup.findAll("span", {"class":"print-only"})

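Printing 'link' returns an empty list. I know there are span elements in the HTML, because soup.findAll("span") returns results (although nowhere in those span elements do I see a class named 'print-only').

I noticed that the span is greyed out in the Firefox developer tools window. A quick Google search suggests this means the element is hidden. Does that mean it cannot be retrieved with the method I am using?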

Answer 1

Here is a solution that gets what you need using BeautifulSoup. First, let's grab the table:

table = soup.find("table",{'id':'GridView1'})

Now we find the tr tags in its body:

>>> table.find('tbody').findAll('tr')[0]
<tr>
                <td class="text-center">
                    2009-2010
                </td><td class="text-left">Arthritis Body Measures</td><td class="text-center">
                    <a href="/Nchs/Nhanes/2009-2010/ARX_F.htm">ARX_F Doc</a>
                </td><td class="text-center">
                    <a href="/Nchs/Nhanes/2009-2010/ARX_F.XPT">ARX_F Data [XPT - 510.5 KB]</a>
                </td><td class="text-center">
                    September, 2011
                </td>
            </tr>

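Note that the span you are looking for is not there. I am showing the first row so you can see where the URL you need lives: it is the href of the first a tag in the row. For example: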

>>> table.find('tbody').findAll('tr')[0].find('a')
<a href="/Nchs/Nhanes/2009-2010/ARX_F.htm">ARX_F Doc</a>

All that is left is to write a list comprehension that collects the href attribute of the first a tag in each tr:

>>> trList = table.find('tbody').findAll('tr')
>>> lst = [tr.find('a')['href'] for tr in trList]

If we print the first few elements of lst, we see the output we want:

>>> lst[:3]
['/Nchs/Nhanes/2009-2010/ARX_F.htm', '/Nchs/Nhanes/1999-2000/AUX1.htm', '/Nchs/Nhanes/2001-2002/AUX_B.htm']
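
The hrefs collected above are site-relative, so they need the host prepended before they can be requested. A minimal sketch using the standard library's urllib.parse.urljoin; the base URL is the page URL from the question, the three paths are copied from the sample output, and the name absolute_urls is only illustrative:

```python
from urllib.parse import urljoin

# Page the links were scraped from (taken from the question).
base_url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination"

# Sample of the relative hrefs shown in the output above.
lst = [
    "/Nchs/Nhanes/2009-2010/ARX_F.htm",
    "/Nchs/Nhanes/1999-2000/AUX1.htm",
    "/Nchs/Nhanes/2001-2002/AUX_B.htm",
]

# urljoin resolves each root-relative path against the scheme and host of base_url.
absolute_urls = [urljoin(base_url, href) for href in lst]

for u in absolute_urls:
    print(u)
```

Using urljoin rather than string concatenation also handles hrefs that are already absolute or that are relative to the current directory.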

Answer 2

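Because the span element is hidden, you cannot retrieve it with BeautifulSoup. You may be able to use some other attribute to get the link you need. If you know the name of the .htm file whose link you want, you can find the 'a' element by its inner text (the text ties the link to the hidden span element) and extract the 'href' from that element, as follows: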

import string

import requests
from bs4 import BeautifulSoup

# Characters to keep; anything outside string.printable is dropped.
printable = set(string.printable)

def remove_non_ascii(s):
    # Return a str (not a filter object) so BeautifulSoup can parse it on Python 3.
    return ''.join(ch for ch in s if ch in printable)


url = 'https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination'
home_url = 'https://wwwn.cdc.gov'

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page = requests.get(url, headers=headers, allow_redirects=True)
soup = BeautifulSoup(remove_non_ascii(page.text), "html5lib")

# Find the anchor by its visible text, then build the absolute URL from its href.
link = soup.find_all('a', string='ARX_F Doc')[0]
complete_url = home_url + link.get('href')
print(complete_url)
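
If the link text is not known in advance, one alternative is to filter anchors on the href itself. The sketch below is only an illustration: it runs against a copy of the single table row quoted in Answer 1 rather than the live page, uses the built-in "html.parser" so that html5lib is not required, and the name doc_links is made up for the example.

```python
from bs4 import BeautifulSoup

# One row copied from the table output in Answer 1; not the full CDC page.
row_html = """
<table><tbody><tr>
  <td class="text-center">2009-2010</td>
  <td class="text-left">Arthritis Body Measures</td>
  <td class="text-center"><a href="/Nchs/Nhanes/2009-2010/ARX_F.htm">ARX_F Doc</a></td>
  <td class="text-center"><a href="/Nchs/Nhanes/2009-2010/ARX_F.XPT">ARX_F Data [XPT - 510.5 KB]</a></td>
  <td class="text-center">September, 2011</td>
</tr></tbody></table>
"""

soup = BeautifulSoup(row_html, "html.parser")

# Keep only anchors that have an href ending in .htm (the documentation links).
doc_links = [
    a["href"]
    for a in soup.find_all("a", href=True)
    if a["href"].lower().endswith(".htm")
]

print(doc_links)
```

Passing href=True skips anchors without an href attribute, so the list comprehension cannot raise a KeyError.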

Answer 3

Try this:

import urllib.request
from bs4 import BeautifulSoup
url = "https://wwwn.cdc.gov/nchs/nhanes/search/datapage.aspx?Component=Examination"
with urllib.request.urlopen(url) as page:
    html_source = page.read()
soup = BeautifulSoup(html_source, 'html5lib')

link = soup.find_all("span", class_="print-only")
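
Whether this query matches anything depends on the markup BeautifulSoup actually receives, not on whether the element is visible in the browser. The sketch below uses two made-up HTML snippets (neither is the CDC page) to show that a span hidden with CSS is still found when it is present in the source, while the same query returns an empty list when the span is absent. That absence would be expected if, for instance, the span were added by a script after the page loads, which is an assumption about this page and not something verified here. The helper name find_print_only is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; this is NOT the CDC page.
html_with_span = """
<table><tr><td>
  <a href="/example/doc.htm">Example Doc</a>
  <span class="print-only" style="display:none">/example/doc.htm</span>
</td></tr></table>
"""

html_without_span = """
<table><tr><td>
  <a href="/example/doc.htm">Example Doc</a>
</td></tr></table>
"""

def find_print_only(html):
    """Return every <span class="print-only"> in the given markup."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all("span", class_="print-only")

found = find_print_only(html_with_span)       # span exists in the source, hidden by CSS
missing = find_print_only(html_without_span)  # span never appears in the source

print(len(found), found[0].get_text(strip=True))
print(len(missing))
```

A quick check on the real page is to search the downloaded html_source for the string print-only before parsing; if it is not there, no BeautifulSoup query will find it.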

Unable to collect an attribute from a span element using BeautifulSoup
