Best way to get 'hrefs' from a CSS selector in BeautifulSoup?

Asked: 2015-11-07 22:15:11

Tags: python css beautifulsoup

I'm writing a script that will eventually scrape the data for every census block in a given census block group. To do that, though, I first need to get the links to all of the block groups in a given tract. The tracts are defined by a list of URLs; each URL returns a page that lists the block groups under the CSS selector "div#rList3 a". When I run this code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

tracts = ['http://www.usa.com/NY023970800.html','http://www.usa.com/NY023970900.html',
       'http://www.usa.com/NY023970600.html','http://www.usa.com/NY023970700.html',
       'http://www.usa.com/NY023970500.html']

class Scrape:
    def scrapeTracts(self):
        for i in tracts:
            html = urlopen(i)
            soup = BeautifulSoup(html.read(), 'lxml')
            bgs = soup.select("div#rList3 a")
            print(bgs)

s = Scrape()
s.scrapeTracts()

This gives me output that looks like [<a href="/NY0239708001.html">NY0239708001</a>] (the real number of links has been trimmed for the length of this post). My question is: how can I get just the string that follows 'href', which in this case is /NY0239708001.html?

2 Answers:

Answer 0 (score: 2):

Each node has an attrs dictionary containing that node's attributes, including its CSS classes, or in this case the href.

hrefs = []
for bg in bgs:
    hrefs.append(bg.attrs['href'])
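
For example, plugged back into the question's scrapeTracts loop (a minimal sketch; the Scrape class and the "div#rList3 a" selector come from the question, while collecting one list of hrefs per tract page is an assumption about how you want the results grouped):

from bs4 import BeautifulSoup
from urllib.request import urlopen

tracts = ['http://www.usa.com/NY023970800.html']  # shortened list from the question

class Scrape:
    def scrapeTracts(self):
        all_hrefs = []  # one list of hrefs per tract page
        for url in tracts:
            html = urlopen(url)
            soup = BeautifulSoup(html.read(), 'lxml')
            hrefs = []
            for bg in soup.select("div#rList3 a"):
                hrefs.append(bg.attrs['href'])  # keep only the href attribute
            all_hrefs.append(hrefs)
        return all_hrefs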

Answer 1 (score: 2):

You can do this in one line, like so:

bgs = [i.attrs.get('href') for i in soup.select("div#rList3 a")]

Output:

['/NY0239708001.html']
['/NY0239709001.html', '/NY0239709002.html', '/NY0239709003.html', '/NY0239709004.html']
['/NY0239706001.html', '/NY0239706002.html', '/NY0239706003.html', '/NY0239706004.html']
['/NY0239707001.html', '/NY0239707002.html', '/NY0239707003.html', '/NY0239707004.html', '/NY0239707005.html']
['/NY0239705001.html', '/NY0239705002.html', '/NY0239705003.html', '/NY0239705004.html']
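
Since the stated goal is to then scrape each block-group page, those relative paths will need to be resolved back into full URLs. A minimal sketch using urllib.parse.urljoin, assuming the block-group pages live under the same http://www.usa.com/ base as the tract pages (an assumption; verify against the site):

from urllib.parse import urljoin

tract_url = 'http://www.usa.com/NY023970800.html'  # one of the tract pages from the question
bgs = ['/NY0239708001.html']  # example relative hrefs as printed above

# urljoin resolves each relative href against the page it was scraped from
block_group_urls = [urljoin(tract_url, href) for href in bgs]
print(block_group_urls)  # ['http://www.usa.com/NY0239708001.html']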