Extracting data from a table with Scrapy

Asked: 2015-07-07 09:21:06

Tags: web-crawler scrapy

I am extracting data from this site: http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907&exportType=csv

I am using Scrapy to extract the GDP data it contains. My code is as follows:

from scrapy import Spider
from scrapy.selector import Selector


class DmozSpider(Spider):
    name = "gdp"
    allowed_domains = ["singstat.gov.sg"]
    start_urls = [
        "http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907&exportType=csv",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]')
        items = []

        for site in sites:
            item = Website()  # Website is the project's Item subclass
            item['value'] = site.xpath('td[@class="TextRowRight divResize"]/text()').extract()
            items.append(item)

        return items

I start at tr class="RowOdd", move through td class="TextRow1 divResize", and work my way down to the innermost tag. I expect to end up with the GDP value 654.8, but nothing is written to the file.

Please help, thanks!

1 Answer

Answer 0 (score: 1)

The data is actually fetched via a form request.

Here is a demo from the scrapy shell:

In [1]: from scrapy.http import FormRequest 

In [2]: url = 'http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907'

In [3]: form_data = {
'struts.token.name':'token',
'token':'1Z6NS7BWFOD0ARQH160MLHZ1FZF5ZKBB',
'titleId':'4646',
'subjectName':'',
'topicName':'',
'titleName':'M013751 - Gross Domestic Product At Current Market Prices, By Industry, Quarterly',
'glosDiv':'',
'groupVarIds':'1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;',
'selectedVarIds': ['93204','93350','93180','93376','93158','93272','93087','92926','93275','93144','93248','93170','92946','93065','93136','93283','93296'],
'selectedTimeIds': ['702','703','704','705','706','707','708','709','710','711','712','713','714','715','716','717','718','719','720','721','722','723','724','725','726','727','728','729','730','731','732','733','734','735','736','737','738','739','740','741','742','743','744','745','746','747','748','749','750','751','752','753','754','755','756','757','758','759','760','761','762','763','764','765','211','212','213','214','187','188','189','190','191','192','193','194','195','196','197','198','199','200','201','202','203','204','205','206','207','208','209','210','104','105','106','107','108','109','110','111','112','113','114','115','116','117','118','119','120','121','122','123','124','125','126','127','128','129','130','131','132','133','134','135','136','137','138','139','140','141','142','143','144','103','21','22','23','24','25','26','27','28','29','30','31','32','33','34','1','2','3','4','5','6','17','175','179','184','217','222','943'],
}

In [4]: request_object = FormRequest(url=url, formdata=form_data)

In [5]: fetch(request_object) 
2015-07-07 17:06:24+0530 [default] INFO: Spider opened
2015-07-07 17:06:26+0530 [default] DEBUG: Crawled (200) <POST http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f77a9293250>
[s]   item       {}
[s]   request_object <POST http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907>
[s]   request    <POST http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907>
[s]   response   <200 http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907>
[s]   settings   <scrapy.settings.Settings object at 0x7f77b11e8450>
[s]   spider     <Spider 'default' at 0x7f77a6274b50>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [6]: sites = response.xpath('//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]')   

In [7]: sites
Out[7]: 
[<Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>,
 <Selector xpath='//tr[@class="RowOdd"]/td[@class="TextRow1 divResize"]/div[@class="varNameClass"]' data=u'<div class="varNameClass" style="display'>]

A sample extraction would be:

table_headers = response.xpath('//tr[@class="BackGroundTRHeader"]/th/a/text()').extract()
table_headers = [i.strip() for i in table_headers if i.strip()][1:]
values = response.xpath('//tr[td[div[@class="varNameClass" and contains(text(), "Gross Domestic Product At Current Market Prices")]]]/td[@class="TextRowRight divResize"]')
gdp_values = []
for value in values:
    data = value.xpath('.//text()').extract()
    data = data[0].strip() if data else ''
    gdp_values.append(data)

processed_data = dict(zip(table_headers, gdp_values))

If you print the processed data, you get:

{u'1975 1Q': u'3,209.2',
 u'1975 2Q': u'3,306.9',
 u'1975 3Q': u'3,519.1',
 u'1975 4Q': u'3,692.8',
 u'1976 1Q': u'3,544.8',
 u'1976 2Q': u'3,619.9',
 u'1976 3Q': u'3,846.2',
 u'1976 4Q': u'3,991',
 u'1977 1Q': u'3,872.8',
 u'1977 2Q': u'3,962.4',
 u'1977 3Q': u'4,171.8',
 u'1977 4Q': u'4,343.6',
 u'1978 1Q': u'4,283.5',
 u'1978 2Q': u'4,376.2',
 u'1978 3Q': u'4,694.9',
 u'1978 4Q': u'4,979.5',
 u'1979 1Q': u'4,767.3',
 u'1979 2Q': u'5,060.3',
 u'1979 3Q': u'5,459.9',
 u'1979 4Q': u'5,848.5',
 u'1980 1Q': u'6,015.7',
 u'1980 2Q': u'6,245.1',
 u'1980 3Q': u'6,616.4',
 u'1980 4Q': u'6,986.2',
 u'1981 1Q': u'6,972.6',
 u'1981 2Q': u'7,296.5',
 u'1981 3Q': u'7,843.4',
 u'1981 4Q': u'8,232.2',
 u'1982 1Q': u'8,062.2',
 u'1982 2Q': u'8,196.3',
 u'1982 3Q': u'8,613.8',
 u'1982 4Q': u'9,097.5',
 u'1983 1Q': u'8,811.7',
 u'1983 2Q': u'9,324.9',
 u'1983 3Q': u'9,792.7',
 u'1983 4Q': u'10,109.8',
 u'1984 1Q': u'9,925.5',
 u'1984 2Q': u'10,219.3',
 u'1984 3Q': u'10,661',
 u'1984 4Q': u'10,896.2',
 u'1985 1Q': u'10,430.6',
 u'1985 2Q': u'10,171.1',
 u'1985 3Q': u'10,210.2',
 u'1985 4Q': u'10,012',
 u'1986 1Q': u'9,757.9',
 u'1986 2Q': u'9,952',
 u'1986 3Q': u'10,279.7',
 u'1986 4Q': u'10,864.7',
 u'1987 1Q': u'10,418.7',
 u'1987 2Q': u'10,971.1',
 u'1987 3Q': u'11,712.3',
 u'1987 4Q': u'12,400.4',
 u'1988 1Q': u'12,092.2',
 u'1988 2Q': u'12,837.7',
 u'1988 3Q': u'13,785.3',
 u'1988 4Q': u'14,645',
 u'1989 1Q': u'13,934.7',
 u'1989 2Q': u'14,834.3',
 u'1989 3Q': u'15,768.8',
 u'1989 4Q': u'16,686.6',
 u'1990 1Q': u'16,678',
 u'1990 2Q': u'17,091',
 u'1990 3Q': u'17,932.5',
 u'1990 4Q': u'18,805.8',
 u'1991 1Q': u'18,529.5',
 u'1991 2Q': u'19,061.1',
 u'1991 3Q': u'20,153',
 u'1991 4Q': u'20,813.5',
 u'1992 1Q': u'19,964.2',
 u'1992 2Q': u'20,392.8',
 u'1992 3Q': u'21,688.7',
 u'1992 4Q': u'22,917.1',
 u'1993 1Q': u'22,574.9',
 u'1993 2Q': u'23,650.6',
 u'1993 3Q': u'24,994.8',
 u'1993 4Q': u'26,769.2',
 u'1994 1Q': u'26,326.5',
 u'1994 2Q': u'26,984.7',
 u'1994 3Q': u'29,076.8',
 u'1994 4Q': u'30,300.2',
 u'1995 1Q': u'28,961',
 u'1995 2Q': u'30,135.2',
 u'1995 3Q': u'32,098.4',
 u'1995 4Q': u'33,380.7',
 u'1996 1Q': u'32,815.7',
 u'1996 2Q': u'33,215.7',
 u'1996 3Q': u'33,946.3',
 u'1996 4Q': u'35,951.6',
 u'1997 1Q': u'34,811.2',
 u'1997 2Q': u'36,854.3',
 u'1997 3Q': u'38,262.1',
 u'1997 4Q': u'38,795.9',
 u'1998 1Q': u'36,001.1',
 u'1998 2Q': u'35,849.6',
 u'1998 3Q': u'35,866.2',
 u'1998 4Q': u'35,723.4',
 u'1999 1Q': u'33,922.3',
 u'1999 2Q': u'36,198.4',
 u'1999 3Q': u'37,460.5',
 u'1999 4Q': u'38,668.7',
 u'2000 1Q': u'38,179.9',
 u'2000 2Q': u'39,924.5',
 u'2000 3Q': u'42,863.4',
 u'2000 4Q': u'44,249.9',
 u'2001 1Q': u'41,206.6',
 u'2001 2Q': u'39,935.7',
 u'2001 3Q': u'39,229.4',
 u'2001 4Q': u'39,602.4',
 u'2002 1Q': u'40,269.8',
 u'2002 2Q': u'41,052.3',
 u'2002 3Q': u'41,191.1',
 u'2002 4Q': u'42,116.7',
 u'2003 1Q': u'41,823.5',
 u'2003 2Q': u'39,845.9',
 u'2003 3Q': u'42,305.1',
 u'2003 4Q': u'45,021.3',
 u'2004 1Q': u'46,244.9',
 u'2004 2Q': u'46,348.7',
 u'2004 3Q': u'48,859.6',
 u'2004 4Q': u'51,548.3',
 u'2005 1Q': u'50,312.4',
 u'2005 2Q': u'51,013.6',
 u'2005 3Q': u'53,566.3',
 u'2005 4Q': u'57,181.7',
 u'2006 1Q': u'55,475',
 u'2006 2Q': u'56,931.4',
 u'2006 3Q': u'59,358.1',
 u'2006 4Q': u'63,070.5',
 u'2007 1Q': u'63,175.9',
 u'2007 2Q': u'66,580.6',
 u'2007 3Q': u'69,656.2',
 u'2007 4Q': u'71,837.1',
 u'2008 1Q': u'68,248.9',
 u'2008 2Q': u'67,898',
 u'2008 3Q': u'69,547.8',
 u'2008 4Q': u'66,285.7',
 u'2009 1Q': u'64,212.6',
 u'2009 2Q': u'67,670.2',
 u'2009 3Q': u'71,976.9',
 u'2009 4Q': u'75,998.3',
 u'2010 1Q': u'76,781.4',
 u'2010 2Q': u'80,025.8',
 u'2010 3Q': u'80,666.6',
 u'2010 4Q': u'84,887.3',
 u'2011 1Q': u'85,418.3',
 u'2011 2Q': u'85,495.4',
 u'2011 3Q': u'86,754.7',
 u'2011 4Q': u'88,685.1',
 u'2012 1Q': u'89,655.7',
 u'2012 2Q': u'90,819.3',
 u'2012 3Q': u'88,976.3',
 u'2012 4Q': u'92,881.2',
 u'2013 1Q': u'92,256.5',
 u'2013 2Q': u'93,680.9',
 u'2013 4Q': u'98,168.9',
 u'2014 1Q': u'96,368.3',
 u'2014 2Q': u'96,808.3',
 u'2014 3Q': u'97,382',
 u'2014 4Q': u'99,530.5',
 u'2015 1Q': u'98,743.3'}
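Note that the values come back as strings with thousands separators. If you need numeric data, a small helper can convert them (a sketch; `to_float` is just an illustrative name, not part of the original answer):

```python
def to_float(value):
    """Convert a value like '3,209.2' to the float 3209.2."""
    return float(value.replace(',', ''))

# Example with a slice of the data above:
sample = {'1975 1Q': '3,209.2', '2015 1Q': '98,743.3'}
numeric = {quarter: to_float(v) for quarter, v in sample.items()}
print(numeric)  # {'1975 1Q': 3209.2, '2015 1Q': 98743.3}
```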

EDIT

The final code would look like this:

import scrapy
from scrapy import Spider


class DmozSpider(Spider):
    name = "gdp"
    allowed_domains = ["singstat.gov.sg"]
    url = 'http://www.tablebuilder.singstat.gov.sg/publicfacing/createDataTable.action?refId=1907'

    def start_requests(self):
        form_data = {
                    'struts.token.name': 'token',
                    'token': '1Z6NS7BWFOD0ARQH160MLHZ1FZF5ZKBB',
                    'titleId': '4646',
                    'subjectName': '',
                    'topicName': '',
                    'titleName': 'M013751 - Gross Domestic Product At Current Market Prices, By Industry, Quarterly',
                    'glosDiv': '',
                    'groupVarIds': '1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;',
                    'selectedVarIds': ['93204', '93350', '93180', '93376', '93158', '93272', '93087', '92926', '93275', '93144', '93248', '93170', '92946', '93065', '93136', '93283', '93296'],
                    'selectedTimeIds': ['179', '184', '217', '222'],
                    }
        return [scrapy.FormRequest(self.url,
                                   formdata=form_data,
                                   callback=self.parse_data)]

    def parse_data(self, response):
        table_headers = response.xpath('//tr[@class="BackGroundTRHeader"]/th/a/text()').extract()
        table_headers = [i.strip() for i in table_headers if i.strip()][1:]
        values = response.xpath('//tr[td[div[@class="varNameClass" and contains(text(), "Gross Domestic Product At Current Market Prices")]]]/td[@class="TextRowRight divResize"]')
        gdp_values = []
        for value in values:
            data = value.xpath('.//text()').extract()
            data = data[0].strip() if data else ''
            gdp_values.append(data)
        processed_data = dict(zip(table_headers, gdp_values))
        return processed_data

Instead of the usual scrapy Request, use a FormRequest, with formdata trimmed to scrape only 2014 Q1 through Q4. Make sure you update the token in formdata with a fresh value taken from your browser.
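Since the struts token expires, hard-coding it will eventually break. One option (a sketch, untested against this site) is to first GET the page and pull the current token out of the HTML before building the FormRequest; the regex below assumes the token is rendered as a hidden `<input name="token" value="...">`, so inspect the actual page markup first. Scrapy's `FormRequest.from_response` can also pick up hidden form fields for you.

```python
import re

def extract_struts_token(html):
    # Hypothetical helper: find the hidden struts token in the page HTML.
    # Assumes markup like <input type="hidden" name="token" value="..."/>.
    match = re.search(r'name=["\']token["\']\s+value=["\']([^"\']+)', html)
    return match.group(1) if match else None

# Example with a fragment shaped like the token field above:
fragment = '<input type="hidden" name="token" value="1Z6NS7BWFOD0ARQH160MLHZ1FZF5ZKBB"/>'
print(extract_struts_token(fragment))  # 1Z6NS7BWFOD0ARQH160MLHZ1FZF5ZKBB
```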