Scrape all pages for a particular item in scrapy

Date: 2015-05-24 20:30:46

Tags: python scrapy

I am scraping stock data from Yahoo! Finance. My question is, using LinkExtractor, how can I combine all the pricing data for a given stock if there are multiple pages of data for each stock.

class DmozSpider(CrawlSpider):

    name = "dnot"
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ["finance.yahoo.com", "eoddata.com"]
    start_urls = ['https://ca.finance.yahoo.com/q/hp?s=CAT&a=04&b=24&c=2005&d=04&e=24&f=2015&g=d']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

This start_url has many pages of data, so I use the rule to follow each page. The stocks1 callback then gathers the data from a given page.

    def stocks1(self, response):

        returns_pages = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        current_page = response.url
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns_pages.append(values)
                except ValueError:
                    # last cell is not numeric, skip the row
                    continue
            except IndexError:
                # row has no cells, skip it
                continue

        yield Request(current_page, self.finalize_stock, meta={'returns_pages': returns_pages})

The data for each item is stored through another function:

def finalize_stock(self, response):

    returns_pages = response.meta.get('returns_pages')
    item = Website()
    item['returns'] = returns_pages
    # the collected values are strings, so convert before averaging
    item['avg_returns'] = numpy.average([float(r) for r in returns_pages])
    yield item

My question is how can I compile the returns from multiple pages for a single item, so that I can store it through finalize_stock?

1 answer:

Answer 0 (score: 0)

Since Scrapy keeps items in memory while it runs, you can do the same with an ItemPipeline component: every processed item passes through the pipeline class, and you can store the items there (in a list, a dictionary, or however you like).

When the crawl finishes, you can export those items to a CSV/JSON file or into a database.

For an example, see the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
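The pipeline approach described above can be sketched as follows. This is a minimal illustration, not code from the original question; the class name, output filename, and settings path are assumptions:

```python
import json


class AggregateReturnsPipeline:
    """Collects every processed item in memory while the spider runs,
    then exports them all to a JSON file when the spider closes."""

    def open_spider(self, spider):
        self.items = []  # in-memory store for all processed items

    def process_item(self, item, spider):
        self.items.append(dict(item))  # keep a plain-dict copy
        return item  # pass the item on to any later pipelines

    def close_spider(self, spider):
        # Export everything collected once the crawl finishes
        with open('returns.json', 'w') as f:
            json.dump(self.items, f)
```

The pipeline would then be enabled in settings.py (the module path here is hypothetical):

    ITEM_PIPELINES = {'myproject.pipelines.AggregateReturnsPipeline': 300}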