I am scraping stock data from Yahoo! Finance. My question is: using LinkExtractor, how can I combine all of the pricing data for a given stock when the data for each stock spans multiple pages?
class DmozSpider(CrawlSpider):
    name = "dnot"
    # allowed_domains entries are bare domain names, not full URLs
    allowed_domains = ["finance.yahoo.com", "eoddata.com"]
    start_urls = ['https://ca.finance.yahoo.com/q/hp?s=CAT&a=04&b=24&c=2005&d=04&e=24&f=2015&g=d']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]
This start_url has many pages of data, so I use the rule to follow the "next" link on each page. stocks1 then gathers the data from a given page:
def stocks1(self, response):
    returns_pages = []
    rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
    current_page = response.url
    for row in rows:
        cells = row.xpath('.//td/text()').extract()
        # an empty row would raise IndexError (not ValueError) on cells[-1]
        if not cells:
            continue
        value = cells[-1]
        try:
            float(value)  # keep only cells that parse as numbers
        except ValueError:
            continue
        returns_pages.append(value)
    # dont_filter is needed because this re-requests the page we are
    # already on, which the duplicate filter would otherwise drop
    yield Request(current_page, self.finalize_stock,
                  meta={'returns_pages': returns_pages}, dont_filter=True)
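A common alternative is to skip the Rule entirely, follow the "next" link manually from the callback, and carry the running list of returns from page to page in the request meta, emitting a single item only on the last page. As a minimal, framework-free sketch of the accumulation step, extract_returns below is a hypothetical helper that mirrors the cell-filtering loop in stocks1; in the spider you would pass its result along in meta={'returns_pages': returns} for as long as a next-page link exists:

```python
def extract_returns(rows, accumulated=None):
    """Append the last numeric cell of each row to the running list.

    rows is a list of cell-text lists (one per table row);
    accumulated is the list carried over from earlier pages, if any.
    """
    returns = list(accumulated or [])
    for cells in rows:
        if not cells:
            continue
        try:
            float(cells[-1])  # keep only cells that parse as numbers
        except ValueError:
            continue
        returns.append(cells[-1])
    return returns
```

On the final page (no "next" link), the accumulated list holds the returns from every page, so the average can be computed once for the whole stock.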
The data for each item is stored through another function:
def finalize_stock(self, response):
    returns_pages = response.meta.get('returns_pages')
    item = Website()
    item['returns'] = returns_pages
    # the values were scraped as strings, so convert before averaging
    item['avg_returns'] = numpy.average([float(v) for v in returns_pages])
    yield item
My question is: how can I compile the returns from multiple pages into a single item, so that I can store it through finalize_stock?
Answer 0 (score: 0)
Since Scrapy holds items in memory as they are scraped, you can do the same with an ItemPipeline component: every processed item passes through the pipeline class, where you can store the items in memory (in a list, a dict, or however you prefer).
When the spider finishes, you can export those items to a CSV/JSON file or to a database.
For an example, see the Scrapy documentation: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
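A minimal sketch of such a pipeline, assuming each item carries a 'symbol' and a 'returns' list of numeric strings (those keys are assumptions; adapt them to your Website item). Scrapy calls open_spider, process_item, and close_spider on any class registered in ITEM_PIPELINES, so the class itself needs no Scrapy imports:

```python
from collections import defaultdict

class ReturnsPipeline(object):
    def open_spider(self, spider):
        # one running list of returns per stock symbol
        self.returns = defaultdict(list)

    def process_item(self, item, spider):
        # each item contributes one page's worth of returns
        self.returns[item['symbol']].extend(
            float(v) for v in item['returns'])
        return item

    def close_spider(self, spider):
        # every page has been processed; compute per-stock averages,
        # ready to be written to CSV/JSON or a database
        self.averages = {
            symbol: sum(vals) / len(vals)
            for symbol, vals in self.returns.items()
        }
```

Registering it under ITEM_PIPELINES in settings.py makes Scrapy route every yielded item through process_item, so the per-stock totals build up across pages without any change to the spider's callbacks.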