Scrapy (Python): iterating over a 'next' page without multiple functions

Time: 2015-05-24 04:42:08

Tags: python scrapy

I am using Scrapy to grab stock data from Yahoo! Finance.

Sometimes I need to loop over several pages, as in this example, to get all of the stock data.

Previously (when I knew there would only be two pages), I used one function per page, like this:

def stocks_page_1(self, response):

    returns_page1 = []

    #Grabs data here...

    current_page = response.url
    next_page = current_page + "&z=66&y=66"
    yield Request(next_page, self.stocks_page_2, meta={'returns_page1': returns_page1})

def stocks_page_2(self, response):

    # Grab data again...

Now, rather than writing 19 or more functions, I am wondering if there is a way to loop through the iterations with a single function and grab all the data from every page available for a given stock.

Something like this:

        for x in range(30): # 30 was randomly selected
            current_page = response.url
            # Grabs Data
            # Check if there is a 'next' page:
            if response.xpath('//td[@align="right"]/a[@rel="next"]').extract() != ' ': 
                u = x * 66
                next_page = current_page + "&z=66&y={0}".format(u)
                # Go to the next page somehow within the function???
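A minimal sketch of that idea, assuming a single callback that yields a Request back to itself and carries the rows collected so far in a returns_pages meta key (the function name and meta key here are placeholders for illustration, not the code above):

    from urlparse import urljoin  # Python 2; use urllib.parse on Python 3
    from scrapy.http import Request

    def parse_stock_page(self, response):
        # rows gathered on earlier pages, if any
        returns_pages = response.meta.get('returns_pages', [])

        # ... grab this page's data and extend returns_pages here ...

        next_href = response.xpath(
            '//td[@align="right"]/a[@rel="next"]/@href').extract()
        if next_href:
            # there is a 'next' page: request it with the SAME callback
            yield Request(urljoin(response.url, next_href[0]),
                          self.parse_stock_page,
                          meta={'returns_pages': returns_pages})
        else:
            # last page reached: build and yield the item from returns_pages
            pass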

Updated code:

It works, but it only returns one page of data.

class DmozSpider(CrawlSpider):

    name = "dnot"
    allowed_domains = ["finance.yahoo.com", "http://eoddata.com/"]
    start_urls = ['http://finance.yahoo.com/q?s=CAT']
    rules = [
        Rule(LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]'),
             callback='stocks1',
             follow=True),
    ]

    def stocks1(self, response):
        returns = []
        rows = response.xpath('//table[@class="yfnc_datamodoutline1"]//table/tr')[1:]
        for row in rows:
            cells = row.xpath('.//td/text()').extract()
            try:
                values = cells[-1]
                try:
                    float(values)
                    returns.append(values)
                except ValueError:
                    continue
            except IndexError:
                continue

        unformatted_returns = response.meta.get('returns_pages')
        returns = [float(i) for i in returns]
        global required_amount_of_returns, counter
        if counter == 1 and "CAT" in response.url:
            required_amount_of_returns = len(returns)
        elif required_amount_of_returns == 0:
            raise CloseSpider("'Error with initiating required amount of returns'")

        counter += 1
        print counter

        # Iterator to calculate Rate of return 
        # ====================================
        if data_intervals == "m": 
            k = 12
        elif data_intervals == "w":
            k = 4
        else: 
            k = 30

        sub_returns_amount = required_amount_of_returns - k
        sub_returns = returns[:sub_returns_amount]
        rate_of_return = []

        if len(returns) == required_amount_of_returns or "CAT" in response.url:
            for number in sub_returns:
                numerator = number - returns[k]
                rate = numerator/returns[k]
                if rate == '': 
                    rate = 0
                rate_of_return.append(rate)
                k += 1

        item = Website()
        items = []
        item['url'] = response.url
        item['name'] = response.xpath('//div[@class="title"]/h2/text()').extract()
        item['avg_returns'] = numpy.average(rate_of_return)
        item['var_returns'] = numpy.cov(rate_of_return)
        item['sd_returns'] = numpy.std(rate_of_return)
        item['returns'] = returns
        item['rate_of_returns'] = rate_of_return
        item['exchange'] = response.xpath('//span[@class="rtq_exch"]/text()').extract()
        item['ind_sharpe'] = ((numpy.average(rate_of_return) - RFR) / numpy.std(rate_of_return))
        items.append(item)
        yield item

1 answer:

Answer 0 (score: 2):

You see, a parse callback is just a function that receives a response and returns or yields Items, Requests, or both. There is no problem at all in reusing these callbacks, so you can pass the same callback for every request.

Now, you could pass the current page info along with each Request, but I would instead leverage a CrawlSpider to crawl every page. It is really easy to start generating the spider from the command line:

scrapy genspider --template crawl finance finance.yahoo.com

Then write it like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

Scrapy 1.0 has deprecated the scrapy.contrib namespace for the modules above, but if you're stuck with 0.24, use scrapy.contrib.linkextractors and scrapy.contrib.spiders.
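On 0.24 that would be:

    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule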

from yfinance.items import YfinanceItem


class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )

The LinkExtractor will pick up the links in the response, but it can be limited with XPath (or CSS) and regular expressions. See the documentation for more.
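As a sketch, restricting the extractor with both an XPath and a regular expression might look like this (the allow pattern is only an assumed example for the Yahoo! paging URLs):

    LinkExtractor(restrict_xpaths='//td[@align="right"]/a[@rel="next"]',
                  allow=r'&y=\d+')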

The Rule will follow the links and call the callback on every response. follow=True will keep extracting links from every new response, but it can be limited by depth. See the documentation again.
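If you do want to cap that depth, it is controlled by a setting rather than a Rule argument; for instance, in the project's settings.py (the value 30 here is arbitrary):

    DEPTH_LIMIT = 30  # stop following 'next' links this many hops from the start URL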

    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])

Just yield the Items there, since the Requests for the next pages will be handled by the CrawlSpider's Rule.
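Assuming the spider lives in a project named yfinance, as the import above suggests, the crawl can then be run and the Items exported in one go (the output filename is just an example):

    scrapy crawl finance -o dates.json

The CrawlSpider keeps requesting each 'next' page for as long as the Rule finds a matching link, so the output contains the rows from every page.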