How do Scrapy rules work with CrawlSpider?

Time: 2014-02-27 23:28:05

Tags: python regex web-crawler scrapy

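I'm having a hard time understanding Scrapy CrawlSpider rules. My example doesn't work the way I'd like it to, so it could be one of two things:

  1. I don't understand how rules work.
  2. I wrote incorrect regexes that prevent me from getting the results I need.

    OK, here is what I want to do: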

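    I want to write a CrawlSpider that collects all the available statistics from the http://www.euroleague.net website. The page that hosts all the information I need to start from is here

    Step 1

    The first step, as I see it, is to extract the "Seasons" links and follow them. Here is the HTML/href I want to match (I want to match all the links in the "Seasons" section one by one, but I think it is easier to use a single link as an example):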

    href="/main/results/by-date?seasoncode=E2001"
    

    Here is the rule/regex I created for it:

    Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True),
    

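    Step 2

    Once the spider has brought me to the page http://www.euroleague.net/main/results/by-date?seasoncode=E2001, the second step is to have it extract the links from the "Regular season" section. In this case, let's say it should be "Round 1". The HTML/href I'm looking for is: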

    <a href="/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS"
    

    The rule/regex I built for it is:

    Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True),
    

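    Step 3

    Now that I have reached the page (http://www.euroleague.net/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS), I'm ready to extract the links that lead to the pages containing all the information I need. The HTML/href I'm looking for is: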

    href="/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore"
    

    The rule/regex that has to be followed here is:

    Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'),
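
As a quick check outside Scrapy, the three patterns can be tried against the three hrefs quoted above with Python's re module. This is only a sketch: the names PATTERNS, HREFS and match_table are made up for the illustration, and raw strings are used in place of the original string literals.

```python
import re

# The three allow patterns from the rules, written as raw strings.
PATTERNS = {
    "step1": r'by-date\?seasoncode\=E\d+',
    "step2": r'seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',
    "step3": r'gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',
}

# The example hrefs quoted in steps 1 to 3.
HREFS = {
    "step1": "/main/results/by-date?seasoncode=E2001",
    "step2": "/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS",
    "step3": "/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore",
}


def match_table(patterns, hrefs):
    """Map each href name to the sorted names of the patterns that match it."""
    return {
        href_name: sorted(
            name for name, pattern in patterns.items() if re.search(pattern, href)
        )
        for href_name, href in hrefs.items()
    }


if __name__ == "__main__":
    for href_name, matched in match_table(PATTERNS, HREFS).items():
        print(href_name, "matched by", matched)
```

Each href is matched by the pattern written for it. The Step 2 href is also matched by the Step 1 pattern, because it contains by-date?seasoncode=E2001.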
    

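    Problem

    I think the crawler should work like this: the rules crawler is something like a loop. When the first link is matched, the crawler follows it to the "Step 2" page, then to "Step 3", and after that it extracts the data. Once that is done, it returns to "Step 1" to match the second link, and starts the loop again, until there are no more links left in Step 1.

    What I see in the terminal suggests that the crawler loops in "Step 1". It goes through all the "Step 1" links but never involves the "Step 2"/"Step 3" rules.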

    2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 00:20:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 00:20:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)
    

    After it has looped through all the "Seasons" links, it starts on links that I don't see in any of the three steps I described:

    http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013
    

    You can only find a link structure like this if you loop through all the links in "Step 2" without returning to the "Step 1" starting point.
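
One way to see why such a URL looks unfamiliar is to compare it with the three patterns and to decode its query string. The sketch below uses only the standard library; LOGGED_URL is copied from the log, and the helper names matching_patterns and query_params are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# URL copied from the crawl log.
LOGGED_URL = (
    "http://www.euroleague.net/main/results/by-date"
    "?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013"
)

# The three allow patterns from the rules, written as raw strings.
PATTERNS = [
    r'by-date\?seasoncode\=E\d+',
    r'seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',
    r'gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',
]


def matching_patterns(url):
    """Return the patterns that match somewhere in the URL."""
    return [pattern for pattern in PATTERNS if re.search(pattern, url)]


def query_params(url):
    """Return the decoded query parameters as (name, value) pairs, in URL order."""
    return parse_qsl(urlsplit(url).query)


if __name__ == "__main__":
    print(matching_patterns(LOGGED_URL))
    print(query_params(LOGGED_URL))
```

None of the three patterns matches this URL as logged: its parameters appear in a different order than in the hrefs quoted earlier, and the phasetypecode value carries trailing + characters, which decode to spaces and which \w+ does not match.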

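    The question is: how do rules work? Do they work step by step, as I intend them to in this example, or does each rule have its own loop, so that the crawler moves from rule to rule only after finishing the loop for the previous rule?

    That is how I see it. Of course, something could be wrong with my rules/regexes, and that is quite possible.

    Here is everything I get from the terminal: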

    scrapy crawl basketsp_test -o item6.xml -t xml
    2014-02-28 01:09:20+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: basketbase)
    2014-02-28 01:09:20+0200 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
    2014-02-28 01:09:20+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'basketbase.spiders', 'FEED_FORMAT': 'xml', 'SPIDER_MODULES': ['basketbase.spiders'], 'FEED_URI': 'item6.xml', 'BOT_NAME': 'basketbase'}
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled item pipelines: Basketpipeline3, Basketpipeline1db
    2014-02-28 01:09:21+0200 [basketsp_test] INFO: Spider opened
    2014-02-28 01:09:21+0200 [basketsp_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-02-28 01:09:21+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-02-28 01:09:21+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
    2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Filtered duplicate request: <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> - no more duplicates will be shown (see DUPEFILTER_CLASS)
    2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:25+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2005> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2006> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2007> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2008> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2009> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:28+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2010> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2011> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2012> (referer: http://www.euroleague.net/main/results/by-date)
    2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=24&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=22&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=21&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=20&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=19&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=18&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=17&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=16&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=15&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:36+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=14&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=13&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=12&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:38+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=11&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=10&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=9&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=8&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=7&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:41+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=6&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=5&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=4&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:43+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=3&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=2&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=1&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
    2014-02-28 01:09:44+0200 [basketsp_test] INFO: Closing spider (finished)
    2014-02-28 01:09:44+0200 [basketsp_test] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 13663,
         'downloader/request_count': 39,
         'downloader/request_method_count/GET': 39,
         'downloader/response_bytes': 527838,
         'downloader/response_count': 39,
         'downloader/response_status_count/200': 39,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 2, 27, 23, 9, 44, 569579),
         'log_count/DEBUG': 46,
         'log_count/INFO': 3,
         'request_depth_max': 2,
         'response_received_count': 39,
         'scheduler/dequeued': 39,
         'scheduler/dequeued/memory': 39,
         'scheduler/enqueued': 39,
         'scheduler/enqueued/memory': 39,
         'start_time': datetime.datetime(2014, 2, 27, 23, 9, 21, 111255)}
    2014-02-28 01:09:44+0200 [basketsp_test] INFO: Spider closed (finished)
    

    Here is the rules part of the crawler:

    class Basketspider(CrawlSpider):
        name = "basketsp_test"
        download_delay = 0.5
    
        allowed_domains = ["www.euroleague.net"]
        start_urls = ["http://www.euroleague.net/main/results/by-date"]
        rules = (
            Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True),
            Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True),
            Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'),
        )
    

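3 answers:

Answer 0 (score: 15)

You are right: according to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first one. You should keep this in mind when you write your rules. For example, take the following rules: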

rules = (
    Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)

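The second rule will never be applied, because all the links will be extracted by the first rule with the parse_item callback. The matches for the second rule will be filtered out as duplicates by scrapy.dupefilter.RFPDupeFilter. You should use deny to match the links correctly: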

rules = (
    # deny is an argument of the link extractor, not of Rule
    Rule(SgmlLinkExtractor(allow=(r'/items',), deny=(r'/items/electronics',)), callback='parse_item', follow=True),
    Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
)
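
The effect of deny can be illustrated without Scrapy. The sketch below is a simplified stand-in for rule matching, not Scrapy's implementation: route is a hypothetical helper that returns the callback of the first rule whose allow pattern matches and whose deny pattern does not, and the example.com URL is made up.

```python
import re


def route(url, rules):
    """Return the callback name of the first rule that accepts the URL.

    Each rule is an (allow, deny, callback) tuple; deny may be None.
    """
    for allow, deny, callback in rules:
        if re.search(allow, url) and not (deny and re.search(deny, url)):
            return callback
    return None


# Overlapping allow patterns: the first rule also accepts electronics links.
OVERLAPPING = [
    (r'/items', None, 'parse_item'),
    (r'/items/electronics', None, 'parse_electronic_item'),
]

# Same rules, but the first one now denies electronics links.
WITH_DENY = [
    (r'/items', r'/items/electronics', 'parse_item'),
    (r'/items/electronics', None, 'parse_electronic_item'),
]

if __name__ == "__main__":
    url = 'http://example.com/items/electronics/42'  # hypothetical URL
    print(route(url, OVERLAPPING))
    print(route(url, WITH_DENY))
```

With the overlapping rules the electronics URL is claimed by the first rule; once the first rule denies /items/electronics, the same URL reaches the second rule.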

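Answer 1 (score: 8)

If you are from China, I have a blog post in Chinese about this:

别再滥用scrapy CrawlSpider中的follow=True ("Stop abusing follow=True in Scrapy's CrawlSpider")

Let's look at how the rules work under the hood: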

def _requests_to_follow(self, response):
    seen = set()
    for n, rule in enumerate(self._rules):
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            yield r

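As you can see, whenever a link is followed, every rule extracts links from the response inside the for loop, and each extracted link is added to the seen set.

All the responses are then handled by self._response_downloaded: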

def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _parse_response(self, response, callback, cb_kwargs, follow=True):

    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

It goes back to self._requests_to_follow(response) again and again.
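
The loop described above can be imitated on a toy site with plain Python. This is a sketch under assumptions, not Scrapy's code: SITE is a hypothetical miniature link graph, crawl is a made-up helper, a FIFO queue stands in for Scrapy's scheduler (whose real ordering differs), and follow is set by hand to True for the rules without a callback and False for the rule with one, which mirrors Rule's documented default.

```python
import re
from collections import deque

# Hypothetical miniature site: page URL -> links found on that page.
SITE = {
    "/by-date": ["/by-date?seasoncode=E2000", "/by-date?seasoncode=E2001"],
    "/by-date?seasoncode=E2000": [
        "/by-date?seasoncode=E2000&gamenumber=1&phasetypecode=RS",
    ],
    "/by-date?seasoncode=E2001": [
        "/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS",
    ],
    "/by-date?seasoncode=E2000&gamenumber=1&phasetypecode=RS": [
        "/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2000",
    ],
    "/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS": [
        "/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001",
    ],
}

# (allow pattern, callback name or None, follow flag), in rule order.
RULES = [
    (r"by-date\?seasoncode=E\d+", None, True),
    (r"seasoncode=E\d+&gamenumber=\d+&phasetypecode=\w+", None, True),
    (r"gamenumber=\d+&phasetypecode=\w+&gamecode=\d+&seasoncode=E\d+", "parse_item", False),
]


def crawl(start):
    """Return (crawl order, list of (url, callback)) for the toy site."""
    queue = deque([(start, None, True)])
    scheduled = {start}  # stands in for the duplicate filter
    order, callbacks = [], []
    while queue:
        url, callback, follow = queue.popleft()
        order.append(url)
        if callback:
            callbacks.append((url, callback))
        if not follow:
            continue
        seen = set()  # per-response set, as in _requests_to_follow
        for pattern, rule_callback, rule_follow in RULES:
            for link in SITE.get(url, []):
                if link in seen or not re.search(pattern, link):
                    continue
                seen.add(link)
                if link not in scheduled:
                    scheduled.add(link)
                    queue.append((link, rule_callback, rule_follow))
    return order, callbacks


if __name__ == "__main__":
    order, callbacks = crawl("/by-date")
    for url in order:
        print(url)
    print(callbacks)
```

With this queue, both season pages are requested before any round page, and parse_item only runs for the game pages reached through the third rule. Every followed response is passed through all the rules; the rules are not executed as consecutive steps.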


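Answer 2 (score: 6)

I would be tempted to use a BaseSpider scraper instead of a crawler. With a BaseSpider you get a more deliberate flow of request routes, instead of finding every href on the page and visiting it according to global rules. Use yield Request() to keep looping through the parent sets of links, with callbacks passing the output object all the way to the end.

From your description: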

  

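I think the crawler should work like this: the rules crawler is something like a loop. When the first link is matched, the crawler follows it to the "Step 2" page, then to "Step 3", and after that it extracts the data. Once that is done, it returns to "Step 1" to match the second link, and starts the loop again, until there are no more links left in Step 1.

A request-callback stack like this would suit you very well, since you know the order of the pages and which pages you need to scrape. It also has the added benefit that you can collect information across multiple pages before returning the output object for processing.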

import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Basketspider(BaseSpider):
    name = "basketsp_test"
    download_delay = 0.5

    def start_requests(self):
        # WhateverYourOutputItemIs is a placeholder for your own Item class
        item = WhateverYourOutputItemIs()
        yield Request("http://www.euroleague.net/main/results/by-date", callback=self.parseSeasonsLinks, meta={'item': item})

    def parseSeasonsLinks(self, response):
        item = response.meta['item']

        hxs = HtmlXPathSelector(response)
        html = hxs.extract()
        roundLinkList = list()

        roundLinkPattern = re.compile(r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')

        # Collect each round link once, preserving page order
        for roundLink in re.findall(roundLinkPattern, html):
            if roundLink not in roundLinkList:
                roundLinkList.append(roundLink)

        for roundLink in roundLinkList:
            # if you want to output this info in the final item
            item['RoundLink'] = roundLink

            # Generate a new request for the round page
            yield Request(roundLink, callback=self.parseRoundPage, meta={'item': item})

    def parseRoundPage(self, response):
        item = response.meta['item']
        # Do whatever you need to do here: yield more requests if needed, or return the item

        item['Thing'] = 'infoOnPage'
        # ...

        return item
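
The request-callback pattern itself can be tried without Scrapy. In the sketch below everything is a stand-in: Request is a namedtuple rather than Scrapy's Request class, PAGES is a hypothetical site, and run is a made-up driver loop. Each request gets its own copy of the item, so that data collected on one page travels with that request only.

```python
from collections import namedtuple

# Stand-in for a request object; not scrapy.http.Request.
Request = namedtuple("Request", "url callback meta")

# Hypothetical site: a listing page with round links, and two round pages.
PAGES = {
    "/seasons": ["/round/1", "/round/2"],
    "/round/1": "box score 1",
    "/round/2": "box score 2",
}


def parse_seasons(url, meta):
    """First callback: yield one request per round link, each with its own item."""
    for round_url in PAGES[url]:
        item = dict(meta["item"], RoundLink=round_url)  # copy, then extend
        yield Request(round_url, parse_round, {"item": item})


def parse_round(url, meta):
    """Second callback: finish the item that came along with the request."""
    item = meta["item"]
    item["Thing"] = PAGES[url]
    yield item


def run(start):
    """Minimal driver: call each request's callback, queue requests, collect items."""
    pending = [start]
    items = []
    while pending:
        request = pending.pop()
        for result in request.callback(request.url, request.meta):
            if isinstance(result, Request):
                pending.append(result)
            else:
                items.append(result)
    return items


if __name__ == "__main__":
    for item in run(Request("/seasons", parse_seasons, {"item": {}})):
        print(item)
```

Because the item is copied before each request is created, the two finished items keep different RoundLink values; sharing a single mutable item between all the requests would leave every request pointing at the same object.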