How do I fix CrawlSpider redirects?

Date: 2013-11-05 21:01:43

Tags: python scrapy web-crawler

I am trying to write a CrawlSpider for this site: http://www.shams-stores.com/shop/index.php. Here is my code:

import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from project.items import Product
import re



class ShamsStoresSpider(CrawlSpider):
    name = "shamsstores2"
    domain_name = "shams-stores.com"
    CONCURRENT_REQUESTS = 1

    start_urls = ["http://www.shams-stores.com/shop/index.php"]

    rules = (
            #categories
            Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a'), unique=False), callback='process', follow=True),
            )

    def process(self,response):
        print response

This is the output I get when I run scrapy crawl shamsstores2:

2013-11-05 22:56:36+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6081
2013-11-05 22:56:41+0200 [shamsstores2] DEBUG: Crawled (200) <GET http://www.shams-stores.com/shop/index.php> (referer: None)
2013-11-05 22:56:42+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=14&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=14&id_lang=1>
2013-11-05 22:56:42+0200 [shamsstores2] DEBUG: Filtered duplicate request: <GET http://www.shams-stores.com/shop/index.php?id_category=14&controller=category&id_lang=1> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=13&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=13&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=12&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=12&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=10&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=10&id_lang=1>
2013-11-05 22:56:43+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=9&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=9&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=8&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=8&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=7&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=7&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] DEBUG: Redirecting (301) to <GET http://www.shams-stores.com/shop/index.php?id_category=6&controller=category&id_lang=1> from <GET http://www.shams-stores.com/shop/index.php?controller=category&id_category=6&id_lang=1>
2013-11-05 22:56:44+0200 [shamsstores2] INFO: Closing spider (finished)

It hits the links extracted by the rule, those links get redirected elsewhere, and then the crawl stops without ever calling the callback process. I can work around this by using a BaseSpider, but can I fix it and still use CrawlSpider?

1 Answer:

Answer 0 (score: 1)

The problem is not the redirect. Scrapy follows it whenever the server points to an alternate location, and fetches the page from there.

The problem is that your restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a') extracts the same set of 8 URLs from every page it visits, and the duplicate filter drops them as duplicates.

P.S. The only thing I don't understand is why Scrapy yields output for only one page. I will update this if I find the reason.

Edit: see github.com/scrapy/scrapy/blob/master/scrapy/utils/request.py

First, the request is queued and its fingerprint is stored. Then the redirected URL is generated, and when Scrapy compares fingerprints to check whether it is a duplicate, it finds the same fingerprint already stored. It finds the same fingerprint because, as far as Scrapy's fingerprinting is concerned, the redirected URL and the original URL with its query string reordered are one and the same.
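
A minimal sketch of the collision, assuming the request_fingerprint() helper from scrapy.utils.request (the file referenced above): fingerprinting canonicalizes the URL, which sorts its query-string parameters, so both orderings from the log hash to the same value:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

original = Request("http://www.shams-stores.com/shop/index.php"
                   "?controller=category&id_category=14&id_lang=1")
redirected = Request("http://www.shams-stores.com/shop/index.php"
                     "?id_category=14&controller=category&id_lang=1")

# Both lines print the same hash, so the dupe filter drops the
# redirected request as already seen.
print request_fingerprint(original)
print request_fingerprint(redirected)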

A kind of "exploit" solution:

rules = (
    # categories
    Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@id="categories_block_left"]/div/ul/li/a')),
         callback='process', process_links='appendDummy', follow=True),
)

def process(self, response):
    print 'response is called'
    print response

def appendDummy(self, links):
    # The extracted links already carry a query string, so append the dummy
    # parameter with '&'; it only needs to make the URL (and therefore its
    # fingerprint) differ from the redirect target's.
    for link in links:
        link.url = link.url + "&dummy=true"
    return links

Because the server ignores the dummy parameter appended to the URL, we fool the fingerprinting into treating the original request and the redirected request as different.

Another solution is to reorder the query parameters yourself in the process_links callback (appendDummy in the example above), so that the links are requested in exactly the form the server redirects to and no 301 is issued in the first place.
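
A minimal sketch of that variant, assuming the server's preferred order is the one seen in the 301 targets above (id_category, then controller, then id_lang); the helper name reorderParams is made up, and you would hook it up with process_links='reorderParams' in the Rule:

import urllib
import urlparse

def reorderParams(self, links):
    # Assumed preferred order, copied from the redirect targets in the log.
    preferred = ['id_category', 'controller', 'id_lang']
    for link in links:
        parts = urlparse.urlparse(link.url)
        params = dict(urlparse.parse_qsl(parts.query))
        query = urllib.urlencode([(k, params[k]) for k in preferred if k in params])
        link.url = urlparse.urlunparse(parts._replace(query=query))
    return links

Since the reordered URL is requested directly, the server answers 200 instead of 301 and the callback runs.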

Yet another solution could be to override the fingerprint function so it distinguishes these kinds of URLs (I think that would be wrong in the general case, though it may be fine here), or to use a simple fingerprint based on the raw URL (suitable only for this case).
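
For the raw-URL variant, a minimal sketch assuming the RFPDupeFilter from scrapy.dupefilter (the module path of the time); the class name and settings path are placeholders:

from scrapy.dupefilter import RFPDupeFilter

class RawUrlDupeFilter(RFPDupeFilter):
    def request_fingerprint(self, request):
        # Key the filter on the exact URL text, so a redirect to a
        # reordered query string is no longer seen as a duplicate.
        return request.url

Enable it with DUPEFILTER_CLASS = 'project.dupefilters.RawUrlDupeFilter' in settings.py (a hypothetical module path).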

Let me know if one of these solutions works for you.

P.S. Scrapy's behaviour of treating the reordered and original URLs as the same is correct. What I don't understand is why the server redirects to a reordered query string in the first place.