Question

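I searched through many topics but could not find an answer to my specific problem. I created a crawl spider for a website and it works perfectly. I then made a similar one to crawl a similar website, but this time I have a small issue. Down to business: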

My start URL is www.example.com. That page contains the links I want my spider to process:

www.example.com/locationA
www.example.com/locationB
www.example.com/locationC

...

Here is the issue: every time I enter the start URL, it automatically redirects to www.example.com/locationA, and the links my spider ends up processing are only

www.example.com/locationB
www.example.com/locationC ...

So my question is how to include www.example.com/locationA in the returned URLs. I even got log output like the following:

-2011-11-28 21:25:33+1300 [example.com] DEBUG: Redirecting (302) from http://www.example.com/

-2011-11-28 21:25:34+1300 [example.com] DEBUG: Redirecting (302) to (referer: None)

2011-11-28 21:25:37+1300 [example.com] DEBUG: Redirecting (302) to (referer: www.example.com/locationB)

Printed from parse_item: www.example.com/locationB

...

I think the problem might be related to the (referer: None) entry. Can anyone shed some light on this?

I have narrowed the problem down by changing the start URL to www.example.com/locationB. Since every page contains the list of all locations, this time my spider processed:

-www.example.com/locationA

-www.example.com/locationC ...

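In a nutshell, I am looking for a way to include the URL that is the same as the start URL (or that the start URL redirects to) in the list that the parse_item callback will process.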

Answer 1

For anyone else with the same problem: after a lot of searching, all you need to do is name your callback function parse_start_url.

For example:

rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=(
            '//*[contains(concat( " ", @class, " " ), concat( " ", "pagination-next", " " ))]//a',)), callback="parse_start_url", follow=True),
    )

Answer 2

Adding sample code, as suggested by mindcast:

I managed it using the following approach:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # replaces the older SgmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/A']

    rules = (
        Rule(LinkExtractor(restrict_xpaths=("//div[@id='tag_cloud']/a",)),
             callback="parse_items", follow=True),
    )

    def parse_start_url(self, response):
        # The response for the start URL (www.example.com/A) arrives here
        self.log('>>>>>>>> PARSE START URL: %s' % response)
        return self.parse_items(response)

    def parse_items(self, response):
        """Scrape data from links based on Crawl Rules"""
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)
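To see why delegating from parse_start_url brings the start page into the results, here is a toy model of the dispatch that does not use Scrapy at all. ToyCrawlSpider, FixedSpider, PAGES and the crawl method are hypothetical names invented for this illustration; treat it as a sketch of the idea under those assumptions, not as Scrapy's actual implementation.

```python
# Hypothetical site map: every page links to all three locations.
LOCATIONS = [
    "www.example.com/locationA",
    "www.example.com/locationB",
    "www.example.com/locationC",
]
PAGES = {url: list(LOCATIONS) for url in LOCATIONS}


class ToyCrawlSpider:
    """Toy stand-in for a crawl spider (assumption: not Scrapy's real code)."""

    def __init__(self, pages):
        self.pages = pages

    def parse_start_url(self, url):
        # Modelled default: the start page itself produces no items.
        return []

    def parse_item(self, url):
        return [url]

    def crawl(self, start_url):
        seen = {start_url}
        # The start response goes to parse_start_url, not to the rule callback.
        results = list(self.parse_start_url(start_url))
        queue = list(self.pages.get(start_url, []))
        while queue:
            url = queue.pop(0)
            if url in seen:
                continue  # already requested, so it is dropped as a duplicate
            seen.add(url)
            results.extend(self.parse_item(url))
            queue.extend(self.pages.get(url, []))
        return results


class FixedSpider(ToyCrawlSpider):
    def parse_start_url(self, url):
        # Delegate so the start page is handled like any other item page.
        return self.parse_item(url)


without_fix = ToyCrawlSpider(PAGES).crawl("www.example.com/locationA")
with_fix = FixedSpider(PAGES).crawl("www.example.com/locationA")
```

In this model, without_fix misses locationA because the start page is only ever seen by parse_start_url, and every later link to it is discarded as a duplicate. with_fix includes it because the override forwards the start response to parse_item.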

Answer 3

At first I thought there was a simple solution using start_requests(), like:

def start_requests(self):
    yield Request('START_URL_HERE', callback=self.parse_item)

But testing showed that when start_requests() is used instead of a start_urls list, the spider ignores the rules, because CrawlSpider.parse(response) is never called.

So, here is my solution:

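import itertools

# Import paths from the older Scrapy releases this answer was written for
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SomeSpider(CrawlSpider):
    # .... (other spider attributes)
    start_urls = ('YOUR_START_URL',)
    rules = (
        Rule(
            SgmlLinkExtractor(allow=(r'YOUR_REGEXP',)),
            follow=True,
            callback='parse_item',
        ),
    )

    def parse(self, response):
        # Run the normal rule handling, then also treat the start page as an item
        return itertools.chain(
            CrawlSpider.parse(self, response),
            self.parse_item(response))

    def parse_item(self, response):
        # Build the item from the response here, then yield it
        yield item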

Maybe there is a better way.
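The itertools.chain call is what lets a single parse method return both the rule-driven output and the item for the start page. The sketch below shows that mechanism with plain generators and no Scrapy. The follow_rules function, the dictionary-shaped response, and the tuple outputs are assumptions made up for the illustration.

```python
import itertools


def follow_rules(response):
    """Hypothetical stand-in for the rule handling: yields follow-up requests."""
    for link in response["links"]:
        yield ("request", link)


def parse_item(response):
    """Hypothetical stand-in for the item callback."""
    yield ("item", response["url"])


def parse(response):
    # chain() lazily yields everything from the first generator, then the second.
    return itertools.chain(follow_rules(response), parse_item(response))


start_page = {
    "url": "www.example.com/locationA",
    "links": ["www.example.com/locationB", "www.example.com/locationC"],
}
output = list(parse(start_page))
```

Here output holds the two follow-up requests first and the item for the start page last, which is the order in which the two generators are chained.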

Answer 4

A simple workaround is to add a rule specifically for the start_urls (in your case: http://example.com/locationA):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "examplespider"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/locationA']

    rules = (
        # Dedicated rule for the start URL
        Rule(LinkExtractor(allow=(r'locationA',)), callback='parse_item'),
        # Rule for the location links found through the pagination element
        Rule(LinkExtractor(allow=(r'location\.*?',), restrict_css=('.pagination-next',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log('>>>>>>>> PARSE ITEM FROM %s' % response.url)
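The allow arguments are regular expressions tested against each candidate URL. As a quick sanity check, the patterns can be tried with Python's re module before running the spider. The allowed helper and the sample URLs below are hypothetical, and the sketch assumes search-style matching (the pattern may match anywhere in the URL).

```python
import re

# The two patterns used in the rules, written as raw strings.
ALLOW_PATTERNS = [r"locationA", r"location\.*?"]


def allowed(url, patterns=ALLOW_PATTERNS):
    """Return the patterns that match anywhere in the URL (hypothetical helper)."""
    return [p for p in patterns if re.search(p, url)]


matches_a = allowed("http://example.com/locationA")
matches_b = allowed("http://example.com/locationB")
matches_other = allowed("http://example.com/about")
```

Under that assumption, the start URL satisfies both patterns, so what separates the two rules is the restrict_css filter on the second one rather than the regular expressions themselves.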

How to include the start URL in the "allow" rule of SgmlLinkExtractor using a Scrapy crawl spider
