Question

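I'm trying to learn Python and Scrapy, but I'm having trouble with CrawlSpider. The code below works for me: it takes all the links on the start URL that match the XPath //div[@class="info"]/h3/a/@href and passes them to the function parse_dir_contents.

What I need now is for the crawler to move on to the next page. I tried using rules and LinkExtractor, but I can't seem to get it working. I also tried using //a/@href as the XPath in the parse function, but then the links are not passed to parse_dir_contents. I think I'm missing something really simple. Any ideas?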

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=['restaurants?page=[1-2]']), callback="parse")
    ]

    def parse(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            ---extra items here---
            yield item

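Edit: here is the updated code, now with three functions, which scrapes 150 items. I think the problem is in my rules, but I've tried what I thought would work and I still get the same output.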

class ypSpider(CrawlSpider):
    name = "ypTest"
    download_delay = 2
    allowed_domains = ["yellowpages.com"]
    start_urls = ["http://www.yellowpages.com/new-york-ny/restaurants?page=1"]

    rules = [
        Rule(LinkExtractor(allow=[r'restaurants\?page\=[1-2]']), callback='parse')
    ]

    def parse(self, response):
        for href in response.xpath('//a/@href'):
            url = response.urljoin(href.extract())
            if 'restaurants?page=' in url:
                yield scrapy.Request(url, callback=self.parse_links)

    def parse_links(self, response):
        for href in response.xpath('//div[@class="info"]/h3/a/@href'):
            url = response.urljoin(href.extract())
            if 'mip' in url:
                yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        for sel in response.xpath('//div[@id="mip"]'):
            item = ypItem()
            item['url'] = response.url
            item['business'] = sel.xpath('//div/div/h1/text()').extract()
            item['phone'] = sel.xpath('//div/div/section/div/div[2]/p[3]/text()').extract()
            item['street'] = sel.xpath('//div/div/section/div/div[2]/p[1]/text()').re(r'(.+)\,')
            item['city'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(.+)\,')
            item['state'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'\,\s(.+)\s\d')
            item['zip'] = sel.xpath('//div/div/section/div/div[2]/p[2]/text()').re(r'(\d+)')
            item['category'] = sel.xpath('//dd[@class="categories"]/span/a/text()').extract()
            yield item

Answer 1

CrawlSpider uses the parse method for its own purposes. Rename your parse() to something else, change the callback in rules[] to match, and try again.
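To see why the rename matters, here is a toy model in plain Python. MiniCrawlSpider, BrokenSpider and FixedSpider are hypothetical names written for this illustration and are not Scrapy's actual implementation; the sketch only mimics the idea that the base class routes every response through parse() in order to apply the rules.

```python
class MiniCrawlSpider:
    """Hypothetical stand-in for CrawlSpider; not Scrapy's real code."""

    rules = ()  # (url_fragment, callback_name) pairs

    def parse(self, url):
        # The base class owns parse(): it dispatches to the rule callbacks.
        results = []
        for fragment, callback_name in self.rules:
            if fragment in url:
                results.extend(getattr(self, callback_name)(url))
        return results


class BrokenSpider(MiniCrawlSpider):
    rules = (("page=", "parse"),)

    def parse(self, url):
        # Overriding parse() replaces the dispatch logic above,
        # so the rules are never consulted.
        return ["custom parse only"]


class FixedSpider(MiniCrawlSpider):
    rules = (("page=", "parse_links"),)

    def parse_links(self, url):
        # Renamed callback: the inherited parse() still applies the rules.
        return ["rule matched, parse_links handled " + url]


broken = BrokenSpider().parse("restaurants?page=2")
fixed = FixedSpider().parse("restaurants?page=2")
print(broken)
print(fixed)
```

In the broken variant the rule table is dead weight, which is roughly what happens when a real CrawlSpider subclass defines its own parse().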

Answer 2

I know it's very late to answer this question, but I managed to solve it and am posting the answer because it may help someone who, like me, was confused at first about how to use Scrapy's Rule and LinkExtractor.

Here is my working code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ypSpider(CrawlSpider):
    name = "ypTest"
    allowed_domains = ["yellowpages.com"]
    start_urls = ['http://www.yellowpages.com/new-york-ny/restaurants']

    rules = (
        # Scrapes all the pagination links
        Rule(LinkExtractor(allow=[r'restaurants\?page=\d+']), follow=True),
        # Scrapes all the restaurant detail links and uses `parse_item` as the callback method
        Rule(LinkExtractor(restrict_xpaths="//div[@class='scrollable-pane']//a[@class='business-name']"), callback='parse_item'),
    )

    def parse_item(self, response):
        yield {'url': response.url}

So, I managed to understand how Rule and LinkExtractor work in this scenario.

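The first Rule entry scrapes all the pagination links. The allow parameter of LinkExtractor takes a regex and passes on only the links that match it. Here, that means only links containing a pattern like restaurants\?page=\d+, where \d+ means one or more digits. The rule sets no callback of its own, so these pages go through the default parse method, which simply keeps following links. (In this case I could have used the restrict_xpaths parameter, rather than allow, to select only the links under a specific region of the HTML, but I used allow to understand how it works with a regex.)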
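The effect of escaping the question mark can be checked with Python's re module alone. This is a minimal sketch that assumes LinkExtractor applies each allow pattern as a regex search over the absolute URL; the sample URLs are illustrative (the page=15 one is made up), and the unescaped pattern is the one from the first attempt in the question.

```python
import re

# Pattern from the question's first attempt: the bare "?" makes the
# preceding "s" optional instead of matching a literal question mark.
unescaped = re.compile(r'restaurants?page=[1-2]')

# Pattern from this answer: "\?" matches a literal "?", "\d+" any page number.
escaped = re.compile(r'restaurants\?page=\d+')

urls = [
    "http://www.yellowpages.com/new-york-ny/restaurants?page=2",
    "http://www.yellowpages.com/new-york-ny/restaurants?page=15",  # illustrative
    "http://www.yellowpages.com/new-york-ny/restaurants",
]

unescaped_hits = [bool(unescaped.search(url)) for url in urls]
escaped_hits = [bool(escaped.search(url)) for url in urls]

for url, before, after in zip(urls, unescaped_hits, escaped_hits):
    print(url, "| unescaped:", before, "| escaped:", after)
```

The unescaped pattern matches none of these URLs, because it would need the text "restaurantspage=" or "restaurantpage=" with no "?" in between, while the escaped pattern accepts every paginated URL and skips the unpaginated one.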

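The second Rule gets all the restaurant detail links and parses them with the parse_item method. In this Rule, the restrict_xpaths parameter defines the regions of the response from which links should be extracted. Here we only look inside the div with class scrollable-pane, and only take the links with class business-name, because if you inspect the HTML you will find more than one link to the same restaurant, with different query parameters, inside the same div. Finally, we pass parse_item as the callback method.

Now, when I run this spider, it fetches the details of all the restaurants (restaurants in New York, NY), 3030 in total in this case.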

Q: Scrapy: not scraping the next page, although the crawler appears to be following links
