Question

I am trying to use the Rule class to go to the next page in my crawler. Here is my code:

from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from crawler.items import GDReview


class GdSpider(CrawlSpider):
    name = "gd"
    allowed_domains = ["glassdoor.com"]
    start_urls = [
        "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"
    ]

    rules = (

        # Extract next links and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(restrict_xpaths=('//li[@class="next"]/a/@href',)), follow= True)
    )


    def parse(self, response):
        company_name = response.xpath('//*[@id="EIHdrModule"]/div[3]/div[2]/p/text()').extract()

        '''loop over every review in this page'''
        for sel in response.xpath('//*[@id="EmployerReviews"]/ol/li'):
            review = GDReview()
            review['company_name'] = company_name
            review['id'] = str(sel.xpath('@id').extract()[0]).split('_')[1] #sel.xpath('@id/text()').extract()
            review['body'] = sel.xpath('div/div[3]/div/div[2]/p/text()').extract()
            review['date'] = sel.xpath('div/div[1]/div/time/text()').extract()
            review['summary'] = sel.xpath('div/div[2]/div/div[2]/h2/tt/a/span/text()').extract()

            yield review

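My question is about the rules section. With this rule, the extracted links do not contain the domain name. For example, it returns something like "/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm".

How can I make sure my crawler prepends the domain to the returned links?

Thanks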

Answer 1

You can be sure it does: resolving relative links into absolute URLs is the default behavior of link extractors in Scrapy (source code).
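To make that default behavior concrete, here is a minimal sketch of the resolution step using only the standard library's urljoin. It is not Scrapy's actual code, it is written for Python 3, and the relative href value is a made-up example rather than something taken from the real page.

```python
from urllib.parse import urljoin

# URL of the page being parsed (taken from start_urls in the question).
page_url = "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"

# A relative href as it might appear in the "next" link (hypothetical value).
relative_href = "/Reviews/Johnson-and-Johnson-Reviews-E364_P2.htm"

# Joining the relative href with the page URL gives the absolute URL to request.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
```

The scheme and domain come from the page URL, so the spider does not have to add them itself.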

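Also, the restrict_xpaths argument should not point to the @href attribute. It should point either to the a element itself or to a container that has a elements as descendants. In addition, restrict_xpaths can be given as a plain string.

In other words, replace: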

restrict_xpaths=('//li[@class="next"]/a/@href',)

with:

restrict_xpaths='//li[@class="next"]/a'
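The reason is that restrict_xpaths selects regions of the page, and the extractor then looks for a elements inside those regions and reads their href. A rough sketch of that idea follows, using the standard library's ElementTree as a stand-in for Scrapy's lxml-based selection. The markup is a simplified, hypothetical version of the pagination block, and extract_links is an illustrative helper, not a Scrapy function.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

PAGE_URL = "http://www.glassdoor.com/Reviews/Johnson-and-Johnson-Reviews-E364_P1.htm"

# Hypothetical, simplified pagination markup; the real page may differ.
SNIPPET = """
<ul>
  <li class="prev"><a href="/Reviews/Johnson-and-Johnson-Reviews-E364.htm">Previous</a></li>
  <li class="next"><a href="/Reviews/Johnson-and-Johnson-Reviews-E364_P2.htm">Next</a></li>
</ul>
"""


def extract_links(markup, region_path, base_url):
    """Collect absolute URLs of <a> elements found inside the selected regions."""
    root = ET.fromstring(markup)
    links = []
    for region in root.findall(region_path):
        # iter("a") yields the region itself if it is an <a>, plus any <a> descendants.
        for anchor in region.iter("a"):
            href = anchor.get("href")
            if href:
                links.append(urljoin(base_url, href))
    return links


# Pointing at the <a> element and pointing at its <li> container give the same link.
next_links = extract_links(SNIPPET, ".//li[@class='next']/a", PAGE_URL)
container_links = extract_links(SNIPPET, ".//li[@class='next']", PAGE_URL)
print(next_links)
```

An expression ending in /@href selects attribute values rather than elements, so there is no region left in which to search for links.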

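You should also switch from SgmlLinkExtractor to LxmlLinkExtractor:

SGMLParser based link extractors are unmaintained and their usage is discouraged. It is recommended to migrate to LxmlLinkExtractor if you are still using SgmlLinkExtractor.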

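Personally, I usually use the LinkExtractor shortcut to LxmlLinkExtractor:

from scrapy.contrib.linkextractors import LinkExtractor

To summarize, this is what I would have in rules:

rules = [
    Rule(LinkExtractor(restrict_xpaths='//li[@class="next"]/a'), follow=True)
]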

How to use the Rule class in Scrapy
