python scrapy: extracting names with a CSS selector is not working

Date: 2016-11-01 09:09:13

Tags: python scrapy

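I am trying to use a CSS selector to extract college names from http://www.bschool.careers360.com/search/all/bangalore, but no data is extracted. I have set “ROBOTSTXT_OBEY = False”. My code after the change is below, but the result is still the same.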


The log is:

import scrapy

class BloreSpider(scrapy.Spider):
    name = 'blore'
    start_urls = ['http://www.engineering.careers360.com/search/college/bangalore']

    def parse(self, response):
        for quote in response.css('div.title'):
            yield {
                'author': quote.xpath('.//a/text()').extract_first(),
            }

        next_page = response.css('li.pager-next a::attr("href")').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

1 answer:

Answer 0 (score: 0)

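Your XPath must be relative to your quote node. In other words, you need to add a . before the //, so that the expression searches only inside the current node instead of the whole document.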

Try this:

def parse(self, response):
    for quote in response.css('div.title'):
        yield {
            #'author': quote.xpath('//a/text()').extract_first(),
            #                       ^
            'author': quote.xpath('.//a/text()').extract_first(),
        }

    next_page = response.css('li.pager-next a::attr("href")').extract_first()
    # if next_page is not None:
    if next_page:  # you can also just do this
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)

Edit: Looking at the log you provided, it seems you get a 404 when Scrapy tries to retrieve robots.txt. Try setting ROBOTSTXT_OBEY = False in settings.py.