I am trying to extract college names from http://www.bschool.careers360.com/search/all/bangalore using CSS selectors, but no data is extracted. I have set `ROBOTSTXT_OBEY = False`. After that change my code is as below, but the result is still the same.
The log is:
import scrapy

class BloreSpider(scrapy.Spider):
    name = 'blore'
    start_urls = ['http://www.engineering.careers360.com/search/college/bangalore']

    def parse(self, response):
        for quote in response.css('div.title'):
            yield {
                'author': quote.xpath('.//a/text()').extract_first(),
            }
        next_page = response.css('li.pager-next a::attr("href")').extract_first()
        if next_page:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
Answer 0 (score: 0)
Your XPath must be relative to the `quote` node; in other words, you need to add a `.` before the `//`. Try this:
def parse(self, response):
    for quote in response.css('div.title'):
        yield {
            # 'author': quote.xpath('//a/text()').extract_first(),
            #                        ^
            'author': quote.xpath('.//a/text()').extract_first(),
        }
    next_page = response.css('li.pager-next a::attr("href")').extract_first()
    # if next_page is not None:
    if next_page:  # you can also just do this
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Edit: looking at the log you provided, it seems you are getting a 404 when trying to retrieve robots.txt. Try setting `ROBOTSTXT_OBEY = False` in `settings.py`.
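For reference, the change is a single line in the project's generated `settings.py` (a sketch; the rest of the settings file is unchanged):

```python
# settings.py (generated by `scrapy startproject`)
# Skip fetching/obeying robots.txt, so the 404 on /robots.txt
# no longer blocks the crawl.
ROBOTSTXT_OBEY = False
```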