I am trying to retrieve the quotes, author names and tags from goodreads. I can scrape a single page with the following code:
import scrapy

class goodReadsSpider(scrapy.Spider):
    # identity
    name = 'goodreads'

    # requests
    def start_requests(self):
        url = 'https://www.goodreads.com/quotes?page=1'
        yield scrapy.Request(url=url, callback=self.parse)

    # response
    def parse(self, response):
        for quote in response.selector.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('.//div[@class = "quoteText"]/text()[1]').extract(),
                'author': quote.xpath('.//span[@class = "authorOrTitle"]').extract_first(),
                'tags': quote.xpath('.//div[@class="greyText smallText left"]/a/text()').extract()
            }
But when I try to make the same spider follow the pagination by adding the following code:
next_page = response.selector.xpath('//a[@class = "next_page"/@href').extract()
if next_page is not None:
    next_page_link = response.urljoin(next_page)
    yield scrapy.request(url=next_page_link, callback=self.parse)
I get the following error message:
2019-05-29 10:47:14 [scrapy.core.engine] INFO: Spider opened
2019-05-29 10:47:14 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-29 10:47:14 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2019-05-29 10:47:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.goodreads.com/robots.txt> (referer: None)
2019-05-29 10:47:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.goodreads.com/quotes?page=1> (referer: None)
2019-05-29 10:47:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.goodreads.com/quotes?page=1>
{'text': ['\n"Don\'t cry because it\'s over, smile because it happened."\n'], 'author': '\nDr. Seuss\n', 'tags': ['attributed-no-source', 'cry', 'crying', 'experience', 'happiness', 'joy', 'life', 'misattributed-dr-seuss', 'optimism', 'sadness', 'smile', 'smiling']}
2019-05-29 10:47:16 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.goodreads.com/quotes?page=1> (referer: None)
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\parsel\selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1586, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
lxml.etree.XPathEvalError: Invalid predicate

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\programdata\anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Zona\Documents\Visual\demo_project\demo_project\spiders\goodreads.py", line 23, in parse
    next_page = response.selector.xpath('//a[@class = "next_page"/@href').extract()
  File "c:\programdata\anaconda3\lib\site-packages\parsel\selector.py", line 242, in xpath
    six.reraise(ValueError, ValueError(msg), sys.exc_info()[2])
  File "c:\programdata\anaconda3\lib\site-packages\six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "c:\programdata\anaconda3\lib\site-packages\parsel\selector.py", line 238, in xpath
    **kwargs)
  File "src/lxml/etree.pyx", line 1586, in lxml.etree._Element.xpath
  File "src/lxml/xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__
  File "src/lxml/xpath.pxi", line 227, in lxml.etree._XPathEvaluatorBase._handle_result
ValueError: XPath error: Invalid predicate in //a[@class = "next_page"/@href
2019-05-29 10:47:16 [scrapy.core.engine] INFO: Closing spider (finished)
2019-05-29 10:47:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 621,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 29812,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 5, 29, 5, 47, 16, 767370),
 'item_scraped_count': 1,
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 9,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/ValueError': 1,
 'start_time': datetime.datetime(2019, 5, 29, 5, 47, 14, 108786)}
2019-05-29 10:47:16 [scrapy.core.engine] INFO: Spider closed (finished)
I am not sure whether the problem is with the xpath, because on the first attempt I get
'item_scraped_count': 30
but here it is 1, which means the spider is not even scraping the whole first page.
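One way to narrow this down is to test the expression outside Scrapy with parsel (the selector library that response.xpath delegates to). This minimal sketch uses a made-up HTML snippet and reproduces the same ValueError, which points at the XPath itself:

# Quick check with parsel; the HTML string here is just an illustrative stand-in
# for the real Goodreads page.
from parsel import Selector

sel = Selector(text='<a class="next_page" href="/quotes?page=2">Next</a>')
sel.xpath('//a[@class = "next_page"/@href')  # missing "]" -> ValueError: XPath error: Invalid predicate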
Answer 0 (score: 1)
You have two problems to fix before the next-page link will work. In addition to what @pako pointed out, you should have used .extract_first() or .get() to grab the first item of the array. The corrected expression should look more like .xpath('//a[@class="next_page"]/@href').get(). I've also rewritten some of the xpaths to get rid of the whitespace in the output.
import scrapy

class goodReadsSpider(scrapy.Spider):
    name = 'goodreads'
    start_urls = ['https://www.goodreads.com/quotes?page=1']

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield {
                'text': quote.xpath('normalize-space(.//div[@class="quoteText"]/text())').getall(),
                'author': quote.xpath('normalize-space(.//span[@class="authorOrTitle"]/text())').get(),
                'tags': quote.xpath('.//div[contains(@class,"greyText")]/a/text()').getall()
            }

        next_page = response.xpath('//a[@class="next_page"]/@href').get()
        if next_page:
            nlink = response.urljoin(next_page)
            yield scrapy.Request(url=nlink, callback=self.parse)
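To try it, running scrapy crawl goodreads -o quotes.json from the project directory is enough. The sketch below is an alternative standalone runner; the output file name and the older FEED_* settings are illustrative choices, not requirements:

# Standalone runner (sketch), roughly equivalent to `scrapy crawl goodreads -o quotes.json`.
# FEED_FORMAT/FEED_URI are the classic feed-export settings; newer Scrapy versions
# prefer the FEEDS dict, so adjust to whichever your version supports.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',      # export scraped items as JSON
    'FEED_URI': 'quotes.json',  # output file name (arbitrary choice)
})
process.crawl(goodReadsSpider)  # the spider class defined above
process.start()                 # blocks until the crawl is finished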