Hi all,
I am writing a Scrapy spider; this is my earlier question about it: Scrapy: AttributeError: 'YourCrawler' object has no attribute 'parse_following_urls'.
Now I have another problem: it won't go to the next page:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["https://example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
It works and saves the data from the links on the first page, but when it tries to go to the next page it fails without any error. Here is the log:
…
2017-07-08 17:17:25 [scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com/book/?id=9780143039617>
{'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&editorial=Penguin%20Books">Penguin Books</a></li>', u'<li>P\xe1ginas: 363</li>', u'<li>A\xf1o: 2206</li>', u'<li>Precio: 14.50 \u20ac</li>', u'<li>EAN: 9780143039617</li>']}
2017-07-08 17:17:25 [scrapy.core.engine] INFO: Closing spider (finished)
2017-07-08 17:17:25 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: bookstore_2.json
2017-07-08 17:17:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
I used this next-page part in my first spider and it was working there. Any idea why this is happening?
Answer 0 (score: 1)
Your pagination logic should go at the end of the parse method, not the parse_following_urls method, because the pagination links are on the listing page, not on the book detail pages. Also, I had to remove the scheme from allowed_domains. Finally, note that since you import Request from scrapy.http but never import the scrapy module itself, you have to yield Request, not scrapy.Request, at the end of the parse method. The spider would look like this (a shorter response.follow variant is sketched after the code):
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore_2"
    start_urls = [
        'https://lacentral.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["lacentral.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Return back and go to next page in div#paginat ul li.next a::attr(href) and begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }
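As a side note, if the project runs on Scrapy 1.4 or newer (an assumption; the post does not say which version is installed), the same pagination can be written with response.follow, which accepts the relative hrefs directly, so the urljoin calls are no longer needed. A minimal sketch of only the parse method, meant to drop into the spider above in place of the one shown:

    def parse(self, response):
        # follow each book detail link found on the listing page
        for href in response.xpath('//*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract():
            yield response.follow(href, callback=self.parse_following_urls, dont_filter=True)

        # then follow the "next" pagination link, if present, and parse that listing page the same way
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Either way, the key point is the same: the request for the next listing page has to be yielded from parse, because that is the page where the pagination link actually exists.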