虽然我编写了一个脚本来从站点抓取数据,并且运行正常,但是在抓取大约18页的数据(因为大约有42页)之后,通过在日志中前后提供日志信息,使scrapy陷入困境。
我访问了关于stackoverflow的类似问题,但是在所有这些脚本中,脚本从一开始就不起作用,而在我的情况下,该脚本从大约18页中抓取数据,然后卡住了。
这是脚本
# -*- coding: utf-8 -*-
import scrapy
import logging
class KhaadiSpider(scrapy.Spider):
name = 'khaadi'
start_urls = ['https://www.khaadi.com/pk/woman.html/']
def parse(self, response):
urls= response.xpath('//ol/li/div/a/@href').extract()
for url in urls:
yield scrapy.Request(url, callback=self.product_page)
next_page=response.xpath('//*[@class="action next"]/@href').extract_first()
while(next_page!=None):
yield scrapy.Request(next_page)
logging.info("Scraped all the pages Successfuly....")
def product_page(self,response):
image= response.xpath('//*[@class="MagicZoom"]/@href').extract_first()
page_title= response.xpath('//*[@class="page-title"]/span/text()').extract_first()
price=response.xpath('//*[@class="price"]/text()').extract_first()
page_url=response.url
yield {'Image':image,
"Page Title":page_title,
"Price":price,
"Page Url":page_url
}
这是记录器信息
2019-10-05 11:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/ksffs19301-blue.html> (referer: https://www.khaadi.com/pk/woman.html?p=18)
2019-10-05 11:22:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/b19428-pink-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/b/1/b19428b.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/b19428-pink-3pc.html', 'Price': u'PKR2,170'}
2019-10-05 11:22:06 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/i19417-blue-2pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/i/1/i19417b.jpg', 'Page Title': u'Shirt Shalwar', 'Page Url': 'https://www.khaadi.com/pk/i19417-blue-2pc.html', 'Price': u'PKR1,680'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wshe19498-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/s/wshe19498_2_.jpg', 'Page Title': u'Embroidered Shalwar', 'Page Url': 'https://www.khaadi.com/pk/wshe19498-off-white.html', 'Price': u'PKR1,800'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19401-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19401_offwhite__1_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19401-off-white.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19402-black.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19402_black__2_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19402-black.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/k19407-yellow-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/1/k19407b.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/k19407-yellow-3pc.html', 'Price': u'PKR2,940'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/k19408-blue-3pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/1/k19408a.jpg', 'Page Title': u'Shirt Shalwar Dupatta', 'Page Url': 'https://www.khaadi.com/pk/k19408-blue-3pc.html', 'Price': u'PKR2,940'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wet19408-pink.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/e/wet19408_pink__1_.jpg', 'Page Title': u'EMBELLISHED TIGHTS', 'Page Url': 'https://www.khaadi.com/pk/wet19408-pink.html', 'Price': u'PKR1,000'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/wbme19474-off-white.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/w/b/wbme19474_offwhite__1_.jpg', 'Page Title': u'Embroidered Metallica Pants', 'Page Url': 'https://www.khaadi.com/pk/wbme19474-off-white.html', 'Price': u'PKR2,400'}
2019-10-05 11:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/ksffs19301-blue.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/k/s/ksffs19301_blue__2_.jpg', 'Page Title': u'Semi Formal Full Suit', 'Page Url': 'https://www.khaadi.com/pk/ksffs19301-blue.html', 'Price': u'PKR18,000'}
2019-10-05 11:22:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 326 pages/min), scraped 307 items (at 307 items/min)
2019-10-05 11:23:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:24:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:25:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:26:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:27:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:28:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:29:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:30:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:31:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:32:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:33:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:34:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:35:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
2019-10-05 11:36:24 [scrapy.extensions.logstats] INFO: Crawled 326 pages (at 0 pages/min), scraped 307 items (at 0 items/min)
所有其他文件都保留为默认文件。
答案 0 :(得分:1)
您的页面前进逻辑是正确的,但是看来您要爬网的服务器可能已具备一些防爬网防御机制。
按原样运行代码时,我得到了类似的结果,一段时间后基本上停止抓取。我怀疑服务器检测到它正在被抓取,或者放慢了速度,或者完全停止了对抓取请求的响应。
仅出于测试目的,我对代码进行了一些调整,以免使服务器受到严重影响,希望保持在抓取检测雷达之下:
for url in urls:
yield scrapy.Request(url, callback=self.product_page)
break # only scrape one product per page
next_page = response.xpath('//*[@class="action next"]/@href').extract_first()
while (next_page != None):
time.sleep(2) # slow down scraping rate
if next_page.endswith('p=2'):
# jump to page 18, skipping what is known to work fine
next_page = re.sub('p=2', 'p=18', next_page)
yield scrapy.Request(next_page)
在进行了这些更改之后,我可以看到(缓慢地)抓取内容越过了页面,页面越早停止并且仍在继续:
2019-10-05 12:57:28 [scrapy.extensions.logstats] INFO: Crawled 32 pages (at 0 pages/min), scraped 15 items (at 0 items/min)
2019-10-05 12:57:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/j19405-off-white-2pc.html> (referer: https://www.khaadi.com/pk/woman.html?p=33)
2019-10-05 12:58:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.khaadi.com/pk/woman.html?p=34> (referer: https://www.khaadi.com/pk/woman.html?p=33)
2019-10-05 12:58:28 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 2 pages/min), scraped 15 items (at 0 items/min)
2019-10-05 12:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.khaadi.com/pk/j19405-off-white-2pc.html>
{'Image': u'https://www.khaadi.com/media/catalog/product/cache/10f519365b01716ddb90abc57de5a837/j/1/j19405a.jpg', 'Page Title': u'Shirt Shalwar', 'Page Url': 'https://www.khaadi.com/pk/j19405-off-white-2pc.html', 'Price': u'PKR1,190'}
2019-10-05 12:59:28 [scrapy.extensions.logstats] INFO: Crawled 34 pages (at 0 pages/min), scraped 16 items (at 1 items/min)
即使检测到这些调整,也可以检测到抓取,因此我进一步对其进行了调整,以跳过更多页面,最终到达最后一页并显示
2019-10-05 14:04:26 [root] INFO: Scraped all the pages Successfuly....
但是scrapy不会关闭,您需要对此进行更多调整。