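I used Scrapy to scrape all images from 102 pages under this base URL: https://www.lazada.vn/dien-thoai-di-dong/. I set a 60-second delay before each request to the next page, because the domain blocks my crawl when Scrapy sends too many requests at once. In the crawl log, I see many download notification lines for the first 2 pages: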
...
...
2019-12-08 12:32:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/robots.txt> (referer: None)
2019-12-08 12:32:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/dien-thoai-di-dong/?page=1> (referer: None)
2019-12-08 12:33:21 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/6d4a70571986291280d27d655f43c33b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ef903a6e40fac5cffde2fac25e9a695c.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5292c25961bf9109d5896bc56f06f1eb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/0e3f0321a5b12183d1caec077c5cddf7.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/c13858cf8aebf3a4474d07ca84100aca.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ac8dfba90e44a4db294ab1ea95d6ec6f.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7d84fca1f4e4a423a6f7ecca1b462c65.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/95c8d5c76b9edc0c13168ee52ddb55d2.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/16021992d9b9ffb1d31bd4ed967cfda5.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/97d51ca35fe5953903b2c53913dc6204.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/69a75548dcb1e779a2c9b183a467c9b1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5580ad66d41ce6eaa91be9113d8e49d1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/781c435a4e5d54ac0f0bd196cab6329b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/90af16a3a5318aa38ac470ac0e78b4e1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7afcecd58b2ba746ce0bc360e78304fb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/b6a6b9d9c1dca7eb4e79071ab5e04dfb.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/fd82fee9d2ec165e1c2bd5946d745660.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ffdd4cdc5b1580c0426c341f0e54c04a.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/65a9ab76d5192a49a90222a7fdbad59f.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/1ebb07247431af734a0f956d9124a2a1.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/098a6071eeddc4d526ff310c8f4edbe3.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/40756a8648be2dbb416890f4f74fda3e.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/8b4b463d6c90b902858606b5978a96ff.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/80dee15ec45cfad725976c5947bf237d.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/5326c3132c9c11559fc75fe9ae9e2b63.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/cebe568afdcbad9f3719d1751a9b1117.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/ce771129fe8859a4609e796d51dc56aa.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/b51e6fcf692e5316cacde25913b86e89.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/6231fb489a949f6a7bf882ad8e85965b.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/3d80fa52c934a3999c8837402f852419.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/7016e51afc586d8725fec94f481f89a6.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/2e60c3233a708fa49c06964ed88792ba.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/2a347b03642f5e53c90fa03cfe8af63e.jpg> referred in <None>
2019-12-08 12:33:21 [scrapy.pipelines.files] DEBUG: File (uptodate): Downloaded image from <GET https://vn-test-11.slatic.net/p/14adae58e3f65f087b6034eb165a1f20.jpg> referred in <None>
...
...
But from the third page to the last one, I no longer see these lines:
...
...
2019-12-08 12:34:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.lazada.vn/dien-thoai-di-dong/?page=3> (referer: None)
2019-12-08 12:35:22 [scrapy.extensions.logstats] INFO: Crawled 13 pages (at 6 pages/min), scraped 80 items (at 40 items/min)
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg'],
'images': [{'checksum': 'dca1d5a23d29d3d1a854d35ff578e3f4',
'path': 'full/e3bea7d5eb5bc56158e4e69f0312877c96e5ac6f.jpg',
'url': 'https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg'}],
'price': '1290000.00',
'title': 'Điện thoại oppo a37 neo9 fullbox ram2 bộ nhớ 16gb'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg'],
'images': [{'checksum': '97f7bf79aed3e1013510e30673a35ee8',
'path': 'full/e0b37ed8016f08a3aeb08cbf22c4096a1bf37fca.jpg',
'url': 'https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg'}],
'price': '2890000.00',
'title': 'Điện thoại oppo f9 fullbox ram4 bộ nhớ 64gb Liên quân pubg chiến '
'mượt'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg'],
'images': [{'checksum': '9b3990378a8b5152969369fcba144271',
'path': 'full/e2d7a2a6a18b6791a8ed1d7fe8d9d35713f07d76.jpg',
'url': 'https://vn-test-11.slatic.net/p/9b72137283d02a2c76ecc1a06f78ef5d.jpg'}],
'price': '3390000.00',
'title': 'Điện thoại IPH0NE_8_PLUS Hàng fullbox 256GB, tặng Tai nghe '
'Bluetooth, Xả khó giá cực sốc'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/0951417571ed48aafd9e0b0108a42cb4.jpg'],
'images': [{'checksum': 'f88bd5e92730b682b5d1925bbae3be4d',
'path': 'full/4737babbe6e3238252c00a882e8bf9ab6529658e.jpg',
'url': 'https://vn-test-11.slatic.net/p/0951417571ed48aafd9e0b0108a42cb4.jpg'}],
'price': '1850000.00',
'title': 'ĐIện_Thoại_IPHONE7_PLUS_256GB'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/55a39333530c4045e10b4589be0ad36a.jpg'],
'images': [{'checksum': '0423081571336a25c01c8aa0d99c0458',
'path': 'full/532b38fb6a4c95a55b0892e8fe76bae8ff031d26.jpg',
'url': 'https://vn-test-11.slatic.net/p/55a39333530c4045e10b4589be0ad36a.jpg'}],
'price': '2749900.00',
'title': 'ĐIện_Thoại_IPHONEXS_MAX_512GB'}
2019-12-08 12:35:23 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.lazada.vn/dien-thoai-di-dong/?page=3>
{'image_urls': ['https://vn-test-11.slatic.net/p/a1162694a5b65056d8b2fff54d2fd7b7.jpg'],
'images': [{'checksum': '78d8042b5b80d3f713846fab71ac583d',
'path': 'full/0a8bd03ec3f5cb2c96439c99c503a9e40aa8afb3.jpg',
'url': 'https://vn-test-11.slatic.net/p/a1162694a5b65056d8b2fff54d2fd7b7.jpg'}],
'price': '1950000.00',
'title': 'ĐIện_Thoại_IPHONE8_PLUS_256GB'}
...
...
At the end of the run, the final log shows that Scrapy crawled 102 pages containing about 4000 images, but downloaded only 153 images:
2019-12-08 11:34:21 [scrapy.extensions.logstats] INFO: Crawled 259 pages (at 1 pages/min), scraped 4000 items (at 0 items/min)
2019-12-08 11:34:21 [scrapy.core.engine] INFO: Closing spider (finished)
2019-12-08 11:34:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 70175,
'downloader/request_count': 259,
'downloader/request_method_count/GET': 259,
'downloader/response_bytes': 27542657,
'downloader/response_count': 259,
'downloader/response_status_count/200': 259,
'elapsed_time_seconds': 6292.681676,
'file_count': 153,
'file_status_count/downloaded': 153,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 12, 8, 11, 34, 21, 729083),
'item_scraped_count': 4000,
'log_count/DEBUG': 4412,
'log_count/INFO': 114,
'memusage/max': 128307200,
'memusage/startup': 55721984,
'response_received_count': 259,
'robotstxt/request_count': 4,
'robotstxt/response_count': 4,
'robotstxt/response_status_count/200': 4,
'scheduler/dequeued': 102,
'scheduler/dequeued/memory': 102,
'scheduler/enqueued': 102,
'scheduler/enqueued/memory': 102,
'start_time': datetime.datetime(2019, 12, 8, 9, 49, 29, 47407)}
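One possible check, given that the log above shows the same image URL (for example the one ending in d05e089b69960fa7e12851639db54833.jpg) both in the first download lines and in an item scraped from page 3, is to count how many distinct image URLs the 4000 items actually contain. The sketch below is only a diagnostic, not a confirmed explanation; `count_unique_image_urls` and `sample_items` are hypothetical names, and the sample data merely mimics the shape of the items in the log.

```python
from collections import Counter


def count_unique_image_urls(items):
    """Return (total_urls, unique_urls) over every item's 'image_urls' list."""
    counts = Counter(
        url for item in items for url in item.get("image_urls", [])
    )
    return sum(counts.values()), len(counts)


# Hypothetical sample shaped like the scraped items in the log above.
sample_items = [
    {"image_urls": ["https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg"]},
    {"image_urls": ["https://vn-test-11.slatic.net/p/e8a91ffdc7d36fd708bcc959b1e85a05.jpg"]},
    {"image_urls": ["https://vn-test-11.slatic.net/p/d05e089b69960fa7e12851639db54833.jpg"]},
]

total, unique = count_unique_image_urls(sample_items)
print(f"{total} image URLs, {unique} unique")
```

To run this on real data, the items could be exported first (for instance with `scrapy crawl lazada -o items.json`) and loaded with `json.load`. If the unique count turns out to be close to 153, the pages are probably returning largely the same products.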
Here is my code:
SPIDER
import json
import re
import time

import scrapy

from scrapy_lazada_test.items import ScrapyLazadaTestItem


class LazadaSpider(scrapy.Spider):
    name = "lazada"
    allowed_domains = ['lazada.vn']

    def start_requests(self):
        max_page_number = 102
        base_url = 'https://www.lazada.vn/dien-thoai-di-dong/'
        for i in range(1, max_page_number + 1):
            url = base_url + '?page=' + str(i)
            # delay before sending the request for the next page
            time.sleep(60)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second JSON-LD <script> block holds the product list (itemListElement)
        result = response.xpath('//html/body/script[@type="application/ld+json"][2]').re(r'(?<=itemListElement":)(.*?)(\}\<\/script>)')
        products = json.loads(result[0])
        for p in products:
            item = ScrapyLazadaTestItem()
            item["image_urls"] = [p["image"]]
            item["title"] = p["name"]
            item["price"] = p["offers"]["price"]
            yield item
ITEM
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class ScrapyLazadaTestItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    images = scrapy.Field()
    image_urls = scrapy.Field()
SETTINGS
BOT_NAME = 'scrapy_lazada_test'
SPIDER_MODULES = ['scrapy_lazada_test.spiders']
NEWSPIDER_MODULE = 'scrapy_lazada_test.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapy_lazada_test (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "/home/mmlab/scrapy_lazada_test/result/"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 15
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
#CONCURRENT_REQUESTS_PER_IP = 16
...
...
I tried setting CONCURRENT_REQUESTS = 1 and CONCURRENT_REQUESTS_PER_DOMAIN = 1, but it behaves the same as before. How can I fix this?
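For comparison, the delay could also be expressed through Scrapy's own throttling settings rather than `time.sleep(60)` in `start_requests`. Scrapy runs on a single-threaded Twisted reactor, so a blocking sleep pauses everything, image downloads included, while it waits. The fragment below is a sketch of what might be tried, with the `time.sleep` line removed from the spider; the numeric values are illustrative assumptions, not tuned for this site, and it is not a confirmed fix for the missing images.

```python
# settings.py (sketch; the values are assumptions chosen for illustration)
DOWNLOAD_DELAY = 15                 # seconds to wait between requests in the same download slot
RANDOMIZE_DOWNLOAD_DELAY = True     # vary each wait between 0.5x and 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain

AUTOTHROTTLE_ENABLED = True         # adapt the delay to the latency Scrapy observes
AUTOTHROTTLE_START_DELAY = 15       # initial delay used by AutoThrottle
AUTOTHROTTLE_MAX_DELAY = 60         # upper bound on the adapted delay
```

One caveat: the delay is enforced per download slot, and image requests issued by the images pipeline also pass through the downloader, so a large `DOWNLOAD_DELAY` would slow image downloads from vn-test-11.slatic.net as well.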