I am running a spider defined as follows:
import scrapy
from scrapy.spiders import SitemapSpider

# BaseSpider is the project's own base class (its import is not shown here).


class ApkmirrorSitemapSpider(SitemapSpider, BaseSpider):
    name = 'apkmirror'
    sitemap_urls = ['http://www.apkmirror.com/sitemap_index.xml']
    sitemap_rules = [(r'.*-android-apk-download/$', 'parse')]

    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 0,
        'CLOSESPIDER_ERRORCOUNT': 1,
        'CONCURRENT_REQUESTS': 32,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,
        'TOR_RENEW_IDENTITY_ENABLED': True,
        'TOR_ITEMS_TO_SCRAPE_PER_IDENTITY': 50,
        'FEED_URI': '/scraper/apkmirror_scraper/data/apkmirror.json',
        'FEED_FORMAT': 'json',
        'DUPEFILTER_CLASS': 'apkmirror_scraper.dupefilters.URLDupefilter',
        'DUPEFILTER_DEBUG': True
    }

    download_timeout = 60 * 15.0  # Allow 15 minutes for downloading APKs

    def start_requests(self):
        # Request each sitemap explicitly so the dupefilter never drops it.
        for url in self.sitemap_urls:
            yield scrapy.Request(url, self._parse_sitemap, dont_filter=True)
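The parse method is defined in the BaseSpider class. After running the spider (with several pauses and resumes using a JOBDIR), it finishes with the finish reason "finished":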
scraper_1 | 2017-06-22 04:54:24 [scrapy.core.engine] INFO: Closing spider (finished)
scraper_1 | 2017-06-22 04:54:24 [scrapy.extensions.feedexport] INFO: Stored json feed (1770 items) in: /scraper/apkmirror_scraper/data/apkmirror.json
scraper_1 | 2017-06-22 04:54:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
scraper_1 | {'downloader/exception_count': 328,
scraper_1 | 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 55,
scraper_1 | 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 273,
scraper_1 | 'downloader/request_bytes': 4655909,
scraper_1 | 'downloader/request_count': 9184,
scraper_1 | 'downloader/request_method_count/GET': 9184,
scraper_1 | 'downloader/response_bytes': 46533396148,
scraper_1 | 'downloader/response_count': 8856,
scraper_1 | 'downloader/response_status_count/200': 6980,
scraper_1 | 'downloader/response_status_count/301': 2,
scraper_1 | 'downloader/response_status_count/302': 1667,
scraper_1 | 'downloader/response_status_count/403': 39,
scraper_1 | 'downloader/response_status_count/404': 20,
scraper_1 | 'downloader/response_status_count/429': 144,
scraper_1 | 'downloader/response_status_count/500': 4,
scraper_1 | 'dupefilter/filtered': 34509,
scraper_1 | 'file_count': 3359,
scraper_1 | 'file_status_count/downloaded': 3256,
scraper_1 | 'file_status_count/uptodate': 103,
scraper_1 | 'finish_reason': 'finished',
scraper_1 | 'finish_time': datetime.datetime(2017, 6, 22, 4, 54, 24, 601909),
scraper_1 | 'httperror/response_ignored_count': 135,
scraper_1 | 'httperror/response_ignored_status_count/403': 9,
scraper_1 | 'httperror/response_ignored_status_count/404': 20,
scraper_1 | 'httperror/response_ignored_status_count/429': 106,
scraper_1 | 'item_scraped_count': 1770,
scraper_1 | 'log_count/DEBUG': 213244,
scraper_1 | 'log_count/INFO': 5760,
scraper_1 | 'log_count/WARNING': 1813,
scraper_1 | 'memusage/max': 1292906496,
scraper_1 | 'memusage/startup': 75767808,
scraper_1 | 'request_depth_max': 3,
scraper_1 | 'response_received_count': 7183,
scraper_1 | 'retry/count': 292,
scraper_1 | 'retry/max_reached': 40,
scraper_1 | 'retry/reason_count/500 Internal Server Error': 4,
scraper_1 | 'retry/reason_count/twisted.internet.error.TimeoutError': 46,
scraper_1 | 'retry/reason_count/twisted.web._newclient.ResponseFailed': 242,
scraper_1 | 'scheduler/dequeued': 3861,
scraper_1 | 'scheduler/dequeued/disk': 3861,
scraper_1 | 'scheduler/enqueued': 3861,
scraper_1 | 'scheduler/enqueued/disk': 3861,
scraper_1 | 'start_time': datetime.datetime(2017, 6, 21, 17, 59, 45, 954085)}
scraper_1 | 2017-06-22 04:54:24 [scrapy.core.engine] INFO: Spider closed (finished)
However, the resulting JSON file appears to be missing its closing bracket. It ends as follows:
{"url": "http://www.apkmirror.com/apk/google-inc/google-play-games-android-tv/google-play-games-android-tv-3-9-08-release/google-play-games-3-9-08-3448271-846-android-apk-download/", "title": "Google Play Games (Android TV) 3.9.08 (3448271-846) (arm64) (nodpi) (Android 5.0+)", "developer": "Google Inc.", "app": "Google Play Games", "version_name": "3.9.08 (3448271-846)", "version_code": "39080846", "architectures": ["arm64"], "package": "com.google.android.play.games", "apk_file_size": 23454231, "android_min_version": "5.0", "android_target_version": "6.0", "supported_dpis": ["nodpi"], "md5_signature": "f481aeab3540bdaf7457b1fe11d31851", "time_uploaded": "2016-11-19 07:20:00", "time_scraped": "2017-06-02 11:10:41", "image_urls": ["http://www.apkmirror.com/wp-content/themes/APKMirror/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.apkmirror.com%2Fwp-content%2Fuploads%2F2016%2F11%2F582fbbe24dbe4.png&w=96&h=96&q=100"], "file_urls": ["http://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=140686"], "images": [{"url": "http://www.apkmirror.com/wp-content/themes/APKMirror/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.apkmirror.com%2Fwp-content%2Fuploads%2F2016%2F11%2F582fbbe24dbe4.png&w=96&h=96&q=100", "path": "full/fd01c795aea2baeadf5812bbb8682c38ad3ab5bf.jpg", "checksum": "d284bafd5f7a2dd16b190c6342f5998f"}], "files": []},
{"url": "http://www.apkmirror.com/apk/google-inc/maps/maps-9-42-0-release/maps-navigation-transit-9-42-0-4-android-apk-download/", "title": "Google Maps - Navigation & Transit 9.42.0 beta (arm) (320dpi) (Android 4.3+)", "developer": "Google Inc.", "app": "Google Maps - Navigation & Transit", "version_name": "9.42.0", "version_code": "942005123", "architectures": ["arm"], "package": "com.google.android.apps.maps", "apk_file_size": 37919295, "android_min_version": "4.3", "android_target_version": "7.1", "supported_dpis": ["320dpi"], "md5_signature": "1cc90b11abcf3efbe95c478393cb07e6", "time_uploaded": "2016-11-19 05:36:00", "time_scraped": "2017-06-02 11:11:37", "image_urls": ["http://www.apkmirror.com/wp-content/themes/APKMirror/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.apkmirror.com%2Fwp-content%2Fuploads%2F2016%2F11%2F582fc92ce70bb.png&w=96&h=96&q=100"], "file_urls": ["http://www.apkmirror.com/wp-content/themes/APKMirror/download.php?id=140704"], "images": [{"url": "http://www.apkmirror.com/wp-content/themes/APKMirror/ap_resize/ap_resize.php?src=http%3A%2F%2Fwww.apkmirror.com%2Fwp-content%2Fuploads%2F2016%2F11%2F582fc92ce70bb.png&w=96&h=96&q=100", "path": "full/55b4f171ca5e156ed3642abfa38c7b4aad206c09.jpg", "checksum": "39d7fdca0ab6384cb2f14ad1ea1474bd"}], "files": []},
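Unlike with the (default) JSON lines (.jl) format, the individual dictionaries are separated by commas, but the closing square bracket seems to be missing. (I have confirmed that there is an opening square bracket.)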
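For reference, this is roughly how the brackets can be checked without loading the whole multi-megabyte feed into memory. It is a minimal sketch: the helper name feed_brackets and the 64-byte window are my own choices for illustration, not part of Scrapy.

```python
import os


def feed_brackets(path, window=64):
    """Return (has_opening, has_closing) for a JSON array feed file.

    Only the first and last `window` bytes are read, so this works on
    large feeds. If the file ends with more than `window` bytes of
    whitespace, has_closing will be reported as False.
    """
    with open(path, "rb") as f:
        # First non-whitespace byte should be the opening bracket.
        head = f.read(window).lstrip()
        # Jump to the end and read the last `window` bytes.
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))
        tail = f.read().rstrip()
    return head.startswith(b"["), tail.endswith(b"]")
```

Called on a feed shaped like the one above, it would be expected to return (True, False).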
This strikes me as suspicious: if the spider closed normally, shouldn't the closing bracket have been added to the JSON file?
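To make the expectation concrete, here is a toy model of how I assume a JSON array exporter behaves: it writes the opening bracket when it starts, the items in between, and the closing bracket only in a finish step. ToyJsonArrayWriter and render are hypothetical names for illustration. This is not Scrapy's actual JsonItemExporter code, and it does not reproduce the exact comma placement seen in my file.

```python
import io
import json


class ToyJsonArrayWriter:
    """Hypothetical stand-in for a JSON array exporter (not Scrapy's code)."""

    def __init__(self, stream):
        self.stream = stream
        self.first = True

    def start(self):
        self.stream.write("[\n")

    def write_item(self, item):
        # Separate items with a comma, except before the first one.
        if not self.first:
            self.stream.write(",\n")
        self.first = False
        self.stream.write(json.dumps(item))

    def finish(self):
        # The array only becomes valid JSON once this step runs.
        self.stream.write("\n]")


def render(items, call_finish=True):
    """Write items through the toy writer and return the resulting text."""
    stream = io.StringIO()
    writer = ToyJsonArrayWriter(stream)
    writer.start()
    for item in items:
        writer.write_item(item)
    if call_finish:
        writer.finish()
    return stream.getvalue()
```

Under this model, render(items) parses with json.loads, while render(items, call_finish=False) yields an array that is opened but never closed. That is why I would expect a spider that reports "finished" to have written the closing bracket.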