I have written a Python Scrapy spider to scrape 60,000 pages from a website, extracting one item from each page. When I scrape around 4,000 pages I have no problems: I get a 200 status from every page and all the required data is extracted.
However, when I scale up to 8,000 pages, I can sometimes scrape all the data with 200 status codes, but on other runs very few items come back. The output still shows all 8,000 URLs, each with a 200 status code, yet very few of the associated items are extracted.
I have noticed this behavior occurs when the crawler has successfully scraped all 8,000 pages and is then run again almost immediately. I see the same behavior when it attempts all 60,000 pages.
I have the following enabled in my settings.py file:
ROBOTSTXT_OBEY = True
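For what it's worth, if the site is rate-limiting me, I assume throttle settings along these lines would be the first thing to try. AUTOTHROTTLE_* and DOWNLOAD_DELAY are standard Scrapy settings, but the values here are untested guesses, not something I have tuned for this site:

    # Untested guesses at throttle-related settings (standard Scrapy names,
    # placeholder values)
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
    AUTOTHROTTLE_MAX_DELAY = 10.0          # ceiling when the server slows down
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per site
    DOWNLOAD_DELAY = 0.5                   # fixed floor between requests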
My code is as follows:
import scrapy
import re
import sqlite3
import time
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class UserItem(scrapy.Item):
    url = scrapy.Field()
    status = scrapy.Field()
    registered = scrapy.Field()


class TestSpider(scrapy.Spider):
    name = "website"
    allowed_domains = ['source_website']
    start_urls = ['http://source_website/page1',
                  'http://source_website/page2',
                  'http://source_website/page3',
                  'http://source_website/page4',
                  'http://source_website/page5']

    def parse(self, response):
        # Follow every link found on the start pages
        all_links = response.xpath('*//a/@href').extract()
        for link in all_links:
            yield scrapy.http.Request(url=response.urljoin(link),
                                      callback=self.ParseContents)

    def ParseContents(self, response):
        item = UserItem()
        item['url'] = response.url
        item['status'] = response.status
        # The value we want is the text node that follows a "Registered" label
        data = iter(response.xpath('//html//text()').extract())
        for text in data:
            if "Registered" in text:
                item['registered'] = next(data)
        yield item
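To check whether some of those 200 responses are actually "soft" failures (a 200 page that lacks the real content), I am considering a check like this inside ParseContents. This is only a sketch; the "Registered" marker comes from my existing code, and the body-length logging is my own addition:

    def ParseContents(self, response):
        item = UserItem()
        item['url'] = response.url
        item['status'] = response.status
        data = response.xpath('//html//text()').extract()
        # Sketch: flag a 200 response whose body lacks the expected
        # "Registered" marker -- likely a block or throttle page
        if not any("Registered" in text for text in data):
            self.logger.warning("200 without 'Registered' marker: %s (%d bytes)",
                                response.url, len(response.body))
        it = iter(data)
        for text in it:
            if "Registered" in text:
                item['registered'] = next(it)
        yield item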
Log of a successful scrape:
2018-03-10 17:49:09 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 2489755,
'downloader/request_count': 7524,
'downloader/request_method_count/GET': 7524,
'downloader/response_bytes': 19301544,
'downloader/response_count': 7524,
'downloader/response_status_count/200': 7520,
'downloader/response_status_count/302': 4,
'dupefilter/filtered': 45,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 3, 10, 17, 49, 9, 119209),
'item_scraped_count': 7517,
'log_count/DEBUG': 15046,
'log_count/INFO': 14,
'memusage/max': 123379712,
'memusage/startup': 49389568,
'offsite/domains': 2,
'offsite/filtered': 4,
'request_depth_max': 1,
'response_received_count': 7520,
'scheduler/dequeued': 7524,
'scheduler/dequeued/memory': 7524,
'scheduler/enqueued': 7524,
'scheduler/enqueued/memory': 7524,
'start_time': datetime.datetime(2018, 3, 10, 17, 42, 25, 211537)}
2018-03-10 17:49:09 [scrapy.core.engine] INFO: Spider closed (finished)
Ironically, while writing this update I was unable to produce a log of an unsuccessful scrape, because this time the spider scraped more than 8,000 pages without trouble.
Could this be the website throttling the traffic it receives, making the spider's behavior sporadic?
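If that is what is happening, one way I can think of to confirm it is to log the body size and any rate-limit headers for every response, then compare a good run against a bad one. A rough, untested sketch; the 'Retry-After' header is a guess about what the site might send:

        # Rough sketch to drop into ParseContents: log body size and any
        # rate-limit header so a good run can be diffed against a bad one
        # (the site may not actually send 'Retry-After')
        self.logger.info("%s -> %d bytes, Retry-After=%s",
                         response.url, len(response.body),
                         response.headers.get('Retry-After'))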
Any ideas on how to troubleshoot this would be greatly appreciated.