I'm currently working on a project to scrape a wiki-like site, a database of historical figures with some basic information (each person's details are on their own page). There are a few million names (just under 3 million), so I'd like the crawler to actually scrape all of the information while not harming the site. I'm completely new to this, so I was wondering whether people could point me toward best practices for web scraping. Specifically, here is part of my settings file:
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = True
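For reference, my understanding of how these settings interact, based on the AutoThrottle docs, is roughly the following (just a sketch of my mental model, not Scrapy's actual code):

def next_autothrottle_delay(prev_delay, latency,
                            target_concurrency=1.0,  # AUTOTHROTTLE_TARGET_CONCURRENCY
                            min_delay=5.0,           # DOWNLOAD_DELAY
                            max_delay=60.0):         # AUTOTHROTTLE_MAX_DELAY
    # Aim for `target_concurrency` parallel requests per remote server.
    target = latency / target_concurrency
    # The next delay is the average of the previous delay and the target delay.
    new_delay = (prev_delay + target) / 2.0
    # The delay never drops below DOWNLOAD_DELAY or rises above AUTOTHROTTLE_MAX_DELAY.
    return max(min_delay, min(new_delay, max_delay))

If that's right, DOWNLOAD_DELAY = 5 acts as a hard floor, and AutoThrottle can only ever slow the crawl down further.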
I've uncommented the AutoThrottle settings and set DOWNLOAD_DELAY to 5 seconds. However, that will make the scraper run far too slowly. Is that unavoidable if I don't want to get banned? What do people usually set the DOWNLOAD_DELAY parameter to?
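For context, here is the back-of-envelope math behind "too slow" (assuming roughly 3 million pages and effectively one request in flight at a time):

pages = 3_000_000
delay_seconds = 5                          # DOWNLOAD_DELAY; AutoThrottle won't go below this
total_days = pages * delay_seconds / 86_400
print(f"~{total_days:.0f} days")           # roughly 174 days for a single full crawl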