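I am scraping a series of websites that publish public reports. My project is built with Scrapy, and I am using the delta-fetch plugin to make sure that once I have scraped a report from one of my sites, its URL is skipped on the next crawl.
After an initial crawl with delta-fetch, in which the URLs are scraped and all of the desired items are returned, subsequent crawls should produce log lines like this: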
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58284.html>
However, one of my spiders has a problem. After the initial crawl, delta-fetch reports:
2018-06-15 16:26:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats {'deltafetch/stored': 263,
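Given this result, on my next crawl I should see 263 "INFO: Ignoring already visited:" log messages, plus the scraping of any new reports recently published at new URLs. That is not what is happening...
Here is the log output from the second crawl.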
2018-06-15 18:24:43 [scrapy.core.engine] INFO: Spider opened
2018-06-15 18:24:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-15 18:24:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-06-15 18:24:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ahjjjc.gov.cn/sggb/index.html> (referer: None)
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58704.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58284.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58283.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/56876.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/56008.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55844.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55845.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55741.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/index_2.html>
2018-06-15 18:24:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ahjjjc.gov.cn/sggb286/index.html> (referer: None)
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58962.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58960.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58920.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58803.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58757.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58725.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58687.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58686.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/index_2.html>
2018-06-15 18:24:47 [scrapy.core.engine] INFO: Closing spider (finished)
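The code in my spider is below. I start the crawl from the index pages, then iterate over and request each article URL, which is passed on to my report-parsing method (parse_docs in the code), where the items are scraped.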
import scrapy
from scrapy.utils.request import request_fingerprint
...
class anhui_cdi(scrapy.Spider):
    ...
    start_urls = [
        'http://www.ahjjjc.gov.cn/sggb/index.html',
        'http://www.ahjjjc.gov.cn/sggb286/index.html',
    ]

    def parse(self, response):
        # Collect the article links listed on the index page
        urls = response.xpath('//dl[@class="clearfix"]/dt/a/@href').extract()
        for href in urls:
            # Request each article and hand the response to parse_docs
            yield response.follow(href, self.parse_docs, meta={'deltafetch_key': request_fingerprint(response.request)})
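One thing that may be worth checking is how the deltafetch_key value in parse() is computed: it is derived from response.request, which is the request for the index page rather than for the article being followed. The sketch below only illustrates that difference. It does not use Scrapy, and simple_fingerprint is a hypothetical stand-in for a request fingerprint (a SHA-1 of the method and URL), not the real request_fingerprint implementation.

```python
import hashlib


def simple_fingerprint(url):
    """Hypothetical stand-in for a request fingerprint: SHA-1 of 'GET <url>'."""
    return hashlib.sha1(("GET " + url).encode("utf-8")).hexdigest()


index_url = "http://www.ahjjjc.gov.cn/sggb/index.html"
article_urls = [
    "http://www.ahjjjc.gov.cn/sggb/p/58704.html",
    "http://www.ahjjjc.gov.cn/sggb/p/58284.html",
    "http://www.ahjjjc.gov.cn/sggb/p/58283.html",
]

# Key built from the index page's request, once per article link
# (mirrors the shape of the meta dict in parse()).
keys_from_index = {simple_fingerprint(index_url) for _ in article_urls}

# Key built from each article's own URL.
keys_per_article = {simple_fingerprint(url) for url in article_urls}

print(len(keys_from_index))   # number of distinct keys when keyed on the index page
print(len(keys_per_article))  # number of distinct keys when keyed on each article
```

If the real fingerprint behaves the same way, every article yielded from one index page would share a single key, so a newly published article might be treated as already visited once that key has been stored. Whether this accounts for the log above is an assumption, not something the output confirms.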
Any ideas about what might be going on would be greatly appreciated!
In case you are curious, here are the delta-fetch settings in settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 51,
}
DELTAFETCH_ENABLED = True