Scrapy delta-fetch not scraping new items

Date: 2018-06-15 19:01:27

Tags: python web-scraping scrapy

I am scraping a set of websites that publish public reports. My project is built with Scrapy, and I am using the delta-fetch plugin so that once I have scraped a report from one of my sites, that URL is skipped on the next crawl.

After an initial crawl with delta-fetch, in which the URLs are scraped and all the desired items are returned, subsequent crawls should produce log lines like this:

2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58284.html>
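
My understanding of the plugin, which may be off: each outgoing request is keyed on request.meta['deltafetch_key'] if that is set, otherwise on the request fingerprint, and once items have been scraped from a response the key of that response's request is stored, so any later request carrying the same key is dropped with the message above. A minimal sketch of how I picture the key being derived (my own paraphrase, not the plugin's actual code):

from scrapy.utils.request import request_fingerprint

def deltafetch_key(request):
    # My paraphrase of the lookup key -- not the plugin's real implementation:
    # an explicit 'deltafetch_key' in request.meta wins, otherwise the fingerprint.
    return request.meta.get('deltafetch_key', request_fingerprint(request))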

However, something has gone wrong with one of my spiders. After the initial crawl, delta-fetch reports:

2018-06-15 16:26:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats {'deltafetch/stored': 263, 

Given that result, on my next crawl I should see 263 "INFO: Ignoring already visited:" log messages, plus crawls of the new reports that have recently been published at new URLs. That is not what is happening...

Here is the log output from the second crawl:

2018-06-15 18:24:43 [scrapy.core.engine] INFO: Spider opened
2018-06-15 18:24:43 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-06-15 18:24:43 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-06-15 18:24:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ahjjjc.gov.cn/sggb/index.html> (referer: None)
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58704.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58284.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/58283.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/56876.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/56008.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55844.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55845.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/p/55741.html>
2018-06-15 18:24:45 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb/index_2.html>
2018-06-15 18:24:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.ahjjjc.gov.cn/sggb286/index.html> (referer: None)
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58962.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58960.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58920.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58803.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58757.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58725.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58687.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/p/58686.html>
2018-06-15 18:24:47 [scrapy_deltafetch.middleware] INFO: Ignoring already visited: <GET http://www.ahjjjc.gov.cn/sggb286/index_2.html>
2018-06-15 18:24:47 [scrapy.core.engine] INFO: Closing spider (finished)

The code in my spider is below. I start by crawling a directory page, then loop over and request each article URL, passing it on to my parse_docs method, where the items are scraped.

import scrapy
from scrapy.utils.request import request_fingerprint 
...
class anhui_cdi(scrapy.Spider):
...
    start_urls = [
            'http://www.ahjjjc.gov.cn/sggb/index.html',
            'http://www.ahjjjc.gov.cn/sggb286/index.html',
    ]

    def parse(self, response):
        # Collect the article links from the listing page and request each one,
        # tagging the request with a deltafetch_key (the fingerprint of the
        # listing-page request itself).
        urls = response.xpath('//dl[@class="clearfix"]/dt/a/@href').extract()
        for href in urls:
            yield response.follow(href, self.parse_docs, meta={'deltafetch_key': request_fingerprint(response.request)})
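
One thing I noticed while writing this up: request_fingerprint(response.request) is computed from the listing-page request, so every article request yielded from the same listing page carries the same deltafetch_key. A variant that keys each request on its own article URL instead would look roughly like this (just a sketch; using response.urljoin for the absolute URL is my own choice, not something the plugin requires):

    def parse(self, response):
        # Same listing-page parse, but each article request gets its own key
        # (its absolute URL) rather than sharing the listing page's fingerprint.
        urls = response.xpath('//dl[@class="clearfix"]/dt/a/@href').extract()
        for href in urls:
            yield response.follow(href, self.parse_docs, meta={'deltafetch_key': response.urljoin(href)})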

Any ideas about what might be going on would be greatly appreciated!

In case you are curious... the delta-fetch settings in settings.py:
SPIDER_MIDDLEWARES = {
    'scrapy_deltafetch.DeltaFetch': 51,
}

DELTAFETCH_ENABLED = True
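
For completeness: as far as I remember from the plugin's README, the stored database can also be cleared with a DELTAFETCH_RESET setting or an equivalent spider argument (the exact names below are from memory, so treat them as approximate):

# settings.py -- clear the delta-fetch database on the next run
DELTAFETCH_RESET = True

# or per run, from the command line:
# scrapy crawl anhui_cdi -a deltafetch_reset=1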

0 Answers
