How to override make_requests_from_url so that dont_filter becomes False in Scrapy

Asked: 2016-12-13 14:18:02

Tags: python scrapy scrapy-spider

I am new to Scrapy.

Scrapy's make_requests_from_url() function (shown below) sets dont_filter to True.

def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)

I want to override it so that dont_filter is False.

def make_requests_from_url(self, url):
    return Request(url, callback=self.parse, dont_filter=False)

But it does not work: the spider cannot crawl the site recursively and stops almost immediately.

Here is my complete code:
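For context, the sketch below illustrates what the dont_filter flag is generally understood to control: whether a request bypasses the duplicate check before it is queued. It is a simplified model, not Scrapy's actual scheduler; ToyScheduler and its enqueue method are hypothetical names introduced only for this illustration.

```python
class ToyScheduler(object):
    """Hypothetical stand-in for a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # URLs that went through the duplicate check
        self.queue = []     # URLs accepted for download

    def enqueue(self, url, dont_filter=False):
        # Only requests without dont_filter are checked against (and
        # recorded in) the set of already-seen URLs.
        if not dont_filter:
            if url in self.seen:
                return False
            self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
url = "http://www.duiduilian.com/daquan"
first = scheduler.enqueue(url)                      # new URL, accepted
repeat = scheduler.enqueue(url)                     # duplicate, dropped
forced = scheduler.enqueue(url, dont_filter=True)   # bypasses the check
print(first, repeat, forced)
```

Under this model, dont_filter=True on every request lets the same pages be queued repeatedly, while dont_filter=False lets each URL through only once.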

# coding:utf-8
from scrapy.http import Request
import re
import scrapy.dupefilter


class DuilianSpider(scrapy.Spider):

    name = "duilian"
    allowed_domains = ["duilian.com"]
    start_urls = [
        "http://www.duiduilian.com/daquan",
        "http://www.duiduilian.com",
        "http://www.duiduilian.com/chunlian/2945.html",
        "http://www.duiduilian.com/chunlian/27zi.html"
    ]

    def parse(self, response):
        # print texts
        sel = response.selector
        re_url = response.url
        if re_url[-1] is '/':
            re_url = re_url[:-1]
            print re_url
        file = re.compile(r'/').split(re_url)[-1]
        # print file
        fp = open('/Users/myname/duilian/duilian/data/' + str(file), 'a')
        fp_record = open('/Users/myname/duilian/duilian/data/url', 'a')
        fp_record.write('\n' + re_url)

        texts_1 = sel.xpath('//div[@class="content_zw"]/text()').extract()
        for text in texts_1:
            fp.write(text.encode('utf-8'))
            fp_record.write(text.encode('utf-8'))
        fp.write('\n')

        texts_2 = sel.xpath('//div[@class="content_zw"]/div/text()').extract()
        for textp in texts_2:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
        fp.write('\n')

        texts_3 = sel.xpath('//div[@class="content_zw"]/p/text()').extract()
        for textp in texts_3:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
        fp.write('\n')

        texts_4 = sel.xpath('//div[@class="content_zw"]/p/font/text()').extract()
        for textp in texts_4:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
        fp.write('\n')

        urls = response.xpath('//div[@class="main_box_right"]/div/a/@href').extract()
        fp.close()

        for url in urls:
            url1 = "http://www.duiduilian.com/" + str(url)
            # dui = DuilianSpider()
            # yield dui.make_requests_from_url(url1)
            yield self.make_requests_from_url(url1)

    # def make_requests_from_url(self, url):
    #     return Request(url, callback=self.parse, dont_filter=False)
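A side note on the URL building in parse(): the logs below show addresses such as http://www.duiduilian.com//gushi/ with a doubled slash, which suggests the extracted href values already start with "/". The following sketch, written for Python 3 (the spider above uses Python 2 syntax), compares plain concatenation with the standard library's urljoin. The href value is an assumed example inferred from the logs, not taken from the page itself.

```python
from urllib.parse import urljoin

# Page the link was found on (one of the start_urls above).
base = "http://www.duiduilian.com/chunlian/27zi.html"
# Assumed example of a root-relative href, inferred from the logged URLs.
href = "/gushi/"

# Plain concatenation, as in parse(), keeps both slashes.
concatenated = "http://www.duiduilian.com/" + href

# urljoin resolves the href against the page URL instead.
joined = urljoin(base, href)

print(concatenated)
print(joined)
```

Inside a Scrapy callback, response.urljoin(url) performs the same kind of resolution against the response URL.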

When I use the built-in make_requests_from_url(), the log is:

     /Users/myname/duilian>scrapy crawl duilian 
/Users/myname/duilian/duilian/spiders/duilian_spiders.py:4: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  import scrapy.dupefilter
http://www.duiduilian.com/chunlian/2945.html
2945.html
2016-12-13 23:22:41 [scrapy] INFO: Scrapy 1.2.2 started (bot: duilian)
2016-12-13 23:22:41 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'duilian.spiders', 'SPIDER_MODULES': ['duilian.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'duilian'}
2016-12-13 23:22:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-12-13 23:22:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-13 23:22:42 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-13 23:22:42 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-13 23:22:42 [scrapy] INFO: Spider opened
2016-12-13 23:22:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-13 23:22:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6052
2016-12-13 23:22:42 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/robots.txt> (referer: None)
2016-12-13 23:22:42 [scrapy] DEBUG: Redirecting (301) to <GET http://www.duiduilian.com/daquan/> from <GET http://www.duiduilian.com/daquan>
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/2945.html> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/27zi.html> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/daquan/> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com/daquan
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jingdian/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//fojiao/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
http://www.duiduilian.com//qita
http://www.duiduilian.com//jingdian
http://www.duiduilian.com//fojiao
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jijulian/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//mingzhu/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//hengpi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jushi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
http://www.duiduilian.com//jijulian
http://www.duiduilian.com//mingzhu
http://www.duiduilian.com//hengpi
http://www.duiduilian.com//jushi
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//geyan/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/tongpianpangbushoulian.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html> (referer: http://www.duiduilian.com)
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//geyan
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/chaizihezi.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/yinzilian.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//gushi
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//qita
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//qita
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/shihua/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//qita
http://www.duiduilian.com//zhishi/chuangzuo
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//qita/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//qita/)
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//qita
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//qita/)

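It keeps crawling the same pages over and over without end. But when I use the make_requests_from_url() I wrote myself, it only crawls the URLs listed in start_urls. The log is as follows: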

     /Users/myname/duilian>scrapy crawl duilian 
/Users/myname/duilian/duilian/spiders/duilian_spiders.py:4: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  import scrapy.dupefilter
http://www.duiduilian.com/chunlian/2945.html
2945.html
2016-12-13 23:30:09 [scrapy] INFO: Scrapy 1.2.2 started (bot: duilian)
2016-12-13 23:30:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'duilian.spiders', 'SPIDER_MODULES': ['duilian.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'duilian'}
2016-12-13 23:30:09 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-12-13 23:30:09 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-13 23:30:09 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-13 23:30:09 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-13 23:30:09 [scrapy] INFO: Spider opened
2016-12-13 23:30:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-13 23:30:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6053
2016-12-13 23:30:09 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/robots.txt> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Redirecting (301) to <GET http://www.duiduilian.com/daquan/> from <GET http://www.duiduilian.com/daquan>
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/27zi.html> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/2945.html> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Filtered offsite request to 'www.duiduilian.com': <GET http://www.duiduilian.com//chunlian/>
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/daquan/> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com> (referer: None)
http://www.duiduilian.com/daquan
2016-12-13 23:30:10 [scrapy] INFO: Closing spider (finished)
2016-12-13 23:30:10 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1355,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'downloader/response_bytes': 82342,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 5,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 12, 13, 15, 30, 10, 899891),
 'log_count/DEBUG': 8,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 175,
 'request_depth_max': 1,
 'response_received_count': 5,
 'scheduler/dequeued': 5,
 'scheduler/dequeued/memory': 5,
 'scheduler/enqueued': 5,
 'scheduler/enqueued/memory': 5,
 'start_time': datetime.datetime(2016, 12, 13, 15, 30, 9, 455876)}
2016-12-13 23:30:10 [scrapy] INFO: Spider closed (finished)
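The log above contains "Filtered offsite request to 'www.duiduilian.com'" and 'offsite/filtered': 175, while the spider declares allowed_domains = ["duilian.com"] and crawls www.duiduilian.com. One possible reading is that the offsite middleware lets requests with dont_filter=True through unchecked, and tests all others against allowed_domains. The sketch below approximates that host check with a regular expression of the same shape the middleware is understood to use. host_allowed is a hypothetical helper, not part of Scrapy, and "duiduilian.com" is an assumed alternative value used only for comparison.

```python
import re
from urllib.parse import urlparse


def host_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of,
    one of the allowed domains (hypothetical helper)."""
    domains = "|".join(re.escape(d) for d in allowed_domains)
    pattern = r"^(.*\.)?(%s)$" % domains
    host = urlparse(url).hostname or ""
    return re.match(pattern, host) is not None


url = "http://www.duiduilian.com//chunlian/"

# Value from the spider above: "www.duiduilian.com" is neither
# "duilian.com" itself nor a subdomain of it.
with_current = host_allowed(url, ["duilian.com"])

# Assumed alternative that matches the site actually being crawled.
with_site_domain = host_allowed(url, ["duiduilian.com"])

print(with_current, with_site_domain)
```

If this reading is right, it would account for both logs: with dont_filter=True nothing is filtered and the crawl repeats forever, and with dont_filter=False every followed link is dropped as offsite.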

Thanks for your help.

0 answers:

No answers yet