I'm new to Scrapy.
Scrapy's make_requests_from_url() function (shown below) sets dont_filter to True.
def make_requests_from_url(self, url):
    return Request(url, dont_filter=True)
I want to override it so that dont_filter is False.
def make_requests_from_url(self, url):
    return Request(url, callback=self.parse, dont_filter=False)
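For context, here is a rough sketch of what dont_filter is meant to control. It is a toy model, not Scrapy's code: ToyRequest and ToyScheduler are hypothetical names, and duplicates are detected here by comparing raw URLs, whereas Scrapy compares request fingerprints.

```python
class ToyRequest(object):
    """Hypothetical stand-in for scrapy.http.Request (URL and flag only)."""

    def __init__(self, url, dont_filter=False):
        self.url = url
        self.dont_filter = dont_filter


class ToyScheduler(object):
    """Hypothetical scheduler that drops requests it has already seen."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, request):
        # Only requests with dont_filter=False go through the duplicate check.
        if not request.dont_filter:
            if request.url in self.seen:
                return False
            self.seen.add(request.url)
        self.queue.append(request)
        return True


scheduler = ToyScheduler()
url = "http://www.duiduilian.com/daquan/"
first = scheduler.enqueue(ToyRequest(url))                     # new URL
second = scheduler.enqueue(ToyRequest(url))                    # duplicate
forced = scheduler.enqueue(ToyRequest(url, dont_filter=True))  # bypasses check
```

In this model the second request is dropped as a duplicate, while the one with dont_filter=True is queued again. That would match the endless re-crawling of the same pages in the first log below.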
But it doesn't work: the spider no longer crawls the site recursively and stops almost immediately.
Here is my full code:
# coding:utf-8
from scrapy.http import Request
import re
import scrapy.dupefilter

class DuilianSpider(scrapy.Spider):
    name = "duilian"
    allowed_domains = ["duilian.com"]
    start_urls = [
        "http://www.duiduilian.com/daquan",
        "http://www.duiduilian.com",
        "http://www.duiduilian.com/chunlian/2945.html",
        "http://www.duiduilian.com/chunlian/27zi.html"
    ]

    def parse(self, response):
        # print texts
        sel = response.selector
        re_url = response.url
        if re_url[-1] is '/':
            re_url = re_url[:-1]
            print re_url
        file = re.compile(r'/').split(re_url)[-1]
        # print file
        fp = open('/Users/myname/duilian/duilian/data/' + str(file), 'a')
        fp_record = open('/Users/myname/duilian/duilian/data/url', 'a')
        fp_record.write('\n' + re_url)
        texts_1 = sel.xpath('//div[@class="content_zw"]/text()').extract()
        for text in texts_1:
            fp.write(text.encode('utf-8'))
            fp_record.write(text.encode('utf-8'))
            fp.write('\n')
        texts_2 = sel.xpath('//div[@class="content_zw"]/div/text()').extract()
        for textp in texts_2:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
            fp.write('\n')
        texts_3 = sel.xpath('//div[@class="content_zw"]/p/text()').extract()
        for textp in texts_3:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
            fp.write('\n')
        texts_4 = sel.xpath('//div[@class="content_zw"]/p/font/text()').extract()
        for textp in texts_4:
            fp.write(textp.encode('utf-8'))
            fp_record.write(textp.encode('utf-8'))
            fp.write('\n')
        urls = response.xpath('//div[@class="main_box_right"]/div/a/@href').extract()
        fp.close()
        for url in urls:
            url1 = "http://www.duiduilian.com/" + str(url)
            # dui = DuilianSpider()
            # yield dui.make_requests_from_url(url1)
            yield self.make_requests_from_url(url1)

    # def make_requests_from_url(self, url):
    #     return Request(url, callback=self.parse, dont_filter=False)
When I use the built-in make_requests_from_url(), the log is:
/Users/myname/duilian>scrapy crawl duilian
/Users/myname/duilian/duilian/spiders/duilian_spiders.py:4: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
import scrapy.dupefilter
http://www.duiduilian.com/chunlian/2945.html
2945.html
2016-12-13 23:22:41 [scrapy] INFO: Scrapy 1.2.2 started (bot: duilian)
2016-12-13 23:22:41 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'duilian.spiders', 'SPIDER_MODULES': ['duilian.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'duilian'}
2016-12-13 23:22:42 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-13 23:22:42 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-13 23:22:42 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-13 23:22:42 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-13 23:22:42 [scrapy] INFO: Spider opened
2016-12-13 23:22:42 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-13 23:22:42 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6052
2016-12-13 23:22:42 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/robots.txt> (referer: None)
2016-12-13 23:22:42 [scrapy] DEBUG: Redirecting (301) to <GET http://www.duiduilian.com/daquan/> from <GET http://www.duiduilian.com/daquan>
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/2945.html> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/27zi.html> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/daquan/> (referer: None)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com/daquan
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jingdian/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:43 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//fojiao/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
http://www.duiduilian.com//qita
http://www.duiduilian.com//jingdian
http://www.duiduilian.com//fojiao
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jijulian/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//mingzhu/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//hengpi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//jushi/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
http://www.duiduilian.com//jijulian
http://www.duiduilian.com//mingzhu
http://www.duiduilian.com//hengpi
http://www.duiduilian.com//jushi
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//geyan/> (referer: http://www.duiduilian.com/chunlian/27zi.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/tongpianpangbushoulian.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html> (referer: http://www.duiduilian.com)
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//geyan
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/chaizihezi.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/zhishi/jiqiao/yinzilian.html> (referer: http://www.duiduilian.com)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//fojiao/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//jushi/)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com/zhishi/jiqiao/diezifuzilian.html)
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//gushi
2016-12-13 23:22:44 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:44 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/dlzz/)
http://www.duiduilian.com//gushi
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//qita
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/jiqiao/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//qita
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//zhishi/shihua/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//zhishi/shihua/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/shihua
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//zhishi/dlzz
http://www.duiduilian.com//qita
http://www.duiduilian.com//zhishi/chuangzuo
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//qita/)
http://www.duiduilian.com//gushi
2016-12-13 23:22:45 [scrapy] DEBUG: Ignoring response <404 http://www.duiduilian.com//zhishi/gjhl/>: HTTP status code is not handled or not allowed
http://www.duiduilian.com//zhishi/dlzz
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//gushi/> (referer: http://www.duiduilian.com//qita/)
http://www.duiduilian.com//zhishi/xiequ
http://www.duiduilian.com//zhishi/chuangzuo
http://www.duiduilian.com//zhishi/jiqiao
http://www.duiduilian.com//qita
http://www.duiduilian.com//zhishi/shihua
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/dlzz/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (404) <GET http://www.duiduilian.com//zhishi/gjhl/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/jiqiao/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//qita/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/shihua/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/chuangzuo/> (referer: http://www.duiduilian.com//qita/)
2016-12-13 23:22:45 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com//zhishi/xiequ/> (referer: http://www.duiduilian.com//qita/)
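With the built-in version, the spider keeps crawling the same pages over and over without end. With my own make_requests_from_url(), however, it only crawls the pages listed in start_urls and then stops. The log is: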
/Users/myname/duilian>scrapy crawl duilian
/Users/myname/duilian/duilian/spiders/duilian_spiders.py:4: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
import scrapy.dupefilter
http://www.duiduilian.com/chunlian/2945.html
2945.html
2016-12-13 23:30:09 [scrapy] INFO: Scrapy 1.2.2 started (bot: duilian)
2016-12-13 23:30:09 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'duilian.spiders', 'SPIDER_MODULES': ['duilian.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'duilian'}
2016-12-13 23:30:09 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-12-13 23:30:09 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-12-13 23:30:09 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-12-13 23:30:09 [scrapy] INFO: Enabled item pipelines:
[]
2016-12-13 23:30:09 [scrapy] INFO: Spider opened
2016-12-13 23:30:09 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-13 23:30:09 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6053
2016-12-13 23:30:09 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/robots.txt> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Redirecting (301) to <GET http://www.duiduilian.com/daquan/> from <GET http://www.duiduilian.com/daquan>
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/27zi.html> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/chunlian/2945.html> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Filtered offsite request to 'www.duiduilian.com': <GET http://www.duiduilian.com//chunlian/>
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com/daquan/> (referer: None)
2016-12-13 23:30:10 [scrapy] DEBUG: Crawled (200) <GET http://www.duiduilian.com> (referer: None)
http://www.duiduilian.com/daquan
2016-12-13 23:30:10 [scrapy] INFO: Closing spider (finished)
2016-12-13 23:30:10 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1355,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'downloader/response_bytes': 82342,
'downloader/response_count': 6,
'downloader/response_status_count/200': 5,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 12, 13, 15, 30, 10, 899891),
'log_count/DEBUG': 8,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 175,
'request_depth_max': 1,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2016, 12, 13, 15, 30, 9, 455876)}
2016-12-13 23:30:10 [scrapy] INFO: Spider closed (finished)
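The second log reports "Filtered offsite request to 'www.duiduilian.com'" and 'offsite/filtered': 175, which points at the offsite middleware rather than the duplicate filter. The sketch below is a simplified stand-in for that host check. It assumes the middleware matches the request host against allowed_domains and their subdomains, and that requests with dont_filter=True skip the check. is_allowed is a hypothetical helper, not a Scrapy API.

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used by the spider above


def is_allowed(url, allowed_domains, dont_filter=False):
    """Return True if a request to `url` would pass the toy offsite check."""
    if dont_filter:
        # Assumption: dont_filter=True also bypasses the offsite check.
        return True
    # Match each allowed domain itself or any subdomain of it.
    pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
    host = urlparse(url).hostname or ''
    return re.match(pattern, host) is not None


url = "http://www.duiduilian.com//chunlian/"  # URL taken from the log above
with_question_domains = is_allowed(url, ["duilian.com"])
with_dont_filter = is_allowed(url, ["duilian.com"], dont_filter=True)
with_site_domain = is_allowed(url, ["duiduilian.com"])
```

Under these assumptions, the URL from the log is rejected for allowed_domains = ["duilian.com"] unless dont_filter=True, and accepted once allowed_domains contains "duiduilian.com". That would be consistent with both logs, but I have not confirmed it against the Scrapy source.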
Thanks for your help.