I want to scrape this site: https://www.courts.mo.gov/casenet/cases/searchCases.do?searchType=name
Here is my code:
```python
import scrapy
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from Challenge6.items import Challenge6Item


class CasenetSpider(scrapy.Spider):
    name = "casenet"

    def start_requests(self):
        start_urls = [
            "https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name"
        ]
        Rules = (
            Rule(
                LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
                callback="parse",
                follow=True,
            ),
        )
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        data = {
            inputVO.lastName: 'smith',
            inputVO.firstName: 'fred',
            inputVO.yearFiled: 2010,
        }
        yield scrapy.FormRequest(
            url="https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name",
            formdata=data,
            callback=self.parse_pages,
        )
        casenet_row = Selector(response).xpath('//tr[@align="left"]')

    def parse_pages(self, response):
        for row in casenet_row:
            if "Part Name" not in row or "Address on File" not in row:
                item = Challenge6Item()
                item['name'] = quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').extract()
                yield item
```
However, when I run it, I get this error:
```
/var/www/html/challenge6/Challenge6/Challenge6/spiders/casenet_crawler.py:3: ScrapyDeprecationWarning: Module `scrapy.contrib.spiders` is deprecated, use `scrapy.spiders` instead
  from scrapy.contrib.spiders import Rule
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: Challenge6)
2018-11-14 17:47:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Linux-4.4.0-1066-aws-x86_64-with-Ubuntu-16.04-xenial
2018-11-14 17:47:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Challenge6.spiders', 'SPIDER_MODULES': ['Challenge6.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'Challenge6'}
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-14 17:47:55 [scrapy.middleware] INFO: Enabled item pipelines: []
2018-11-14 17:47:55 [scrapy.core.engine] INFO: Spider opened
2018-11-14 17:47:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-14 17:47:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 1 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/robots.txt> (failed 2 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/robots.txt> (failed 3 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET https://www.courts.mo.gov/robots.txt>: []
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived: []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 1 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 2 times): []
2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name> (failed 3 times): []
2018-11-14 17:47:56 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
ResponseNeverReceived: []
2018-11-14 17:47:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-14 17:47:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1455,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 11, 14, 23, 47, 56, 195277),
 'log_count/DEBUG': 7,
 'log_count/ERROR': 2,
 'log_count/INFO': 7,
 'memusage/max': 52514816,
 'memusage/startup': 52514816,
 'retry/count': 4,
 'retry/max_reached': 2,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 11, 14, 23, 47, 55, 36009)}
```
What am I doing wrong?