如何使用Scrapy formRequest抓取casenet?

时间:2018-11-15 00:14:14

标签: python-2.7 web-scraping scrapy scrapy-spider

我想抓取这个网站:https://www.courts.mo.gov/casenet/cases/searchCases.do?searchType=name

这是我的代码:

import scrapy
from scrapy.selector import Selector
from scrapy.contrib.spiders import Rule
from scrapy.linkextractors import LinkExtractor
from Challenge6.items import Challenge6Item

class CasenetSpider(scrapy.Spider):
    name = "casenet"
    def start_requests(self):
        start_urls = [
            "https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name"
        ]
        Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)), callback="parse", follow= True),)
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse )

    def parse(self, response):
        data = {
            inputVO.lastName: 'smith',
            inputVO.firstName: 'fred',
            inputVO.yearFiled: 2010,
        }
        yield scrapy.FormRequest(url="https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name", formdata=data, callback=self.parse_pages)
        casenet_row = Selector(response).xpath('//tr[@align="left"]')

    def parse_pages(self, response):
        for row in casenet_row:
            if "Part Name" not in row or "Address on File" not in row:
                item = Challenge6Item()
                item['name'] = quote.xpath('div[@class="tags"]/a[@class="tag"]/text()').extract()
                yield item

但是,我遇到此错误:

  

/var/www/html/challenge6/Challenge6/Challenge6/spiders/casenet_crawler.py:3:   ScrapyDeprecationWarning:模块scrapy.contrib.spiders是   不推荐使用,而从scrapy.contrib.spiders中使用scrapy.spiders   导入规则2018-11-14 17:47:54 [scrapy.utils.log]信息:Scrapy 1.5.1   已开始(bot:Challenge6)2018-11-14 17:47:54 [scrapy.utils.log]信息:   版本:lxml 4.2.5.0,libxml2 2.9.8,cssselect 1.0.3,parsel 1.5.1,   w3lib 1.19.0,Twisted 18.9.0,Python 2.7.12(默认值,2017年12月4日,   14:50:18)-[GCC 5.4.0 20160609],pyOpenSSL 18.0.0(OpenSSL 1.1.0i 14   2018年8月),密码学2.3.1,平台   Linux-4.4.0-1066-aws-x86_64-with-Ubuntu-16.04-xenial 2018-11-14   17:47:54 [scrapy.crawler]信息:覆盖的设置:   {'NEWSPIDER_MODULE':'Challenge6.spiders','SPIDER_MODULES':   ['Challenge6.spiders'],'ROBOTSTXT_OBEY':对,'BOT_NAME':   'Challenge6'} 2018-11-14 17:47:55 [scrapy.middleware]信息:已启用   扩展名:['scrapy.extensions.memusage.MemoryUsage',   'scrapy.extensions.logstats.LogStats',   'scrapy.extensions.telnet.TelnetConsole',   'scrapy.extensions.corestats.CoreStats'] 2018-11-14 17:47:55   [scrapy.middleware] INFO:已启用下载程序中间件:   ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',   'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',   “ scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware”,   'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',   'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',   'scrapy.downloadermiddlewares.retry.RetryMiddleware',   'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',   “ scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware”,   'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',   'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',   'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',   'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2018-11-14   17:47:55 [scrapy.middleware]信息:已启用蜘蛛中间件:   ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',   'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',   'scrapy.spidermiddlewares.referer.RefererMiddleware',   'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',   'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2018-11-14 17:47:55   [scrapy.middleware] INFO:启用的项目管道:[] 2018-11-14   17:47:55 [scrapy.core.engine]信息:蜘蛛开放2018-11-14 17:47:55   [scrapy.extensions.logstats]信息:抓取了0页(以0页/分钟的速度),   刮擦0件(以0件/分钟的速度)2018-11-14 17:47:55   [scrapy.extensions.telnet]调试:Telnet控制台正在监听   127.0.0.1:6023 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]调试:重试https://www.courts.mo.gov/robots.txt>(失败1次):   [] 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]   调试:重试https://www.courts.mo.gov/robots.txt>(失败2   次):[] 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]   调试:放弃重试https://www.courts.mo.gov/robots.txt>   (失败3次):[] 2018-11-14 17:47:55   [scrapy.downloadermiddlewares.robotstxt]错误:下载https://www.courts.mo.gov/robots.txt>时出错:   []追溯(最近一次通话):文件   “ /usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py”,   第43行,在process_request中       defer.returnValue((yield download_func(request = request,spider = spider)))ResponseNever从未收到:   [] 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]   调试:重试https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>   (失败1次):[] 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]   调试:重试https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>   (失败2次):[] 2018-11-14 17:47:55 [scrapy.downloadermiddlewares.retry]   调试:放弃重试https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>   (失败3次):[] 2018-11-14 17:47:56 [scrapy.core.scraper]错误:错误   正在下载https://www.courts.mo.gov/casenet/cases/nameSearch.do?searchType=name>   追溯(最近一次通话):文件   “ /usr/local/lib/python2.7/dist-packages/scrapy/core/downloader/middleware.py”,   第43行,在process_request中       defer.returnValue((yield download_func(request = request,spider = spider)))ResponseNever从未收到:   [] 2018-11-14 17:47:56 [scrapy.core.engine]信息:关闭   蜘蛛(完成)2018-11-14 17:47:56 [scrapy.statscollectors]信息:   弃用Scrapy统计信息:{'downloader / exception_count':6,   'downloader / exception_type_count / twisted.web._newclient.ResponseNeverReceived':   6,'downloader / request_bytes':1455,'downloader / request_count':6,   'downloader / request_method_count / GET':6,'finish_reason':   'finished','finish_time':datetime.datetime(2018,11,14,23,47,   56,195277),'log_count / DEBUG':7,'log_count / ERROR':2,   'log_count / INFO':7,'memusage / max':52514816,'memusage / startup':   52514816,'重试/计数':4,'重试/最大到达':2,   'retry / reason_count / twisted.web._newclient.ResponseNeverReceived':4,   '计划程序/已出队':3,'计划程序/已出队/内存':3,   “调度程序/排队”:3,“调度程序/排队/内存”:3,   'start_time':datetime.datetime(2018,11,14,23,47,55,36009)}

我在做什么错了?

0 个答案:

没有答案