Scrapy not crawling

Posted: 2015-07-28 09:58:27

Tags: python python-2.7 xpath web-crawler scrapy

My previous question is here: last question

I have now done my best to think this through and improve my spider's structure. However, for some reason, the spider still will not start crawling.

I have also checked the XPath expressions, and they work (in the Chrome console).
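
For reference, the same expressions can also be checked against the HTML that Scrapy actually downloads (which can differ from what Chrome renders, since Scrapy does not execute JavaScript), for example inside scrapy shell on the start URL. A rough sketch, assuming the page is reachable from the shell session (the real site needs the login cookie, so this is only illustrative):

    # inside scrapy shell, `response` holds the downloaded page
    response.xpath("*//*[@class='qlist']/li[not(contains(@role,'menuitem'))]").extract()
    response.xpath("*//*[@class='q-folderItem']/h4").extract()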

I join the URL with the href because the href always returns only the parameter part. I attached a sample of the link format in my previous question. (I hope this post is not overly long.)
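
As an aside, a common alternative to slicing the URL manually is the standard library's urljoin (or response.urljoin() on Scrapy 1.0+). A minimal sketch, with placeholder URLs since the real href format is only shown in the previous question:

    # Python 2.7 (matching the question tags); both URLs below are placeholders.
    from urlparse import urljoin

    base = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument'
    href = 'h_Toc/ade682e34fc59d274825770b0037d278/?OpenDocument'  # hypothetical href
    print urljoin(base, href)
    # inside a Scrapy 1.0+ callback, response.urljoin(href) does the same thing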

My spider:

    import scrapy
    from scrapy.http import Request, FormRequest

    # item classes are assumed to live in the project's items.py
    from crawlKMSS.items import CrawlkmssItem, CrawlkmssFolder, CrawlkmssFile


    class kmssSpider(scrapy.Spider):
        name = 'kmss'
        # start_requests() is overridden below, so a custom attribute is used
        # instead of the standard start_urls list.
        start_url = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#{unid=ADE682E34FC59D274825770B0037D278}'
        login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
        allowed_domains = ["kmssqkr.hksarg"]

        def start_requests(self):
            # Request the login page first; dont_filter so it is never deduplicated.
            yield Request(url=self.login_page, callback=self.login, dont_filter=True)

        def login(self, response):
            return FormRequest.from_response(
                response,
                formdata={'user': 'username', 'password': 'pw'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if 'Welcome' in response.body:
                self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
                # No callback given, so the response is handled by parse().
                yield Request(url=self.start_url,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'})
            else:
                self.log("\n\n You are not logged in \n\n ")

        def parse(self, response):
            listattheleft = response.xpath("*//*[@class='qlist']/li[not(contains(@role,'menuitem'))]")
            anyfolder = response.xpath("*//*[@class='q-folderItem']/h4")
            anyfile = response.xpath("*//*[@class='q-otherItem']/h4")

            for each_tab in listattheleft:
                item = CrawlkmssItem()
                href = each_tab.xpath('a/@href').extract_first() or ''
                item['url'] = href
                item['title'] = each_tab.xpath('a/text()').extract_first()
                # hrefs only carry the parameter part, so rebuild the full URL
                # by replacing everything after '#' in the current URL.
                if 'unid' not in href:
                    locatetheroom = href.find('PageLibrary')
                    item['room'] = href[locatetheroom:]
                    locatethestart = response.url.find('#', 0)
                    full_url = response.url[:locatethestart] + href
                    yield Request(url=full_url,
                                  cookies={'LtpaToken2': 'jHxHvqs+NeT...'})
                yield item

            for folder in anyfolder:
                folderparameter = folder.xpath('a/@href').extract_first()
                locatethestart = response.url.find('#', 0)
                folder_url = response.url[:locatethestart] + folderparameter
                yield Request(url=folder_url, callback=self.parse_folder,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

            for each_file in anyfile:
                fileparameter = each_file.xpath('a/@href').extract_first()
                locatethestart = response.url.find('#', 0)
                file_url = response.url[:locatethestart] + fileparameter
                yield Request(url=file_url, callback=self.parse_file,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

        def parse_folder(self, response):
            findfolder = response.xpath("//div[@class='lotusHeader']")
            folderitem = CrawlkmssFolder()
            folderitem['foldername'] = findfolder.xpath('h1/span/span/text()').extract()
            folderitem['url'] = response.url[response.url.find("unid=") + 5:]
            yield folderitem

        def parse_file(self, response):
            findfile = response.xpath("//div[@class='lotusContent']")
            fileitem = CrawlkmssFile()
            fileitem['filename'] = findfile.xpath('a/text()').extract()
            fileitem['title'] = findfile.xpath(".//div[@class='qkrTitle']/span/@title").extract()
            fileitem['author'] = findfile.xpath(".//div[@class='lotusMeta']/span[3]/span/text()").extract()
            yield fileitem

The information I intend to scrape:

The left-hand sidebar:

(screenshot)

A folder:

(screenshot)

The log:

c:\Users\~\crawlKMSS>scrapy crawl kmss
2015-07-28 17:54:59 [scrapy] INFO: Scrapy 1.0.1 started (bot: crawlKMSS)
2015-07-28 17:54:59 [scrapy] INFO: Optional features available: ssl, http11, boto
2015-07-28 17:54:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlKMSS.spiders', 'SPIDER_MODULES': ['crawlKMSS.spiders'], 'BOT_NAME': 'crawlKMSS'}
2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.

2015-07-28 17:54:59 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-28 17:54:59 [boto] DEBUG: Retrieving credentials from metadata server.
2015-07-28 17:55:00 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2015-07-28 17:55:00 [boto] ERROR: Unable to read instance data, giving up
2015-07-28 17:55:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-28 17:55:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-28 17:55:01 [scrapy] INFO: Enabled item pipelines: 
2015-07-28 17:55:01 [scrapy] INFO: Spider opened
2015-07-28 17:55:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-28 17:55:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-28 17:55:05 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None)
2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr..hksarg/names.nsf?Login> (referer: https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login)
2015-07-28 17:55:10 [kmss] DEBUG: 



 Successfuly Logged in 



2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#%7Bunid=ADE682E34FC59D274825770B0037D278%7D> (referer: https://kmssqkr.hksarg/names.nsf?Login)
2015-07-28 17:55:10 [scrapy] INFO: Closing spider (finished)
2015-07-28 17:55:10 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1636,

I would appreciate any help!

2 Answers:

Answer 0 (score: 1)

I think you are over-complicating this: why inherit from scrapy.Spider and do all the heavy lifting yourself when you have Scrapy's CrawlSpider? A Spider is typically used to scrape a list of pages, whereas a CrawlSpider is meant for crawling whole websites.

From the Scrapy documentation on CrawlSpider:

"This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules."
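
A minimal CrawlSpider sketch along those lines (the allow pattern, item fields, and spider name are made up for illustration, and the login/cookie handling from the question would still have to be added on top):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class KmssCrawlSpider(CrawlSpider):
        name = 'kmss_crawl'
        allowed_domains = ['kmssqkr.hksarg']
        start_urls = ['https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf']

        # Follow every link whose URL matches the (example) pattern and pass
        # each matching response to parse_item; follow=True keeps extracting
        # links from those pages as well.
        rules = (
            Rule(LinkExtractor(allow=r'LotusQuickr'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            yield {
                'url': response.url,
                'title': response.xpath('//title/text()').extract_first(),
            }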

Answer 1 (score: 1)

There is a warning in your log, and your traceback shows an error occurring when an HTTP connection is opened:

"2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from https://pypi.python.org/pypi/service_identity and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected."
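
(Following the warning's own suggestion, installing the module, e.g. with pip install service_identity, together with a recent pyOpenSSL should make that warning go away.)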