Unable to scrape

Time: 2013-10-22 18:39:20

Tags: python web-scraping scrapy

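I have to parse some data from a website, and to get the data I have to log in to the site first. I wrote a spider in Scrapy that logs in to the website.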

from scrapy import log
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest

class LoginSpider(BaseSpider):
    name = 'myhabit'
    start_urls = ['https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T']

    def parse(self, response):
        # Fill in and submit the login form found on the sign-in page.
        return [FormRequest.from_response(response,
                    formdata={'E-MAIL:': 'subinthattaparambil@gmail.com', 'PASSWORD:': 'XXXXXXX'},
                    callback=self.after_login)]

    def after_login(self, response):
        # Check that login succeeded before going on.
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            self.log("Login success")
        return
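
Editor's aside: the keys passed to formdata are matched against the name attributes of the form's input elements, so it can help to list those names before filling them in. Below is a minimal sketch, assuming Python 3 and only the standard library. The helper names (InputNameCollector, input_names) and the sample HTML are made up for illustration; the field names of the real sign-in page are not shown in the question.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Collect the name attribute of every <input> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs for the tag.
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.append(name)


def input_names(html):
    """Return the names of all named <input> elements in the given HTML."""
    collector = InputNameCollector()
    collector.feed(html)
    return collector.names


# Made-up form for illustration only; not the real sign-in page.
SAMPLE_FORM = """
<form action="/signin" method="post">
  <input type="text" name="email">
  <input type="password" name="password">
  <input type="submit" value="Sign in">
</form>
"""

print(input_names(SAMPLE_FORM))
```

Inputs without a name attribute, such as the submit button in the sample, are skipped, since they cannot be addressed by a formdata key.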

When I run the spider, I get the following error:

zoomcar@zoomcar-1:~/code/python/myhabit/myhabit/spiders$ scrapy crawl myhabit
2013-10-22 23:49:47+0530 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: myhabit)
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled item pipelines: 
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-10-22 23:49:47+0530 [myhabit] INFO: Spider opened
2013-10-22 23:49:49+0530 [myhabit] DEBUG: Crawled (200) <GET https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T> (referer: None)
2013-10-22 23:49:49+0530 [myhabit] ERROR: Spider error processing <https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T> (referer: <None>)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/media/d_drive/code/python/myhabit/myhabit/spiders/myhabit_spider.py", line 11, in parse
    callback=self.after_login)]
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/form.py", line 44, in from_response
    encoding=encoding, backwards_compat=False)
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 1085, in ParseFile
    return _ParseFileEx(file, base_uri, *args, **kwds)[1:]
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 1105, in _ParseFileEx
    fp.feed(data)
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 877, in feed
    raise ParseError(exc)
scrapy.xlib.ClientForm.ParseError: <ParseError instance at 0x2387e10 with str error:
 Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/reflect.py", line 546, in _safeFormat
    return formatter(o)
  File "/usr/lib/python2.7/HTMLParser.py", line 64, in __str__
    result = self.msg
AttributeError: 'ParseError' object has no attribute 'msg'
>

1 answer:

Answer 0 (score: 1)

The problem was resolved by upgrading Scrapy from 0.12 to 0.18.4.
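
Scrapy prints its version in the first log line at startup ("Scrapy 0.12.0.2546 started" in the log above), so a quick way to tell whether an installation predates the fixed release is to compare that string with 0.18.4. Below is a minimal sketch, assuming purely numeric dotted version strings; the helper names version_tuple and is_older are hypothetical and not part of Scrapy.

```python
def version_tuple(version):
    """Turn a dotted version string such as '0.12.0.2546' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_older(installed, required):
    """Return True if the installed version is lower than the required one."""
    return version_tuple(installed) < version_tuple(required)


# Version taken from the startup log line in the question.
print(is_older("0.12.0.2546", "0.18.4"))  # True
print(is_older("0.18.4", "0.18.4"))       # False
```

Versions with non-numeric parts (for example a release-candidate suffix) would need a more careful parser than this sketch provides.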