I need to parse some data from a website. To get at the data I first have to log in to the site, so I wrote a spider in Scrapy that logs in:
from scrapy.spider import BaseSpider
from scrapy.http import FormRequest
from scrapy import log  # provides log.ERROR used in after_login


class LoginSpider(BaseSpider):
    name = 'myhabit'
    start_urls = ['https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T']

    def parse(self, response):
        # fill in and submit the sign-in form found on the landing page
        return [FormRequest.from_response(response,
            formdata={'E-MAIL:': 'subinthattaparambil@gmail.com', 'PASSWORD:': 'XXXXXXX'},
            callback=self.after_login)]

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.log("Login failed", level=log.ERROR)
        else:
            self.log("Login success")
        return
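Side note, independent of the crash below: the formdata keys must match the name attributes of the form's input elements, and 'E-MAIL:' / 'PASSWORD:' look like the on-screen labels rather than input names. A minimal sketch for listing the real input names from scrapy shell on this era of Scrapy, assuming the sign-in page is a plain HTML form:

from scrapy.selector import HtmlXPathSelector

# run inside `scrapy shell <signin-url>`: print each input's name and type
hxs = HtmlXPathSelector(response)
for inp in hxs.select('//form//input'):
    print inp.select('@name').extract(), inp.select('@type').extract()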
When I run the code I get this error:
zoomcar@zoomcar-1:~/code/python/myhabit/myhabit/spiders$ scrapy crawl myhabit
2013-10-22 23:49:47+0530 [scrapy] INFO: Scrapy 0.12.0.2546 started (bot: myhabit)
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-10-22 23:49:47+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-10-22 23:49:47+0530 [myhabit] INFO: Spider opened
2013-10-22 23:49:49+0530 [myhabit] DEBUG: Crawled (200) <GET https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T> (referer: None)
2013-10-22 23:49:49+0530 [myhabit] ERROR: Spider error processing <https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&pageId=quarterdeck&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&clientContext=183-6909322-8613518&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&marketPlaceId=A39WRC2IB8YGEK&openid.assoc_handle=quarterdeck&openid.return_to=https%3A%2F%2Fwww.myhabit.com%2Fsignin&&siteState=http%3A%2F%2Fwww.myhabit.com%2Fhomepage%3Fhash%3Dpage%253Db%2526dept%253Dwomen%2526sale%253DA1VZ6QH7N57X0T%2526ref%253Dqd_nav_women_cur_0_A1VZ6QH7N57X0T> (referer: <None>)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 1178, in mainLoop
    self.runUntilCurrent()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/base.py", line 800, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
    self._startRunCallbacks(result)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/media/d_drive/code/python/myhabit/myhabit/spiders/myhabit_spider.py", line 11, in parse
    callback=self.after_login)]
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/form.py", line 44, in from_response
    encoding=encoding, backwards_compat=False)
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 1085, in ParseFile
    return _ParseFileEx(file, base_uri, *args, **kwds)[1:]
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 1105, in _ParseFileEx
    fp.feed(data)
  File "/usr/lib/python2.7/dist-packages/scrapy/xlib/ClientForm.py", line 877, in feed
    raise ParseError(exc)
scrapy.xlib.ClientForm.ParseError: <ParseError instance at 0x2387e10 with str error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/reflect.py", line 546, in _safeFormat
    return formatter(o)
  File "/usr/lib/python2.7/HTMLParser.py", line 64, in __str__
    result = self.msg
AttributeError: 'ParseError' object has no attribute 'msg'
>
Answer (score 1):
Solved the problem by upgrading Scrapy from 0.12 to 0.18.4.
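For context: the traceback ends inside scrapy/xlib/ClientForm.py, the form parser bundled with old Scrapy releases; later releases dropped the bundled ClientForm module in favor of lxml-based form parsing, which copes with this sign-in page, so the same spider code runs unchanged after the upgrade. A sketch of the upgrade and a re-run (the pin to 0.18.4 follows the answer above; a plain pip install -U Scrapy would also work):

pip install --upgrade Scrapy==0.18.4   # version per the answer above
scrapy version                         # confirm the new version is picked up
scrapy crawl myhabit                   # re-run the login spider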