I am trying to scrape data from Twitter, but I am having some problems. I think my spider cannot log in to Twitter, but I am not sure. This is my exact code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from Clause.items import ClauseItem

class Clause(CrawlSpider):
    name = "Clause"
    allowed_domains = ['twitter.com']
    login_url = 'http://twitter.com/login'

    rules = (
        Rule(SgmlLinkExtractor(allow=('twitter.com.+',)), callback='Myparse', follow=True),
    )

    def start_requests(self):
        print "\n\n\n start_requests\n\n\n"
        yield Request(url=self.login_url,
                      callback=self.login,
                      dont_filter=True)

    def login(self, response):
        print "\n\n\n login is running \n\n\n"
        # from_response submits the first <form> on the page by default
        return FormRequest.from_response(response,
            formdata={'session[username_or_email]': 's.shahryar75@gmail.com',
                      'session[password]': '********'},
            callback=self.check_login)

    def check_login(self, response):
        print "\n\n\n check login is running\n\n\n"
        if "SlgShahryar" in response.body:
            print "\n\n\n ************successfully logged in************\n\n\n "
            return Request(url='http://twitter.com/SlgShahryar',
                           callback=self.Myparse, dont_filter=True)
        else:
            print "\n\n\n __________authentication failed :(((( ___________ \n\n\n"
            return

    def Myparse(self, response):
        print "***************My parse is running!*********************"
        hxs = HtmlXPathSelector(response)
        tweets = hxs.select('//li')
        items = list()
        for tweet in tweets:
            item = ClauseItem()
            item['Text'] = tweet.select('.//p/text()').extract()
            item['writter'] = tweet.select('@data-name').extract()
            items.append(item)
        return items
My program runs start_requests() and then login(), but check_login() never runs and the spider exits; a guess at why, with a sketch, follows the log below. This is the output I get:
C:\Users\Shahryar\Desktop\FootBallFanFinder\crawling\Clause>scrapy crawl Clause -o scraped_data4.csv -t csv
2015-03-20 11:10:55+0330 [scrapy] INFO: Scrapy 0.24.5 started (bot: Clause)
2015-03-20 11:10:55+0330 [scrapy] INFO: Optional features available: ssl, http11
2015-03-20 11:10:55+0330 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Clause.spiders', 'FEED_URI': 'scraped_data4.csv', 'DEPTH_LIMIT': 50, 'SPIDER_MODULES': ['Clause.spiders'], 'BOT_NAME': 'Clause', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.8}
2015-03-20 11:11:05+0330 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-20 11:11:45+0330 [scrapy] INFO: Enabled item pipelines:
2015-03-20 11:11:45+0330 [Clause] INFO: Spider opened
2015-03-20 11:11:45+0330 [Clause] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
start_requests
2015-03-20 11:11:46+0330 [Clause] DEBUG: Redirecting (301) to <GET https://www.twitter.com/login> from <GET http://www.twitter.com/login>
2015-03-20 11:11:47+0330 [Clause] DEBUG: Redirecting (301) to <GET https://twitter.com/login> from <GET https://www.twitter.com/login>
2015-03-20 11:11:49+0330 [Clause] DEBUG: Crawled (200) <GET https://twitter.com/login> (referer: None)
login is running
2015-03-20 11:11:50+0330 [Clause] DEBUG: Crawled (404) <POST https://twitter.com/sessions/change_locale> (referer: https://twitter.com/login)
2015-03-20 11:11:50+0330 [Clause] DEBUG: Ignoring response <404 https://twitter.com/sessions/change_locale>: HTTP status code is not handled or not allowed
2015-03-20 11:11:50+0330 [Clause] INFO: Closing spider (finished)
2015-03-20 11:11:50+0330 [Clause] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1572,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 15533,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/301': 2,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 3, 20, 7, 41, 50, 205000),
'log_count/DEBUG': 7,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2015, 3, 20, 7, 41, 45, 97000)}
2015-03-20 11:11:50+0330 [Clause] INFO: Spider closed (finished)
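The 404 on the POST to /sessions/change_locale makes me suspect that FormRequest.from_response is submitting the wrong form: by default it picks the first <form> on the page, and the login page apparently has a locale-switching form before the actual login form. A minimal sketch of what I would try, assuming the login form is the second form on the page (formnumber=1 is a guess that would need to be confirmed against the page source):

def login(self, response):
    print "\n\n\n login is running \n\n\n"
    # formnumber is 0-based; 1 is a hypothetical index for the real login
    # form -- the correct value has to be checked against the page source.
    return FormRequest.from_response(response,
        formnumber=1,
        formdata={'session[username_or_email]': 's.shahryar75@gmail.com',
                  'session[password]': '********'},
        callback=self.check_login)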
I am not sure about the parts I wrote in the login function: session[username_or_email] and session[password]. Do you know what I should write there? Is it correct? (Based on examples I have seen, I used the name attributes of those fields from the login page.) Could you please help me? Thanks a lot in advance.
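The only way I know to double-check those field names is to dump every form on the login page together with its input names, for example from the Scrapy shell. A minimal sketch (the response object is provided by the shell; HtmlXPathSelector is used to match the spider's code):

# Run: scrapy shell http://twitter.com/login
# then paste this to list each form's action URL and input names:
from scrapy.selector import HtmlXPathSelector

hxs = HtmlXPathSelector(response)  # 'response' is provided by the shell
for form in hxs.select('//form'):
    print form.select('@action').extract()         # where the form posts to
    print form.select('.//input/@name').extract()  # names to use as formdata keys

The keys in formdata have to match the name attributes of the inputs inside the one form that from_response actually selects, which is why I want to see all the forms listed.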