CrawlSpider stops before running the function it is supposed to run

Time: 2015-03-20 08:37:21

Tags: python python-2.7 web-scraping scrapy

I am trying to scrape data from Twitter, but I am having some trouble: I think my spider fails to log in to Twitter, though I am not sure. Here is my exact code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from scrapy.selector import HtmlXPathSelector
from Clause.items import ClauseItem


class Clause(CrawlSpider):
    name = "Clause"
    allowed_domains = ['twitter.com']
    login_url = 'http://twitter.com/login'  # a plain string, not a list: Request(url=...) expects a string

    rules = (  # CrawlSpider looks for an attribute named "rules", not "Rules"
        Rule(SgmlLinkExtractor(allow=(r'twitter\.com.+',)), callback='Myparse', follow=True),
    )

    def start_requests(self):
        print "\n\n\n start_requests\n\n\n"
        yield Request(url=self.login_url,
                      callback=self.login,
                      dont_filter=True)

    def login(self, response):
        print "\n\n\n login is running \n\n\n"
        return FormRequest.from_response(
            response,
            formdata={'session[username_or_email]': 's.shahryar75@gmail.com',
                      'session[password]': '********'},
            callback=self.check_login)

    def check_login(self, response):
        print "\n\n\n    check login is running\n\n\n"
        if "SlgShahryar" in response.body:
            print "\n\n\n ************successfully logged in************\n\n\n "
            return Request(url='http://twitter.com/SlgShahryar',
                           callback=self.Myparse,  # callback must be the method itself, not the string 'Myparse'
                           dont_filter=True)
        else:
            print "\n\n\n __________authentication failed :(((( ___________ \n\n\n"
            return

    def Myparse(self, response):
        print "***************My parse is running!*********************"
        hxs = HtmlXPathSelector(response)
        tweets = hxs.select('//li')
        items = []
        for tweet in tweets:
            item = ClauseItem()
            item['Text'] = tweet.select('.//p/text()').extract()    # select relative to each <li>, not the whole page
            item['writter'] = tweet.select('@data-name').extract()
            items.append(item)
        return items
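For the spider above to run, ClauseItem has to declare the two fields it assigns. A minimal sketch of the item definition, assuming the standard Scrapy project layout (the Clause/items.py path is my assumption):

# Clause/items.py -- minimal item definition assumed by the spider above
from scrapy.item import Item, Field

class ClauseItem(Item):
    Text = Field()     # tweet text extracted in Myparse
    writter = Field()  # tweet author; field name kept exactly as the spider spells it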

My program runs start_requests() and then login(), but check_login() never runs and the spider exits. This is the output I get:

C:\Users\Shahryar\Desktop\FootBallFanFinder\crawling\Clause>scrapy crawl Clause -o scraped_data4.csv -t csv
2015-03-20 11:10:55+0330 [scrapy] INFO: Scrapy 0.24.5 started (bot: Clause)
2015-03-20 11:10:55+0330 [scrapy] INFO: Optional features available: ssl, http11
2015-03-20 11:10:55+0330 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'Clause.spiders', 'FEED_URI': 'scraped_data4.csv', 'DEPTH_LIMIT': 50, 'SPIDER_MODULES': ['Clause.spiders'], 'BOT_NAME': 'Clause', 'FEED_FORMAT': 'csv', 'DOWNLOAD_DELAY': 0.8}
2015-03-20 11:11:05+0330 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-03-20 11:11:44+0330 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-03-20 11:11:45+0330 [scrapy] INFO: Enabled item pipelines:
2015-03-20 11:11:45+0330 [Clause] INFO: Spider opened
2015-03-20 11:11:45+0330 [Clause] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-03-20 11:11:45+0330 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080

 start_requests

2015-03-20 11:11:46+0330 [Clause] DEBUG: Redirecting (301) to <GET https://www.twitter.com/login> from <GET http://www.twitter.com/login>
2015-03-20 11:11:47+0330 [Clause] DEBUG: Redirecting (301) to <GET https://twitter.com/login> from <GET https://www.twitter.com/login>
2015-03-20 11:11:49+0330 [Clause] DEBUG: Crawled (200) <GET https://twitter.com/login> (referer: None)

 login is running

2015-03-20 11:11:50+0330 [Clause] DEBUG: Crawled (404) <POST https://twitter.com/sessions/change_locale> (referer: https://twitter.com/login)
2015-03-20 11:11:50+0330 [Clause] DEBUG: Ignoring response <404 https://twitter.com/sessions/change_locale>: HTTP status code is not handled or not allowed
2015-03-20 11:11:50+0330 [Clause] INFO: Closing spider (finished)
2015-03-20 11:11:50+0330 [Clause] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1572,
         'downloader/request_count': 4,
         'downloader/request_method_count/GET': 3,
         'downloader/request_method_count/POST': 1,
         'downloader/response_bytes': 15533,
         'downloader/response_count': 4,
         'downloader/response_status_count/200': 1,
         'downloader/response_status_count/301': 2,
         'downloader/response_status_count/404': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 3, 20, 7, 41, 50, 205000),
         'log_count/DEBUG': 7,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 4,
         'scheduler/dequeued/memory': 4,
         'scheduler/enqueued': 4,
         'scheduler/enqueued/memory': 4,
         'start_time': datetime.datetime(2015, 3, 20, 7, 41, 45, 97000)}
2015-03-20 11:11:50+0330 [Clause] INFO: Spider closed (finished)
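The log already hints at what goes wrong: the login POST went to https://twitter.com/sessions/change_locale rather than to a login endpoint, came back as a 404, and HttpErrorMiddleware then dropped the response, which is why check_login never fires. FormRequest.from_response submits the first form on the page unless told otherwise, so it has apparently picked a locale-change form instead of the login form. One way to verify, sketched in Scrapy's interactive shell (the XPath expressions are assumptions about the live page, not guaranteed selectors):

scrapy shell "https://twitter.com/login"
>>> # which forms exist, in document order? from_response picks the first by default
>>> response.xpath('//form/@action').extract()
>>> # what input names does the form that posts to /sessions carry?
>>> response.xpath('//form[contains(@action, "sessions")]//input/@name').extract()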

I am not sure about the part of my login function with session[username_or_email] and session[password]. Do you know what I should write there? Is it correct? (I used the name attributes of those fields on the login page, following examples I have seen.) Could you please help me? Thanks a lot in advance.
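The field names themselves match what Twitter's login form used at the time, and from_response copies hidden inputs such as authenticity_token over automatically once it submits the correct form. The part worth forcing is form selection. A hedged sketch of the login callback (the form index is an assumption to be verified with the shell check above; newer Scrapy versions also accept a formxpath argument for this):

def login(self, response):
    print "\n\n\n login is running \n\n\n"
    return FormRequest.from_response(
        response,
        formnumber=1,  # assumption: the login form is the second form on the page (index 1), after the locale form
        formdata={'session[username_or_email]': 's.shahryar75@gmail.com',
                  'session[password]': '********'},
        callback=self.check_login)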

0 Answers:

There are no answers.