How to get a form with Scrapy

Asked: 2016-01-31 23:59:52

Tags: python web-scraping scrapy

I am trying to submit a form with a Scrapy spider so as to retrieve more results on the page. I tried to find out whether the problem had already been looked into, and I found only this post. I think I am doing the right thing (based on the explanation there), but the log looks like this (the form submission apparently fails and the spider stops):

2016-02-01 00:43:40 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-01 00:43:40 [scrapy] INFO: Optional features available: ssl, http11
2016-02-01 00:43:40 [scrapy] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'stack.json'}
2016-02-01 00:43:40 [scrapy] INFO: Enabled extensions: CloseSpider, FeedExporter, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-01 00:43:40 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-01 00:43:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-01 00:43:40 [scrapy] INFO: Enabled item pipelines: 
2016-02-01 00:43:40 [scrapy] INFO: Spider opened
2016-02-01 00:43:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-01 00:43:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-01 00:43:41 [scrapy] DEBUG: Redirecting (302) to <GET http://www.experienceproject.com/dologin.php?err=t> from <GET http://www.experienceproject.com/dologinhandler.php>
2016-02-01 00:43:41 [scrapy] DEBUG: Crawled (200) <GET http://www.experienceproject.com/dologin.php?err=t> (referer: None)
2016-02-01 00:43:42 [scrapy] DEBUG: Redirecting (303) to <GET http://www.experienceproject.com/dologin.php?err=t&usernameAttempt=linguisttoo> from <POST https://www.experienceproject.com/ajax/ep/login.php>
2016-02-01 00:43:43 [scrapy] DEBUG: Crawled (200) <GET http://www.experienceproject.com/dologin.php?err=t&usernameAttempt=linguisttoo> (referer: http://www.experienceproject.com/dologin.php?err=t)
Login success
2016-02-01 00:43:43 [scrapy] DEBUG: Crawled (200) <GET http://www.experienceproject.com/about/rain0069/stories> (referer: http://www.experienceproject.com/dologin.php?err=t&usernameAttempt=linguisttoo)
2016-02-01 00:43:43 [scrapy] DEBUG: Redirecting (303) to <GET http://www.experienceproject.com/ajax/member-profile-subpage/groups?nextStartId=20&filter=&loadContentOnly=1&mid=9547778&showAdult=&=See+More+Groups> from <GET http://www.experienceproject.com/ajax/member-profile-subpage/groups?nextStartId=20&filter=&loadContentOnly=1&mid=9547778&showAdult=&=See+More+Groups>
2016-02-01 00:43:43 [scrapy] DEBUG: Filtered duplicate request: <GET http://www.experienceproject.com/ajax/member-profile-subpage/groups?nextStartId=20&filter=&loadContentOnly=1&mid=9547778&showAdult=&=See+More+Groups> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-02-01 00:43:43 [scrapy] INFO: Closing spider (finished)
2016-02-01 00:43:43 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4091,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 5,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 32318,
 'downloader/response_count': 6,
 'downloader/response_status_count/200': 3,
 'downloader/response_status_count/302': 1,
 'downloader/response_status_count/303': 2,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 1, 31, 23, 43, 43, 949049),
 'log_count/DEBUG': 8,
 'log_count/INFO': 7,
 'request_depth_max': 3,
 'response_received_count': 3,
 'scheduler/dequeued': 6,
 'scheduler/dequeued/memory': 6,
 'scheduler/enqueued': 6,
 'scheduler/enqueued/memory': 6,
 'start_time': datetime.datetime(2016, 1, 31, 23, 43, 40, 995588)}
2016-02-01 00:43:43 [scrapy] INFO: Spider closed (finished)
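
As an aside, the "Filtered duplicate request" line refers to Scrapy's DUPEFILTER_DEBUG setting; enabling it makes Scrapy log every request the duplicate filter drops, not only the first one. A minimal snippet, e.g. in settings.py or in the spider's custom_settings:

# Log every request the built-in duplicate filter drops,
# not only the first occurrence.
DUPEFILTER_DEBUG = True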

My code is here:

import scrapy
from scrapy.http import FormRequest, Request
from scrapy.shell import inspect_response
from loginform import fill_login_form


class LoginSpider(scrapy.Spider):
    name = 'experienceproject.com'
    start_urls = ['http://www.experienceproject.com/dologinhandler.php']
    login_user = "xxx"  # Login is required but I do not provide it; I attach the page source to the question instead
    login_pass = "xxx"

    def parse(self, response):
        # Fill in and submit the login form found on the page
        args, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args,
                           callback=self.after_login)

    def after_login(self, response):
        # Check that the login succeeded before going on
        if "authentication failed" in response.body:
            self.logger.error("Login failed")
            return
        print "Login success"
        return Request(url="http://www.experienceproject.com/about/rain0069/stories",
                       callback=self.parse_stories)

    def parse_stories(self, response):
        # Try to click on "SEE MORE STORIES"
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@class="pagination-form indicate-loading"]',
            clickdata={'class': 'ep-button page-btn -input'},
            callback=self.go_out)

    def go_out(self, response):
        # See if it succeeded
        inspect_response(response, self)

The page sits behind a login, so I rather provide the page source here. What am I doing wrong?

1 Answer:

Answer 0 (score: 0):

Your request is being filtered out by the built-in duplicate filter. Use dont_filter=True:

return scrapy.FormRequest.from_response(
    response,
    formxpath='//form[@class="pagination-form indicate-loading"]',
    clickdata={'class': 'ep-button page-btn -input'},
    callback=self.go_out,
    dont_filter=True)
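
To see why the filter fires at all: the 303 response redirects the pagination request to the exact same URL, and Scrapy's built-in dupefilter fingerprints requests by method, URL and body, so the redirected GET collides with the request that produced it. A small sketch illustrating this, using the URL from the log above and Scrapy's own request_fingerprint helper:

from scrapy import Request
from scrapy.utils.request import request_fingerprint

url = ('http://www.experienceproject.com/ajax/member-profile-subpage/groups'
       '?nextStartId=20&filter=&loadContentOnly=1&mid=9547778&showAdult=&=See+More+Groups')

original = Request(url)
# RedirectMiddleware builds the redirected request with request.replace(),
# so a 303 pointing back at the same URL yields an identical GET
redirected = original.replace(url=url)

# Same method, URL and body -> same fingerprint -> dropped as a duplicate
assert request_fingerprint(original) == request_fingerprint(redirected)

Note that request.replace() also copies dont_filter along with the other request attributes, so setting it on the FormRequest lets the redirected GET through as well.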