Scrapy - How do I avoid error 426?

Time: 2017-11-03 14:37:52

Tags: python web-scraping scrapy

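I am using a GET form-submission URL to scrape data from a website, but every request I submit gets redirected (via a 302) to a 426 Upgrade Required error page.

I checked that my URL is correct, and it is. When I paste the exact URL my code requests into Google Chrome, it returns JSON as I expect. But when Scrapy requests the same URL, it ends up redirected to the 426 Upgrade Required error page.

Here is my Scrapy code: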

import json

import scrapy


class TestSpider(scrapy.Spider):
    name = "Test"
    # Zip codes to search around; a class attribute replaces the module-level global.
    zipcodes = ['35201', '99501']

    def start_requests(self):
        for zipcode in self.zipcodes:
            # {{zip}} in the query string is a placeholder for the zip code.
            url = 'https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet?skipProxy=true&rt=sel&usage=ss&loc=en_US&MDR=true&ss.proximityLocator=true&ss.milesConverter=1.609344&ss.selectoruniqueid=1509663605706&ss.displayLinks=&viewResultsThresHold=9000&includeRecords=true&No=0&resultsPerPage=40&baseN=4293669528&N=4293669528&_=1509663558936&ss.searchInterface=&Ntx=mode%2520matchall&Ntt={{zip}}&distance=+100'
            url = url.replace("{{zip}}", zipcode)
            yield scrapy.Request(url, callback=self.parse_json)

    def parse_json(self, response):
        # The endpoint returns JSON, so parse it with json.loads;
        # ast.literal_eval only understands Python literals and fails on true/false/null.
        stores = json.loads(response.text)
        import pdb; pdb.set_trace()  # pause here to inspect `stores`
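
A side note on the URL handling in the spider: instead of string replacement on one long literal, the query string could be assembled from its parts with the standard library. The following is a minimal sketch, not part of the original spider. The helper name `build_search_url` is hypothetical, and the parameter values are simply the ones read off the URL above, chosen so that the result is character-for-character the same query string. It is written for Python 3; on Python 2, `urlencode` lives in `urllib` instead.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet"


def build_search_url(zipcode):
    """Hypothetical helper: rebuild the spider's search URL for one zip code."""
    # A list of pairs keeps the parameters in the same order as the original URL.
    params = [
        ("skipProxy", "true"),
        ("rt", "sel"),
        ("usage", "ss"),
        ("loc", "en_US"),
        ("MDR", "true"),
        ("ss.proximityLocator", "true"),
        ("ss.milesConverter", "1.609344"),
        ("ss.selectoruniqueid", "1509663605706"),
        ("ss.displayLinks", ""),
        ("viewResultsThresHold", "9000"),
        ("includeRecords", "true"),
        ("No", "0"),
        ("resultsPerPage", "40"),
        ("baseN", "4293669528"),
        ("N", "4293669528"),
        ("_", "1509663558936"),
        ("ss.searchInterface", ""),
        # urlencode escapes "%" as "%25", which yields the "%2520" seen in the URL.
        ("Ntx", "mode%20matchall"),
        ("Ntt", zipcode),
        # A leading space is encoded as "+", which yields "distance=+100".
        ("distance", " 100"),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    for zipcode in ["35201", "99501"]:
        print(build_search_url(zipcode))
```

Because `urlencode` does the percent-encoding itself, values are passed unencoded. That also makes it visible that the `Ntx` value in the original URL is encoded twice, which may or may not be what the server expects.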

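Below is the output/log I get when I run the spider. The only thing I can think of that might be the problem is that I don't have a working installation of the service_identity module, but I don't know how to fix that (I'm a beginner with Python and Scrapy).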

MacBook-Air:spiders khanjan$ scrapy crawl Test -o test.csv
:0: UserWarning: You do not have a working installation of the service_identity module: 'No module named cryptography.x509'.  Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied.  Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
2017-11-03 02:03:08 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: tutorial)
2017-11-03 02:03:08 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'csv', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'Test.csv', 'BOT_NAME': 'tutorial'}
2017-11-03 02:03:08 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-11-03 02:03:08 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-11-03 02:03:08 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-11-03 02:03:08 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-11-03 02:03:08 [scrapy.core.engine] INFO: Spider opened
2017-11-03 02:03:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-11-03 02:03:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6029
https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet?skipProxy=true&rt=sel&usage=ss&loc=en_US&MDR=true&ss.proximityLocator=true&ss.milesConverter=1.609344&ss.selectoruniqueid=1509663605706&ss.displayLinks=&viewResultsThresHold=9000&includeRecords=true&No=0&resultsPerPage=40&baseN=4293669528&N=4293669528&_=1509663558936&ss.searchInterface=&Ntx=mode%2520matchall&Ntt=35201&distance=+100
https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet?skipProxy=true&rt=sel&usage=ss&loc=en_US&MDR=true&ss.proximityLocator=true&ss.milesConverter=1.609344&ss.selectoruniqueid=1509663605706&ss.displayLinks=&viewResultsThresHold=9000&includeRecords=true&No=0&resultsPerPage=40&baseN=4293669528&N=4293669528&_=1509663558936&ss.searchInterface=&Ntx=mode%2520matchall&Ntt=99501&distance=+100
2017-11-03 02:03:08 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.anondomain.com"; u'solutions.anondomain.com'!=u'www.anondomain.com'
2017-11-03 02:03:08 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.anondomain.com"; u'solutions.anondomain.com'!=u'www.anondomain.com'
2017-11-03 02:03:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d35201%26distance%3d+100> from <GET https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet?skipProxy=true&rt=sel&usage=ss&loc=en_US&MDR=true&ss.proximityLocator=true&ss.milesConverter=1.609344&ss.selectoruniqueid=1509663605706&ss.displayLinks=&viewResultsThresHold=9000&includeRecords=true&No=0&resultsPerPage=40&baseN=4293669528&N=4293669528&_=1509663558936&ss.searchInterface=&Ntx=mode%2520matchall&Ntt=35201&distance=+100>
2017-11-03 02:03:08 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "safe-browsing.anondomain.com"; u'*.anondomain.com'!=u'safe-browsing.anondomain.com'
2017-11-03 02:03:09 [scrapy.core.engine] DEBUG: Crawled (426) <GET https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d35201%26distance%3d+100> (referer: None)
2017-11-03 02:03:09 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <426 https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d35201%26distance%3d+100>: HTTP status code is not handled or not allowed
2017-11-03 02:03:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d99501%26distance%3d+100> from <GET https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet?skipProxy=true&rt=sel&usage=ss&loc=en_US&MDR=true&ss.proximityLocator=true&ss.milesConverter=1.609344&ss.selectoruniqueid=1509663605706&ss.displayLinks=&viewResultsThresHold=9000&includeRecords=true&No=0&resultsPerPage=40&baseN=4293669528&N=4293669528&_=1509663558936&ss.searchInterface=&Ntx=mode%2520matchall&Ntt=99501&distance=+100>
2017-11-03 02:03:12 [scrapy.core.engine] DEBUG: Crawled (426) <GET https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d99501%26distance%3d+100> (referer: None)
2017-11-03 02:03:12 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <426 https://safe-browsing.anondomain.com/invalid-protocol/notallowed.php?TLSOriginURL=https://www.anondomain.com/wps/PA_Snaps286/AjaxServlet%3fskipProxy%3dtrue%26rt%3dsel%26usage%3dss%26loc%3den_US%26MDR%3dtrue%26ss.proximityLocator%3dtrue%26ss.milesConverter%3d1.609344%26ss.selectoruniqueid%3d1509663605706%26ss.displayLinks%3d%26viewResultsThresHold%3d9000%26includeRecords%3dtrue%26No%3d0%26resultsPerPage%3d40%26baseN%3d4293669528%26N%3d4293669528%26_%3d1509663558936%26ss.searchInterface%3d%26Ntx%3dmode%252520matchall%26Ntt%3d99501%26distance%3d+100>: HTTP status code is not handled or not allowed
2017-11-03 02:03:12 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-03 02:03:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4205,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 44879,
 'downloader/response_count': 4,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/426': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 11, 3, 6, 3, 12, 261670),
 'httperror/response_ignored_count': 2,
 'httperror/response_ignored_status_count/426': 2,
 'log_count/DEBUG': 5,
 'log_count/INFO': 9,
 'log_count/WARNING': 3,
 'memusage/max': 30736384,
 'memusage/startup': 30736384,
 'response_received_count': 2,
 'scheduler/dequeued': 4,
 'scheduler/dequeued/memory': 4,
 'scheduler/enqueued': 4,
 'scheduler/enqueued/memory': 4,
 'start_time': datetime.datetime(2017, 11, 3, 6, 3, 8, 187267)}
2017-11-03 02:03:12 [scrapy.core.engine] INFO: Spider closed (finished)
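
The UserWarning at the top of the log says service_identity is unusable because `cryptography.x509` cannot be imported. One way to see which of the TLS-related packages import cleanly in the interpreter that runs Scrapy is a small check like the one below. This is only a diagnostic sketch: the helper name `check_imports` is made up for illustration, and it reports import status without fixing anything.

```python
import importlib


def check_imports(module_names):
    """Map each module name to None if it imports, else to the error message."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # a broken install may raise more than ImportError
            results[name] = "{}: {}".format(type(exc).__name__, exc)
    return results


if __name__ == "__main__":
    # Modules named in, or implied by, the warning in the log.
    wanted = ["cryptography.x509", "OpenSSL", "service_identity"]
    for name, error in check_imports(wanted).items():
        print(name, "->", "OK" if error is None else error)
```

It has to be run with the same Python that the `scrapy` command uses, otherwise the result says nothing about the spider's environment.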

How can I fix this?
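
One thing that might be worth trying, offered as an untested guess and not a confirmed fix: the 302 in the log points at a page under `invalid-protocol/notallowed.php` with a `TLSOriginURL` parameter, which could mean the server is rejecting the TLS version the client negotiated. Scrapy has a `DOWNLOADER_CLIENT_TLS_METHOD` setting for pinning the TLS version, and `HTTPERROR_ALLOWED_CODES` lets the 426 response reach the callback so its body can be inspected. A settings.py fragment along those lines, where the choice of TLS 1.2 is an assumption about what this server wants:

```python
# settings.py fragment (sketch only; that the server requires TLS 1.2 is an assumption)

# Ask Scrapy's downloader to use TLS 1.2 instead of the default negotiation ("TLS").
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"

# Debugging aid: hand 426 responses to the spider callback instead of dropping them.
# The callback should then check response.status before trying to parse JSON.
HTTPERROR_ALLOWED_CODES = [426]
```

This can only help if the installed pyOpenSSL and OpenSSL actually support TLS 1.2, so the `cryptography` import error from the warning would probably need sorting out first.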

0 answers:

No answers yet.