Question

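I am trying to use Scrapy to scrape the search results that this site returns for any search query: http://www.bewakoof.com.

The site uses AJAX (in the form of XHR) to display search results. I managed to track down the XHR, and as you will notice in my code it is as follows (inside the for loop, where I store the URL in temp and increment 'i' in the loop):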

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")  


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, when I run this, I get an unexpected error:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)

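If you look at my code carefully, you will see that I set DOWNLOAD_DELAY = 5, yet I still get the same error as without it. I also increased it to DOWNLOAD_DELAY = 10, and the error is still the same. I have read many related questions on Stack Overflow and on GitHub, but none of them seemed to help.

I read in one of the answers that TOR together with Polipo can help. However, I am a bit wary of using it, because I do not know whether it is legal to use the combination of TOR and Polipo to scrape websites with Scrapy. (I do not want to get into any legal trouble.) That is why I would prefer not to use it. So, if it is legal, please provide code for my specific case using TOR and Polipo.

Or rather, if it is illegal, please help me solve this without them.

Please help me resolve these errors!

EDIT:

Here is my updated code: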

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def _monkey_patching_HTTPClientParser_statusReceived(self):
        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ == old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        self._monkey_patching_HTTPClientParser_statusReceived()
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Here is my updated output, as shown in the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)

So, as you can see, the error is still the same! :( Please help me resolve this!

UPDATE:

This is the output I got when I tried catching the exception, as @JoeLinux suggested:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
...
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

Answer 1

I got the same error:

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]

It works now.

I think you can try this:

1. In _monkey_patching_HTTPClientParser_statusReceived, change from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError to from twisted.web._newclient import HTTPClientParser, ParseError.

2. In start_requests, call _monkey_patching_HTTPClientParser_statusReceived for each request in start_urls, for example:

    def start_requests(self):
        for url in self.start_urls:
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)

Hope it helps.
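The patch described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the parser class exposes a `statusReceived(status)` method and raises an error whose first argument is `'wrong number of parts'`, as the traceback in the question shows. The helper name `patch_status_received` and the stand-in classes `FakeParser` and `FakeParseError` are hypothetical; they exist only so the sketch runs without Twisted installed. Unlike the per-request call above, the sketch applies the patch once, and it assigns the docstring with `=` rather than comparing it with `==`.

```python
def patch_status_received(parser_cls, parse_error_cls):
    """Wrap parser_cls.statusReceived so that a status line without a
    reason phrase is retried with ' OK' appended (hypothetical helper)."""
    if getattr(parser_cls, "_status_patch_applied", False):
        return  # already patched; do not stack wrappers
    original = parser_cls.statusReceived

    def statusReceived(self, status):
        try:
            return original(self, status)
        except parse_error_cls as e:
            if e.args and e.args[0] == "wrong number of parts":
                # Match the type of the incoming status line.
                suffix = b" OK" if isinstance(status, bytes) else " OK"
                return original(self, status + suffix)
            raise

    statusReceived.__doc__ = original.__doc__
    parser_cls.statusReceived = statusReceived
    parser_cls._status_patch_applied = True


# Stand-ins (assumptions) that mimic the failure seen in the traceback.
class FakeParseError(Exception):
    pass


class FakeParser(object):
    def statusReceived(self, status):
        parts = status.split(b" ", 2)
        if len(parts) != 3:
            raise FakeParseError("wrong number of parts", status)
        return parts


patch_status_received(FakeParser, FakeParseError)
result = FakeParser().statusReceived(b"HTTP/1.1 500")
print(result)
```

With Twisted available, the same helper would presumably be called once, before the crawler starts, as `patch_status_received(HTTPClientParser, ParseError)` using the two names imported from `twisted.web._newclient` as suggested in step 1.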

Answer 2

I was able to replicate your situation in scrapy shell. Here is the error I received in the interactive shell:

$ scrapy shell 
...
>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
2015-07-09 13:53:37-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
>>> print e.reasons[0].getTraceback()
...
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

Note that where I put ..., there was some text that is not important. The last line shows "wrong number of parts". After some googling, I found this question:

Error download page: twisted.python.failure.Failure 'scrapy.xlib.tx._newclient.ParseError'

The best suggestion there is a monkeypatch. Read through the thread and give it a shot.
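To see why a status line such as 'HTTP/1.1 500' trips the parser, here is a small stand-alone sketch. It assumes, based only on the error message, that the parser splits the status line on the first two spaces and insists on three parts (version, code, reason phrase). The function names `split_status_line` and `split_status_line_lenient` are hypothetical and are not part of Twisted or Scrapy.

```python
def split_status_line(status):
    """Strict split: require version, status code and reason phrase."""
    parts = status.split(b" ", 2)
    if len(parts) != 3:
        raise ValueError("wrong number of parts: %r" % (status,))
    version, code, reason = parts
    return version, int(code), reason


def split_status_line_lenient(status):
    """Lenient split: tolerate a missing reason phrase."""
    parts = status.split(b" ", 2)
    if len(parts) == 2:
        parts.append(b"")  # the server sent no reason phrase
    if len(parts) != 3:
        raise ValueError("wrong number of parts: %r" % (status,))
    version, code, reason = parts
    return version, int(code), reason


# A well-formed status line parses either way.
print(split_status_line(b"HTTP/1.1 500 Internal Server Error"))

# The status line from the traceback has only two parts.
try:
    split_status_line(b"HTTP/1.1 500")
except ValueError as exc:
    print("strict parser failed:", exc)

print(split_status_line_lenient(b"HTTP/1.1 500"))
```

The monkeypatch from the linked thread takes a different route to the same end: instead of relaxing the parser, it appends ' OK' to the status line so that the strict three-part split succeeds.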

Twisted Python Failure - Scrapy Issues
