I am trying to download a page from the Newegg mobile API with Scrapy. I wrote this script, but it does not work. When I try a normal link, the script writes the response to a file, but with the URL to the Newegg mobile API it fails to write the response to a file.
from scrapy import Spider, Request

class NeweggSpider(Spider):
    name = 'newegg'
    allowed_domains = ['newegg.com']
    # http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails
    start_urls = ["http://www.newegg.com/Product/Product.aspx?Item=N82E16883282695"]
    meta_page = 'newegg_spider_page'
    meta_url_tpl = 'newegg_url_template'

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, callback=self.parse_details)

    def parse_details(self, response):
        with open('log.txt', 'w') as f:
            f.write(response.body)
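(Note: `response.body` is a bytes object, so on Python 3 the file should be opened in binary mode. A minimal stdlib-only sketch of the write step, independent of Scrapy; the helper name `save_body` is illustrative only:)

```python
# Minimal sketch: response.body is bytes, so open the log file in
# binary mode ('wb') to avoid TypeError/encoding issues on Python 3.
def save_body(body, path="log.txt"):
    """Write a raw response body (bytes) to a file."""
    with open(path, "wb") as f:
        f.write(body)
    return path

# Illustrative call with a stand-in payload, not a real API response.
save_body(b'{"ok": true}', "log.txt")
```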
I cannot save the response for my own URL. I want to download the JSON from http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails. I set USER_AGENT in scrapy.cfg:
[settings]
default = neweggs.settings
[deploy]
url = http://localhost:6800/
project = neweggs
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
Scrapy stats:
2015-10-28 14:46:38 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 777,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 1430,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 10, 28, 12, 46, 38, 776000),
'log_count/DEBUG': 6,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 10, 28, 12, 46, 36, 208000)}
2015-10-28 14:46:38 [scrapy] INFO: Spider closed (finished)
Answer 0 (score: 1)
Since you are making the requests manually in start_requests, you need to pass the User-Agent header explicitly with them. This works for me:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url,
                             callback=self.parse_details,
                             headers={"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3"})
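A minimal stdlib-only sketch (no Scrapy required) of why this works: headers passed to an individual Request override any project-wide default, much like a dict merge. The names `build_headers` and `MOBILE_UA` are illustrative, not part of the Scrapy API:

```python
# Illustration only: per-request headers win over project defaults,
# mirroring what happens when you pass headers= to scrapy.Request.
MOBILE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) "
             "AppleWebKit/534.46 (KHTML, like Gecko) "
             "Version/5.1 Mobile/9B179 Safari/7534.48.3")

def build_headers(default_headers, request_headers=None):
    """Merge per-request headers over the defaults (request wins)."""
    merged = dict(default_headers)
    merged.update(request_headers or {})
    return merged

headers = build_headers({"User-Agent": "Scrapy/1.0"},
                        {"User-Agent": MOBILE_UA})
```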
Answer 1 (score: 1)
The link to "http://www.ows.newegg.com/Products.egg/N82E16883282695/ProductDetails" returns a page with HTTP status 400, i.e. "Bad Request".
That is why you see 3 requests in the stats: Scrapy's Retry Middleware retried the fetch three times before giving up. By default, Scrapy does not pass responses with HTTP status 400 back to the spider. If you want it to, add handle_httpstatus_list = [400] to your spider.
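Once the request succeeds (or the 400 response is handed to the spider via handle_httpstatus_list), the body is JSON and can be decoded with the standard library. A hedged sketch with a made-up payload: the field names (`Title`, `FinalPrice`) are assumptions for illustration, not confirmed fields of the real ProductDetails API:

```python
import json

# Hypothetical sample of what the ProductDetails endpoint might return;
# the field names here are assumptions, not the real API schema.
sample_body = '{"Title": "Sample GPU", "FinalPrice": "329.99"}'

def parse_details_body(body):
    """Decode the JSON body of a ProductDetails-style response."""
    return json.loads(body)

details = parse_details_body(sample_body)
```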
Answer 2 (score: 1)
You should not specify settings in scrapy.cfg; you need to do that in the settings.py file.
settings.py:
...
USER_AGENT = 'Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B179 Safari/7534.48.3'
...