How to scrape a country-restricted website

Date: 2017-03-23 10:52:36

Tags: python proxy web-scraping scrapy

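I am trying to scrape a website with Scrapy, but the request is redirected to a 404 error page because I am not in that country. The same thing happens when I use a proxy. My code: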

# -*- coding: utf-8 -*-
import scrapy

from v4.items import Product


class AcerOfficeworksAuSpider(scrapy.Spider):
    name = "acer_officeworks_au_py"

    url = 'https://www.officeworks.com.au/shop/SearchDisplay?searchTerm=acer&storeId=10151&langId=-1&pageSize=24&beginIndex=0&sType=SimpleSearch&resultCatEntryType=2&showResultsPage=true&searchSource=Q&pageView='

    def start_requests(self):
        yield scrapy.Request(self.url, self.parse, meta={'proxy': 'http://97.77.104.22:3128'})

    def parse(self, response):
        print(response)

Result:

2017-03-23 12:49:29 [scrapy] DEBUG: Redirecting (302) to <GET https://wc-prod-joomla.s3.amazonaws.com/404/404.html> from <GET https://www.officeworks.com.au/shop/SearchDisplay?searchTerm=acer&storeId=10151&langId=-1&pageSize=24&beginIndex=0&sType=SimpleSearch&resultCatEntryType=2&showResultsPage=true&searchSource=Q&pageView=>
2017-03-23 12:49:34 [scrapy] DEBUG: Crawled (200) <GET https://wc-prod-joomla.s3.amazonaws.com/404/404.html> (referer: None)
<200 https://wc-prod-joomla.s3.amazonaws.com/404/404.html>
2017-03-23 12:49:34 [scrapy] INFO: Closing spider (finished)

Response when using curl with the proxy:

HTTP/1.1 200 Connection established

HTTP/1.1 302 Security Redirect
Cache-Control: no-cache
Expires: 0
Location: https://wc-prod-joomla.s3.amazonaws.com/404/404.html
Pragma: no-cache
transfer-encoding: chunked
Connection: keep-alive

What can I try to make this work?

1 Answer:

Answer 0 (score: 0)

The product data is available at this URL:

https://www.officeworks.com.au/webapp/wcs/stores/servlet/OWPriceView?storeId=10151&catalogId=10551&nc=true&productId=90235%2C90237%2C90239%2C504502%2C532522%2C559534%2C450004%2C495002%2C315544%2C582002%2C90229%2C112392%2C450006%2C536530%2C536532%2C536534%2C536536%2C597502%2C605514%2C396502%2C423002%2C536518%2C559532%2C610502

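The search page uses JavaScript to fetch its data from the URL above, so you can request that URL directly, as in this Scrapy shell session: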

In [1]: url = 'https://www.officeworks.com.au/webapp/wcs/stores/servlet/OWPriceView?storeId=10151&catalogId=10551&nc=true&productId=90235%2C90237%2C90239%2C504502%2C532522%2C559534%2C450004%2C495002%2C315544%2C582002%2C90229%2C112392%2C450006%2C536530%2C536532%2C536534%2C536536%2C597502%2C605514%2C396502%2C423002%2C536518%2C559532%2C610502'

In [2]: fetch(url)
2017-03-23 20:01:00 [scrapy.core.engine] INFO: Spider opened
2017-03-23 20:01:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.officeworks.com.au/webapp/wcs/stores/servlet/OWPriceView?storeId=10151&catalogId=10551&nc=true&productId=90235%2C90237%2C90239%2C504502%2C532522%2C559534%2C450004%2C495002%2C315544%2C582002%2C90229%2C112392%2C450006%2C536530%2C536532%2C536534%2C536536%2C597502%2C605514%2C396502%2C423002%2C536518%2C559532%2C610502> (referer: None)

In [3]: import json

In [4]: json.loads(response.text)
Out[4]: 
[{'bulkbuy': True,
  'hasContractPrice': False,
  'partNumber': 'ACC120',
  'price': '$357.00',
  'priceRange': [{'maximumQuantity': '2.0',
    'minimumQuantity': '1',
    'value': {'currency': 'AUD', 'value': 357.0}},
   {'maximumQuantity': '',
    'minimumQuantity': '3',
    'value': {'currency': 'AUD', 'value': 314.0}}],
  'priceRangeExclTax': [],
  'productId': '90229'},
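
(The output above is truncated.)

The same idea can be sketched outside the shell. The two helpers below are hypothetical (`build_price_url` and `parse_prices` are not part of Scrapy or of the site): the first rebuilds the OWPriceView URL from a list of product IDs, and the second flattens the JSON body into simple dicts. The endpoint, `storeId` and `catalogId` are copied from the URL above, and the field names come from the part of the output that is visible, so anything beyond those fields is an assumption.

```python
import json
from urllib.parse import urlencode

# Endpoint and default IDs are copied from the URL shown in this answer.
PRICE_VIEW = "https://www.officeworks.com.au/webapp/wcs/stores/servlet/OWPriceView"


def build_price_url(product_ids, store_id="10151", catalog_id="10551"):
    """Build the OWPriceView URL for a list of product IDs (hypothetical helper)."""
    query = urlencode({
        "storeId": store_id,
        "catalogId": catalog_id,
        "nc": "true",
        # urlencode percent-encodes the commas as %2C, matching the URL above.
        "productId": ",".join(str(p) for p in product_ids),
    })
    return PRICE_VIEW + "?" + query


def parse_prices(body):
    """Flatten the JSON body into one dict per product (hypothetical helper)."""
    items = []
    for entry in json.loads(body):
        # Each tier is (minimum quantity, unit price) from 'priceRange'.
        tiers = [
            (tier["minimumQuantity"], tier["value"]["value"])
            for tier in entry.get("priceRange", [])
        ]
        items.append({
            "product_id": entry["productId"],
            "part_number": entry.get("partNumber"),
            "price": entry.get("price"),
            "tiers": tiers,
        })
    return items
```

In a spider, you would presumably yield `scrapy.Request(build_price_url(ids), callback=self.parse_prices)` from `start_requests` and call `parse_prices(response.text)` inside that callback. Whether the endpoint also needs a proxy or extra headers is not something this sketch covers.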