How to scrape Amazon using Python 3

Time: 2016-10-28 19:00:10

Tags: python web-scraping urllib

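I am trying to read all the reviews for a given product, both to learn Python and for a project. To simplify the task, I picked a product at random to code against.

The link I want to read is on Amazon, and I use urllib to open it: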

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')

After reading the link into the "amazon" variable, I get the following message when I print it:

print(amazon)
<http.client.HTTPResponse object at 0x000000DDB3796A20>

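So I read online and found that I need to use the read command to get the page source, but sometimes it gives me something that looks like a web page and sometimes it does not: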

print(amazon.read())
b''

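How do I read the page and pass it to Beautiful Soup?

Edit 1

I did use requests.get, and when I checked the text of the retrieved page I found the following content, which does not match the page at the link.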

print(a2)
<html>
<head>
<title>503 - Service Unavailable Error</title>
</head>
<body bgcolor="#FFFFFF" text="#000000">

<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.in/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.in/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
-->

<center>
<a href="http://www.amazon.in/ref=cs_503_logo/">
<img src="https://images-eu.ssl-images-amazon.com/images/G/31/x-locale/communities/people/logo.gif" width=200 height=45 alt="Amazon.in" border=0></a>
<p align=center>
<font face="Verdana,Arial,Helvetica">
<font size="+2" color="#CC6600"><b>Oops!</b></font><br>
<b>It's rush hour and traffic is piling up on that page. Please try again in a short while.<br>If you were trying to place an order, it will not have been processed at this time.</b><p>

<img src="https://images-eu.ssl-images-amazon.com/images/G/02/x-locale/common/orange-arrow.gif" width=10 height=9 border=0 alt="*">
<b><a href="http://www.amazon.in/ref=cs_503_link/">Go to the Amazon.in home page to continue shopping</a></b>
</font>

</center>
</body>
</html>

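2 answers:

Answer 0: (score: 2)

Stick with your current library, urllib; here is what you can do. Use .read() to get the HTML, then pass it to BeautifulSoup. Keep in mind that Amazon is heavily protected against scraping. You may be getting different results because parts of the HTML are generated by JavaScript; for that you may have to use Selenium or Dryscrape. You may also need to pass headers, cookies, and other attributes with your request.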

import urllib.request
from bs4 import BeautifulSoup

amazon = urllib.request.urlopen('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = amazon.read()
soup = BeautifulSoup(html, 'html.parser')
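
The point about passing headers can also be sketched with urllib alone. This is a minimal sketch, not a tested fix for Amazon: the helper names `build_request` and `fetch_html` and the User-Agent string are illustrative assumptions, and a browser-like header may or may not be enough to avoid being blocked. Note that a response body can only be read once; a second `read()` on the same response returns `b''`, so keep the result in a variable.

```python
import urllib.request

# Illustrative browser-like User-Agent (an assumption); any real browser string could be used.
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64)'

def build_request(url):
    """Return a Request carrying a browser-like User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})

def fetch_html(url):
    """Open the URL and return the raw body bytes, read exactly once."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

The bytes returned by `fetch_html(url)` can then be passed to BeautifulSoup in the same way as `html` above.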

Edit: it turns out you are now using requests. I was able to get a 200 response by passing headers like this with the request.

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'
}
response = requests.get('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1',headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)  # I got 200
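
Before parsing, it may help to check whether the response is actually the error page from the question rather than the product page. The following is a rough sketch under that assumption: `looks_blocked` and `BLOCK_MARKERS` are hypothetical names, and the marker strings are simply copied from the 503 page shown in the question, so they may not cover every way Amazon rejects a request.

```python
# Marker strings copied from the error page shown in the question.
BLOCK_MARKERS = (
    '503 - Service Unavailable Error',
    'api-services-support@amazon.com',
)

def looks_blocked(status_code, html):
    """Return True if the response looks like the error page rather than a product page."""
    if status_code != 200:
        return True
    return any(marker in html for marker in BLOCK_MARKERS)
```

With requests, this could be called as `looks_blocked(response.status_code, response.text)` before building the soup.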

--- Using Dryscrape

import dryscrape
from bs4 import BeautifulSoup

sess = dryscrape.Session(base_url='http://www.amazon.in')
# Set the header before visiting so that it is sent with the request.
sess.set_header('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36')
sess.visit('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1?_encoding=UTF8&pd_rd_i=B014CZA8P0&pd_rd_r=04RP223D4SF9BW7S2NP1&pd_rd_w=ZgGL6&pd_rd_wg=0PSZe&refRID=04RP223D4SF9BW7S2NP1')
html = sess.body()
soup = BeautifulSoup(html, 'html.parser')
print(soup)

## This should give you the full Amazon HTML now. Keep in mind that I haven't tested this code. Refer to the dryscrape documentation for installation: https://dryscrape.readthedocs.io/en/latest/apidoc.html

Answer 1: (score: 1)

Personally, I would use the requests library rather than urllib. Requests has more features.

import requests

From there:

resp = requests.get(url)  # You can also split up your parameters and pass base_url and params separately if you have multiple products to deal with
soup = BeautifulSoup(resp.text, 'html.parser')

That should do it, since this is a fairly simple HTTP request.

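Edit: Based on your error, you will have to work out which parameters to pass so that your request looks right. In general, a request looks like this (with the values you find, obviously; use your browser's debug/developer tools to inspect the network traffic and see what your browser sends to Amazon):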

url = "https://www.base.url.here"
params = {
    'param1': 'value1',
    # .....
}
resp = requests.get(url, params=params)
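
To illustrate splitting a URL into a base URL and parameters, here is a small sketch using only the standard library. The helper name `split_url` is hypothetical, and the product URL below is the one from the question with its query string shortened for readability.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl

def split_url(full_url):
    """Split a URL into (base_url, params) so they can be passed separately."""
    parts = urlsplit(full_url)
    # Rebuild the URL without its query string or fragment.
    base_url = urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))
    # Turn the query string into a plain dict of name -> value.
    params = dict(parse_qsl(parts.query))
    return base_url, params

# The question's URL, with the query string shortened for readability.
product_url = ('http://www.amazon.in/United-Colors-Benetton-Flip-Flops-Slippers'
               '/dp/B014CZA8P0/ref=pd_rhf_se_s_qp_1'
               '?_encoding=UTF8&pd_rd_i=B014CZA8P0')
base_url, params = split_url(product_url)
```

The resulting pair can be passed as `requests.get(base_url, params=params)`, which makes it easier to swap in a different product path while keeping the same parameters.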