Parsing an Amazon page with Beautiful Soup

Time: 2017-03-17 18:46:19

Tags: python beautifulsoup

Hello, I'm trying to parse the book details from an Amazon web page, and I'm using Beautiful Soup to do it.

Link: https://www.amazon.com/Dogs-Purpose-Novel-Humans/dp/0765326264/ref=sr_1_1?s=electronics&ie=UTF8&qid=1489776209&sr=1-1&keywords=books

from bs4 import BeautifulSoup
import requests

url = raw_input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

#Grab book details
print soup.find("table", {"id": "productDetailsTable" })

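But when I try this code the result is None. I'm sure the id productDetailsTable exists, and when I run the same code against dummy HTML it works. Does it fail only when the input is a URL?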
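
To illustrate the dummy-HTML case: `find` returns the matching element when the id is present in the markup it was given, and `None` when it is not. The sketch below is only an illustration. The HTML snippet and the helper name `find_details_table` are made up, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a product page; this is NOT Amazon's real markup.
DUMMY_HTML = """
<html><body>
  <table id="productDetailsTable"><tr><td>example detail</td></tr></table>
</body></html>
"""


def find_details_table(html):
    """Return the productDetailsTable element, or None if the markup lacks it."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", {"id": "productDetailsTable"})
```

So a `None` result means the table was not in the HTML that was actually parsed, which may differ from what a browser shows for the same URL.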

1 answer:

Answer 0: (score: 3)

I don't see productDetailsTable on https://www.amazon.com

I had to enter https://www.amazon.com/ to receive any HTML data.

Here is my slightly modified Python 3 version of the code.

from bs4 import BeautifulSoup
import requests

url = input("Enter a website to extract the URL's from: ")
r  = requests.get(url)

data = r.text

soup = BeautifulSoup(data, "lxml")

print(soup.text)

It prints the text content of the page that was returned.

You'll notice that Amazon is smart. The HTML includes a robot check:

if (true === true) {
var ue_t0 = (+ new Date()),
    ue_csm = window,
    ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
    ue_furl = "fls-na.amazon.com",
    ue_mid = "ATVPDKIKX0DER",
    ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
    ue_sn = "opfcaptcha.amazon.com",
    ue_id = 'R8D7EEN5FVS7RWC2M549';
}
Enter the characters you see below
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
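
A script can test for this page before trying to parse it. The following is a minimal sketch, assuming the marker strings seen in the output above are still what the robot-check page contains; the names `ROBOT_CHECK_MARKERS` and `looks_like_robot_check` are hypothetical.

```python
# Marker strings copied from the robot-check output shown above.
# Assumption: the page may change, so this list is a heuristic only.
ROBOT_CHECK_MARKERS = (
    "Enter the characters you see below",
    "opfcaptcha.amazon.com",
)


def looks_like_robot_check(html):
    """Return True if the response text contains any known robot-check marker."""
    return any(marker in html for marker in ROBOT_CHECK_MARKERS)
```

Calling `looks_like_robot_check(r.text)` before `soup.find(...)` makes it clear whether a `None` result comes from the robot check or from a missing element.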

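The robot check stops you from reading Amazon's page. You'll need to do more, probably sending headers and cookie information along with the requests call.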
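
One way to send headers and keep cookies is a `requests.Session`. This is a sketch under assumptions, not a guaranteed way past the check: the header values and the helper names `make_session` and `fetch_html` are made up for illustration, and Amazon may still return the robot-check page.

```python
import requests

# Hypothetical browser-like headers; the exact values are an assumption.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def make_session():
    """Create a session that sends the headers above on every request.

    A Session also stores cookies set by earlier responses and sends
    them back on later requests to the same site.
    """
    session = requests.Session()
    session.headers.update(BROWSER_HEADERS)
    return session


def fetch_html(session, url):
    """Fetch a page with the session and return its text, raising on HTTP errors."""
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

With these helpers, `fetch_html(make_session(), url)` would replace `requests.get(url).text` in the code above, and the result can be passed to `BeautifulSoup` as before.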