Hello, I am trying to parse the book details from an Amazon web page using Beautiful Soup:
from bs4 import BeautifulSoup
import requests
url = raw_input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
#Grab book details
print soup.find("table", {"id": "productDetailsTable" })
But when I run this code, the result is None. I am sure the id productDetailsTable exists, and the same code works when I run it against dummy HTML. Does it fail only for URLs?
Answer 0 (score: 3)
I had to use the full URL https://www.amazon.com/ to receive any HTML data.
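If the input was typed without a scheme (for example www.amazon.com), requests refuses it with requests.exceptions.MissingSchema. Assuming that is what happened here, a minimal sketch of normalizing the input before the request could look like this. The helper name ensure_scheme and the default of https are illustrative assumptions, not part of the original code.

```python
def ensure_scheme(url, default_scheme="https"):
    """Return the URL with a scheme prefix (hypothetical helper).

    requests needs a full URL such as https://www.amazon.com/,
    so a bare host name gets the default scheme prepended.
    """
    url = url.strip()
    if "://" not in url:
        url = default_scheme + "://" + url
    return url
```

The result can then be passed to requests.get(), e.g. requests.get(ensure_scheme(url)).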
Here is my slightly modified Python 3 version of the code:
from bs4 import BeautifulSoup
import requests
url = input("Enter a website to extract the URL's from: ")
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
print(soup.text)
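It prints the text content of the page.
You will notice that Amazon is smart. The HTML it returns contains a robot check instead of the product page: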
if (true === true) {
var ue_t0 = (+ new Date()),
ue_csm = window,
ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
ue_furl = "fls-na.amazon.com",
ue_mid = "ATVPDKIKX0DER",
ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
ue_sn = "opfcaptcha.amazon.com",
ue_id = 'R8D7EEN5FVS7RWC2M549';
}
Enter the characters you see below
Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.
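That is why soup.find() returns None: the table is simply not in the page that was served. A minimal sketch of telling the two cases apart is shown below. It assumes the robot-check page always contains one of the phrases quoted above; the names RobotCheckError, looks_like_robot_check and find_details_table are hypothetical, and the built-in "html.parser" is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Phrases copied from the robot-check page quoted above (an assumption:
# Amazon may change this wording at any time).
CAPTCHA_MARKERS = (
    "Enter the characters you see below",
    "make sure you're not a robot",
)


class RobotCheckError(RuntimeError):
    """Raised when the response looks like the robot-check page."""


def looks_like_robot_check(soup):
    """Return True if the page text contains a robot-check phrase."""
    text = soup.get_text()
    return any(marker in text for marker in CAPTCHA_MARKERS)


def find_details_table(html):
    """Return the productDetailsTable tag, or None if it is absent.

    Raises RobotCheckError when the HTML is the robot-check page, so a
    blocked request is not mistaken for a missing table.
    """
    soup = BeautifulSoup(html, "html.parser")
    if looks_like_robot_check(soup):
        raise RobotCheckError("received the robot-check page")
    return soup.find("table", {"id": "productDetailsTable"})
```

With the code from the answer, this would be called as find_details_table(r.text), and a RobotCheckError signals that the selector is not the problem.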