When I ran this code a few days ago, it worked fine:
from bs4 import BeautifulSoup
import datetime
import requests

def getWeekMostRead(date):
    nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
    content = "amazon"+date.isoformat()+"_nonfiction.html"
    with open(content, "w", encoding="utf-8") as nf_file:
        print(nonfiction_page.content, file=nf_file)
    mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")
    nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")
    mostread = []
    for books in nonfiction:
        if books.find(class_="kc-rank-card-publisher") is None:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                "",
                books.find(class_="numeric-star-data").small.string.strip()
            ))
        else:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                books.find(class_="kc-rank-card-publisher").string.strip(),
                books.find(class_="numeric-star-data").small.string.strip()
            ))
    return mostread

mostread = []
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
    print("Scraped data from "+date.isoformat())
    mostread.extend(getWeekMostRead(date))
    date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
with open("AmazonCharts.csv", "w") as csv:
    counter = 0
    print("ID,Title,Author,Publisher,Rating", file=csv)
    for book in mostread:
        counter += 1
        print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv)
csv.close()
For some reason, when I tried to run it again today, the returned HTML file contained this instead:
To discuss automated access to Amazon data please contact api-services-support@amazon.com.\r\n\r\nFor information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
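I understand that Amazon has heavy anti-scraping protection (or at least that is what I have read in some replies and threads). I tried adding headers and delays to my code, but it did not help. Is there another approach I could try? Or, if I should just wait, how long should I wait?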
Answer 0 (score: 5)
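As you pointed out, Amazon is very resistant to scraping. An entire industry has been built around scraping data from Amazon, and Amazon sells access to its own API, so it is in their best interest to stop people from freely pulling data from their pages.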
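Based on your code, I suspect you made too many requests too quickly and got IP-banned. When scraping a site, it is usually best to scrape responsibly: do not go too fast, rotate user agents, and rotate IPs through a proxy service.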
To look less programmatic, you can also try randomizing the timing of your requests so that the traffic appears more human.
Even with all of that, you may still run into problems. Amazon is not a site that can be scraped reliably.
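The suggestions above (slow down, rotate user agents, randomize timing) could be sketched roughly as follows. This is only an illustration, not a known way around Amazon's blocking: the User-Agent strings are made-up placeholders, the helper names (`random_headers`, `jittered_delay`, `polite_get`) are hypothetical, and the 5 to 8 second delay is an arbitrary assumption. Proxy rotation is not covered here.

```python
import random
import time

# Placeholder User-Agent strings (assumption); substitute real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0",
]


def random_headers(rng=random):
    """Pick one User-Agent at random for the next request."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


def jittered_delay(base=5.0, jitter=3.0, rng=random):
    """Return a wait in seconds drawn uniformly from [base, base + jitter]."""
    return base + rng.uniform(0.0, jitter)


def polite_get(get, url, sleep=time.sleep, rng=random):
    """Wait a randomized interval, then call get(url, headers=...).

    `get` is any callable shaped like requests.get; it is passed in so the
    sketch does not depend on a particular HTTP library.
    """
    sleep(jittered_delay(rng=rng))
    return get(url, headers=random_headers(rng))
```

In the question's code, the `requests.get(...)` call inside `getWeekMostRead` could then become `polite_get(requests.get, url)`.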
Answer 1 (score: 0)
You can try adding a User-Agent to the request headers, like this:
import requests

headers = {
    'User-Agent': 'My User Agent 1.0',
    'From': 'personal@domain.com'  # This is another valid field
}
url = "YOURLINK"
req = requests.get(url, headers=headers)
That should work.
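To see whether the header actually made a difference, one option is to check the response for the notice quoted in the question before parsing it. The sketch below is an assumption-laden illustration: `looks_blocked` is a hypothetical helper, the marker is simply the opening words of that notice, and treating every non-200 status as blocked is a deliberate simplification.

```python
# Opening words of the notice quoted in the question.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(status_code, body):
    """Return True if the response looks like the automated-access notice.

    Any non-200 status is also treated as blocked (a simplification).
    """
    return status_code != 200 or BLOCK_MARKER in body
```

With the request above, this would be called as `looks_blocked(req.status_code, req.text)`, skipping the parse step when it returns True.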
Answer 2 (score: -1)
After a while, I figured out the solution. It was quite simple: Amazon has no chart for "2020-01-01", so I changed the date to "2020-01-05".
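For what it is worth, 2020-01-05 is a Sunday, while 2020-01-01 is a Wednesday. Assuming the charts are keyed to weekly Sunday dates (an inference from that single working date, not something confirmed in this thread), the start date could be snapped to a Sunday before stepping back a week at a time. The helper names below are hypothetical.

```python
import datetime


def snap_to_weekday(day, weekday=6):
    """Return the first date on or after `day` that falls on `weekday`.

    Uses date.weekday() numbering: Monday is 0, Sunday is 6.
    """
    return day + datetime.timedelta(days=(weekday - day.weekday()) % 7)


def weekly_dates(start, stop):
    """Yield dates from `start` back to `stop` (inclusive) in 7-day steps."""
    day = start
    while day >= stop:
        yield day
        day -= datetime.timedelta(days=7)
```

For example, `snap_to_weekday(datetime.date(2020, 1, 1))` gives `datetime.date(2020, 1, 5)`, and iterating `weekly_dates` from there keeps every request on the same weekday.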