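How do I scrape details from multiple Amazon search result pages? It works fine for the first page, but for the other pages it does not work properly and the results are different.
YML file details: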
products:
    css: 'div[data-component-type="s-search-result"]'
    xpath: null
    multiple: true
    type: Text
    children:
        title:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Text
        url:
            css: 'h2 a.a-link-normal.a-text-normal'
            xpath: null
            type: Link
        rating:
            css: 'div.a-row.a-size-small span:nth-of-type(1)'
            xpath: null
            type: Attribute
            attribute: aria-label
        reviews:
            css: 'div.a-row.a-size-small span:nth-of-type(2)'
            xpath: null
            type: Attribute
            attribute: aria-label
        price:
            css: 'span.a-price:nth-of-type(1) span.a-offscreen'
            xpath: null
            type: Text
This is the function I am using:
from selectorlib import Extractor
import requests
import json
from time import sleep

# Build the extractor from the YAML selectors shown above
e = Extractor.from_yaml_file('search_result.yml')

def scrape(url):
    headers = {
        'dnt': '1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://www.amazon.in/',
        'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8',
    }
    # Download the page using requests
    print("Downloading %s" % url)
    r = requests.get(url, headers=headers)
    # Simple check for a blocked page (usually a 503)
    if r.status_code >= 500:
        if "To discuss automated access to Amazon data please contact" in r.text:
            print("Page %s was blocked by Amazon. Please try using better proxies\n" % url)
        else:
            print("Page %s must have been blocked by Amazon as the status code was %d" % (url, r.status_code))
        return None
    # Pass the page HTML to the extractor and return the extracted data
    return e.extract(r.text)

data = scrape('https://www.amazon.in/s?k=mobile')
print(data)
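A minimal sketch of how the single-page function could be driven across several pages with a pause between requests. `scrape_pages`, its parameters, and the two-second default delay are hypothetical names and values chosen for illustration; the function to call is passed in as `scrape_fn` so the sketch does not depend on the network, and it assumes the extracted data is a dict with a `products` list, as the YML file suggests.

```python
from time import sleep


def scrape_pages(scrape_fn, keyword, last_page, delay_seconds=2.0):
    """Call scrape_fn on each results page and collect the product entries.

    scrape_fn is expected to behave like scrape() in the question: it takes
    a URL and returns a dict with a "products" list, or None when blocked.
    """
    all_products = []
    for page in range(1, last_page + 1):
        url = "https://www.amazon.in/s?k={}&page={}".format(keyword, page)
        data = scrape_fn(url)
        # Skip blocked pages (None) and pages where nothing was extracted.
        if data and data.get("products"):
            all_products.extend(data["products"])
        # Pause between requests, but not after the last one.
        if page < last_page:
            sleep(delay_seconds)
    return all_products
```

With the question's function this would be called as `scrape_pages(scrape, "mobile", 3)`. Whether the pages fetched this way match what the browser shows is a separate matter that this sketch does not settle.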
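It works fine for the first page, but when I click through to the next page, the URL also changes dynamically and includes a qid parameter.
Example of the link for the second page: 'https://www.amazon.in/s?k=mobile&page=2&qid=1602337497&ref=sr_pg_2'
When I try to run a loop, I build the URL like this: 'https://www.amazon.in/s?k=mobile&page={}'.format(i).
That also gives me results, but they are not the same as the ones I get when I click the link.
How can I scrape multiple pages of Amazon search results?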
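To see exactly how the two URLs in the question differ, their query strings can be compared with the standard library. This is only a diagnostic sketch: `query_params` is a hypothetical helper, and it does not establish whether the extra parameters are what changes the results.

```python
from urllib.parse import parse_qsl, urlsplit

# URL obtained by clicking "next page" (copied from the question)
clicked = "https://www.amazon.in/s?k=mobile&page=2&qid=1602337497&ref=sr_pg_2"
# URL produced by the format() call in the question, for page 2
generated = "https://www.amazon.in/s?k=mobile&page={}".format(2)


def query_params(url):
    """Return the query string of url as a dict of parameter names to values."""
    return dict(parse_qsl(urlsplit(url).query))


clicked_params = query_params(clicked)
generated_params = query_params(generated)

# Parameters present in the clicked URL but absent from the generated one
missing = {k: v for k, v in clicked_params.items() if k not in generated_params}
print(missing)
```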
Answer 0 (score: 0)
I was able to get this working, and it works well. Just loop over the page number:
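import json
import requests as r

page_number = 1
my_url = 'https://www.amazon.in/s/query?k=mobile&page={}&qid=1604103880&ref=sr_pg_{}'.format(page_number, page_number)
res = r.post(my_url, data={"customer-action": "pagination"}, headers={'User-Agent': 'Mozilla/5.0'})
# The response body is a series of JSON arrays separated by "&&&"
rows = res.text.split("&&&")
for row in rows:
    html_content = ''
    try:
        # Parse with json.loads rather than eval: the rows are JSON, and
        # eval would execute whatever the server sends back
        array = json.loads(row)
        json_data = array[2]
        index = json_data["index"]
        html_content = json_data["html"]
    except (ValueError, IndexError, KeyError, TypeError):
        # Row is empty or is not a product entry; leave html_content blank
        pass
    # Process html_content here; note that some rows don't concern the products
    print(html_content)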
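The row handling above can be tried offline. The sketch below assumes, as the answer does, that each "&&&"-separated row is a JSON array whose third element is an object with an "html" field; the sample body is made up for illustration and is not a real Amazon response, and `extract_html_fragments` is a hypothetical helper name.

```python
import json


def extract_html_fragments(body):
    """Return the "html" field of every parsable row in an "&&&"-separated body."""
    fragments = []
    for row in body.split("&&&"):
        row = row.strip()
        if not row:
            continue
        try:
            payload = json.loads(row)[2]
            fragments.append(payload["html"])
        except (ValueError, IndexError, KeyError, TypeError):
            # Row is not a JSON array carrying an "html" payload; skip it.
            continue
    return fragments


# Made-up sample body (assumption); real responses are much larger.
sample_body = (
    '["dispatch", "search-result-1", {"index": 1, "html": "<div>Phone A</div>"}]'
    "\n&&&\n"
    '["dispatch", "search-metadata", {"metadata": {}}]'
    "\n&&&\n"
    '["dispatch", "search-result-2", {"index": 2, "html": "<div>Phone B</div>"}]'
    "\n&&&\n"
)
fragments = extract_html_fragments(sample_body)
print(fragments)
```

Each returned fragment is an HTML string, so it could then be handed to whatever HTML parser or extractor is already in use.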