How to extract product information (title, price, review, ASIN) from all Amazon product pages? (Python, web scraping)

Date: 2018-07-24 10:24:18

Tags: python web-scraping

I made a scraping program that goes through all Amazon product pages (there are at most 24 products per page; this is the template: https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215). When I run the program, it only goes through the first page. Where should I modify the code? Do I have to change the position of this line: driver.find_element_by_id("pagnNextString").click()? I have attached the code below. I would appreciate any help. Thank you.

THE PROGRAM

from time import sleep
from urllib.parse import urljoin
import csv
import requests
from lxml import html
from selenium import webdriver
import io

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch, br",
    "Accept-Language": "en-US,en;q=0.8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
}

proxies = {
      'http': 'http://198.1.122.29:80',
      'https': 'http://204.52.206.65:8080'
}

chrome_options = webdriver.ChromeOptions()

chrome_options.add_argument('--proxy-server="%s"' % ';'.join(['%s=%s' % (k, v) for k, v in proxies.items()]))

driver = webdriver.Chrome(executable_path="C:\\Users\Andrei-PC\Downloads\webdriver\chromedriver.exe",
                              chrome_options=chrome_options)
header = ['Product title', 'Product price', 'Review', 'ASIN']

links = []
url = 'https://www.amazon.com/s/ref=sr_pg_1?fst=as%3Aoff&rh=n%3A1055398%2Cn%3A284507%2Cn%3A510202%2Ck%3Aas&keywords=as&ie=UTF8&qid=1532414215'

while True:
    try:
        print('Fetching url [%s]...' % url)
        response = requests.get(url, headers=headers, proxies=proxies, stream=True)
        if response.status_code == 200:
            try:
                products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')

                for product in products:
                    title = product.find_element_by_tag_name('h2').text
                    price = ([item.text for item in product.find_elements_by_xpath(
                        './/a/span[contains(@class, "a-color-base")]')] + ["No price"])[0]
                    review = ([item.get_attribute('textContent') for item in product.find_elements_by_css_selector(
                        'i.a-icon-star>span.a-icon-alt')] + ["No review"])[0]
                    asin = product.get_attribute('data-asin') or "No asin"

                    try:
                        data = [title, price, review, asin]
                    except:
                        print('no items')
                    with io.open('csv/furniture.csv', "a", newline="", encoding="utf-8") as output:
                        writer = csv.writer(output)
                        writer.writerow(data)
                    driver.find_element_by_id("pagnNextString").click()
            except IndexError:
                break

    except Exception:
        print("Connection refused by the server..")
        print("Let me sleep for 5 seconds")
        print("ZZzzzz...")
        sleep(5)
        print("Was a nice sleep, now let me continue...")

1 Answer:

Answer 0: (score: 1)

url = urljoin('https://www.amazon.com', next_url)
for i in range(len(url)):
    driver.get(url[i])

These lines do the following:

  1. url = urljoin('https://www.amazon.com', next_url) builds the URL as a string, e.g. https://www.amazon.com/some_source, and assigns it to the url variable
  2. for i in range(len(url)) iterates over the integers 0, 1, 2, 3, ... len(url) - 1, assigning each integer to the i variable
  3. driver.get(url[i]) navigates to a single character of that string, e.g. driver.get("h"), driver.get("t"), ... (see the snippet below)
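A quick illustration of why this fails (the URL here is just a placeholder string):

url = 'https://www.amazon.com/some_source'
for i in range(len(url)):
    print(url[i])  # prints 'h', 't', 't', 'p', 's', ... one character per iteration
    # so driver.get(url[i]) would be called with a single character, not a URL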

I don't know what exactly you are trying to do, but I guess you need

url = urljoin('https://www.amazon.com', next_url)
driver.get(url)

UPDATE

If you need to go through all the pages, try adding

driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()

after scraping each page.

Also note that for product in products can never raise an IndexError, so you can drop the try / except around this loop.
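Putting it together, a minimal sketch of the corrected page loop (this reuses the selectors from the question, lets Selenium itself handle navigation instead of requests, and assumes the next-page link is absent on the last page):

from selenium.common.exceptions import NoSuchElementException

driver.get(url)  # navigate with Selenium; the separate requests call is not needed

while True:
    # scrape every product on the current page first
    products = driver.find_elements_by_xpath('//li[starts-with(@id, "result_")]')
    for product in products:
        title = product.find_element_by_tag_name('h2').text
        asin = product.get_attribute('data-asin') or "No asin"
        # ... extract price and review as in the question, then write the CSV row
    try:
        # move to the next page once per page, outside the product loop
        driver.find_element_by_xpath('//a/span[@id="pagnNextString"]').click()
    except NoSuchElementException:
        break  # assumption: no next-page link on the last page

The key change is that the click happens once per page, after the for product in products loop, instead of once per product.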