Trying to generate links for all products on a website using Selenium

Date: 2018-11-26 14:26:12

Tags: python selenium selenium-webdriver beautifulsoup

The main goal of this script is to generate links for every product available on the website, where the products are separated by category.

The problem I'm running into is that I can only generate links for one category (infusion), namely the single URL I have saved. I want to add a second category/URL to it: https://www.vatainc.com/wound-care.html

Is there a way to loop through multiple category URLs that all do the same thing as the script I already have?

Here is my code:

import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup

all_product = []

url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
# Collect the href of every product link on the category page
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]


# Fetch each product page; note this prints the whole list of links on every pass
for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("div", {"class": "product-view"})
    print(links)

Here is some of the output; this one URL yields about 52 links.

['https://www.vatainc.com/infusion/0705-vascular-access-ultrasound-phantom-1616.html', 'https://www.vatainc.com/infusion/0751-simulated-ultrasound-blood.html', 'https://www.vatainc.com/infusion/body-skin-shell-0242.html', 'https://www.vatainc.com/infusion/2366-advanced-four-vein-venipuncture-training-aidtm-dermalike-iitm-latex-free-1533.html',

2 answers:

Answer 0 (score: 0)

Just enumerate the two URLs with a simple for loop:

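A minimal sketch of that approach, reusing the driver, time.sleep, and XPath from the question's script; appending ?limit=all to the wound-care URL is an assumption:

# The two category URLs from the question; ?limit=all on wound-care is assumed
category_urls = [
    "https://www.vatainc.com/infusion.html?limit=all",
    "https://www.vatainc.com/wound-care.html?limit=all",
]

all_product = []
for url in category_urls:
    driver.get(url)    # driver is the Remote driver created in the question's script
    time.sleep(2)      # give the page time to render
    links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    all_product.extend(links)

print(len(all_product))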

Answer 1 (score: 0)

You can simply loop over the two URLs. But if you are looking for a way to first collect them and then loop through them, you can do it like this:

import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
import pandas as pd


root_url = 'https://www.vatainc.com/'
service = service.Service(r'C:\chromedriver_win32\chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(root_url)
time.sleep(2)

# Grab the urls, but only keep the ones of interest
urls = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//ol[contains(@class, 'nav-primary')]/li/a")]
urls = [ x for x in urls if 'html' in x ] 

# It produces duplicates, so drop those and include ?limit=all to query all products
urls_list = pd.Series(urls).drop_duplicates().tolist()
urls_list = [ x +'?limit=all' for x in urls_list]

driver.close()


all_product = []

# loop through those urls and the links to generate a final product list
for url in urls_list:

    print ('Url: '+url)
    driver = webdriver.Remote(service.service_url, capabilities)
    driver.get(url)
    time.sleep(2)
    links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]


    for link in links:
        html = requests.get(link).text
        soup = BeautifulSoup(html, "html.parser")
        products = soup.findAll("div", {"class": "product-view"})
        all_product.append(link)
        print(link)

    driver.close()

This produces a list of 303 links.
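Since csv is imported but never used, here is a minimal sketch of writing all_product out afterwards; the file name products.csv is only a placeholder:

# Write one product link per row; 'products.csv' is a placeholder file name
with open('products.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['product_url'])
    writer.writerows([link] for link in all_product)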