The main goal of this script is to generate links for all products available on the site, separated by category.
The problem I'm running into is that I can only generate links for one category (infusion), specifically the URL I have saved. I would like to add a second category/URL: https://www.vatainc.com/wound-care.html
Is there a way to loop over multiple category URLs that all do the same thing as the script I already have?
Here is my code:
import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
all_product = []
url = "https://www.vatainc.com/infusion.html?limit=all"
service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
for link in links:
    html = requests.get(link).text
    soup = BeautifulSoup(html, "html.parser")
    products = soup.findAll("div", {"class": "product-view"})

print(links)
Here is some of the output; that one URL yields about 52 links.
['https://www.vatainc.com/infusion/0705-vascular-access-ultrasound-phantom-1616.html', 'https://www.vatainc.com/infusion/0751-simulated-ultrasound-blood.html', 'https://www.vatainc.com/infusion/body-skin-shell-0242.html', 'https://www.vatainc.com/infusion/2366-advanced-four-vein-venipuncture-training-aidtm-dermalike-iitm-latex-free-1533.html',
Answer 0 (score: 0)
Just enumerate both URLs with a simple for loop:
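For example, a minimal sketch of that approach, reusing the driver setup and XPath from the question and hard-coding the two category URLs mentioned above (the chromedriver path and binary are just the ones from the question):

import time
from selenium import webdriver
import selenium.webdriver.chrome.service as service

# The two category pages mentioned in the question; ?limit=all loads every product
category_urls = [
    "https://www.vatainc.com/infusion.html?limit=all",
    "https://www.vatainc.com/wound-care.html?limit=all",
]

service = service.Service('/Users/Jon/Downloads/chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)

all_product = []
for url in category_urls:
    driver.get(url)
    time.sleep(2)
    # Collect every product link on this category page
    links = [x.get_attribute('href') for x in
             driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    all_product.extend(links)

driver.close()
print(len(all_product), 'links collected')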
Answer 1 (score: 0)
You can simply loop through the two URLs. But if you are looking for a way to pull the category URLs first and then loop through them, this works:
import time
import csv
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
import pandas as pd
root_url = 'https://www.vatainc.com/'
service = service.Service(r'C:\chromedriver_win32\chromedriver.exe')
service.start()
capabilities = {'chrome.binary': '/Google/Chrome/Application/chrome.exe'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(root_url)
time.sleep(2)
# Grab the urls, but only keep the ones of interest
urls = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//ol[contains(@class, 'nav-primary')]/li/a")]
urls = [ x for x in urls if 'html' in x ]
# It produces duplicates, so drop those and include ?limit=all to query all products
urls_list = pd.Series(urls).drop_duplicates().tolist()
urls_list = [ x +'?limit=all' for x in urls_list]
driver.close()
all_product = []
# loop through those urls and the links to generate a final product list
for url in urls_list:
    print('Url: ' + url)
    driver = webdriver.Remote(service.service_url, capabilities)
    driver.get(url)
    time.sleep(2)
    links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(@class, 'product-name')]/a")]
    for link in links:
        html = requests.get(link).text
        soup = BeautifulSoup(html, "html.parser")
        products = soup.findAll("div", {"class": "product-view"})
        all_product.append(link)
        print(link)
    driver.close()
This produces a list of 303 links.
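If you also want to save the collected links, the csv module that is already imported in the question could be used for that; a small sketch (the output filename here is only an example):

import csv

# Write one product link per row; 'product_links.csv' is a hypothetical filename
with open('product_links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for link in all_product:
        writer.writerow([link])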