I'm sure this is easy, but somehow I'm stuck on getting the href from the a tags that link to each product detail page. I also don't see any JavaScript. What am I missing?
import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
urls = [
'https://undefeated.com/search?type=product&q=nike'
]
final = []
with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        href = soup.find_all['href']
        print(href)
Output:
[]
I then tried soup.find_all('a'), which does spit out a big list, including the hrefs I'm looking for, but I still can't extract just the href...
Answer (score: 1)
You just need to find all the a tags and then print their href attribute. Your requests.Session code should look like this:
with requests.Session() as s:
    for url in urls:
        driver = webdriver.Firefox()
        driver.get(url)
        # wait until the product grid is visible so the page source is fully rendered
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        # find_all('a') returns Tag objects; read the attribute with .get('href')
        a_links = soup.find_all('a')
        for a in a_links:
            print(a.get('href'))
That will print all of the links. Note that a.get('href') returns None for anchors without an href attribute, whereas a['href'] would raise a KeyError.
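If you only want the links to the product detail pages rather than every anchor on the page, you can restrict the search to the product grid and de-duplicate the hrefs. A minimal sketch, continuing from the soup object above; the div.product-grid-item selector comes from the XPath in the question, but treating its child anchors as the product links is an assumption about this page's markup:

# Minimal sketch: collect unique hrefs from anchors inside the product grid.
# Assumption: each product tile ('product-grid-item') wraps an <a> pointing
# at that product's detail page.
product_links = set()
for tile in soup.select('div.product-grid-item'):
    for a in tile.find_all('a', href=True):  # href=True skips anchors without the attribute
        product_links.add(a['href'])
for link in sorted(product_links):
    print(link)

As an aside, the requests.Session (s) is never actually used here, so the with requests.Session() as s: wrapper could be dropped; Selenium fetches the page on its own.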