BeautifulSoup href returns an empty list

Asked: 2019-06-12 01:39:42

Tags: web-scraping beautifulsoup

I'm sure this is easy, but somehow I'm stuck on pulling the href off the a links that lead to each product detail page. I also don't see any JavaScript getting in the way. What am I missing?

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd

urls = [
    'https://undefeated.com/search?type=product&q=nike'
] 

final = []
with requests.Session() as s:
    for url in urls:
        driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item  ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        href = soup.find_all['href']
        print(href)

Output: []

Then I tried soup.find_all('a'), and that did spit out a big chunk of HTML, including the hrefs I'm after, but I still can't extract just the href values...
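For reference, a sketch of what that attempt returns (the list holds whole Tag objects, so printing dumps full HTML rather than bare URLs):

a_links = soup.find_all('a')
print(a_links[0])        # prints the whole tag, e.g. <a href="...">...</a>
print(type(a_links[0]))  # <class 'bs4.element.Tag'> - a tag object, not a string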

1 Answer:

Answer 0 (score: 1)

You just need to find all the a tags and then print each tag's href attribute.

Your requests.Session code should then look like this (this example happens to use the Firefox driver, but the same applies with Chrome):

with requests.Session() as s:
    for url in urls:
        driver = webdriver.Firefox()
        driver.get(url)
        products = [element for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='product-grid-item  ']")))]
        soup = bs(driver.page_source, 'lxml')
        time.sleep(1)
        a_links = soup.find_all('a')  # every <a> tag on the page
        for a in a_links:
            print(a.get('href'))  # .get() returns None instead of raising if an <a> has no href

That will then print all of the links.
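As a follow-up, if you only want the product detail links rather than every anchor on the page, a slightly tighter variant may help. This is a sketch, not part of the answer above: the div.product-grid-item class is borrowed from the question's XPath, and urljoin resolves any relative hrefs against the page URL:

from urllib.parse import urljoin

soup = bs(driver.page_source, 'lxml')
# keep only anchors that sit inside a product grid cell and actually carry an href
for a in soup.select('div.product-grid-item a[href]'):
    print(urljoin(url, a['href']))  # relative paths like /products/... become absolute

The same can also be done without BeautifulSoup at all, since the Selenium elements are already in hand: driver.find_elements(By.CSS_SELECTOR, 'div.product-grid-item a') combined with element.get_attribute('href') returns the links directly, and get_attribute('href') already gives absolute URLs.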