I am trying to get the list of trainings from the link below using the following Python code:
from selenium import webdriver
url = 'https://www.cbtnuggets.com/search'
browser = webdriver.Chrome()
browser.get(url)
browser.implicitly_wait(30)
print(browser.find_element_by_tag_name("table").text)
browser.quit()
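Usually I get only the table header as output:
Course Title Trainer Rating Vendor IT Path Skill Level
But this output is inconsistent: once or twice out of 20 attempts it printed the whole table (listing all the trainings on that page), and I cannot get that result reliably.
I adjusted implicitly_wait(30) between 30 and 60 seconds, but that did not fix it. I can also see that the AJAX content loads fine within the 30-second timer.
My requirement: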
https://www.cbtnuggets.com/it-training/isc2-cissp-2015
So the output should have the following table headers
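Answer 0 (score: 0)
Try this to get the required content. The table always comes with its header, whether or not you wait for it. The body content, however, is generated dynamically, so you should make the script wait until that content is available.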
from selenium import webdriver
url = 'https://www.cbtnuggets.com/search'
browser = webdriver.Chrome()
browser.get(url)
browser.implicitly_wait(30)
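# Each body row is looked up under the implicit wait set above.
# Note: Selenium 4.3+ removed find_elements_by_css_selector; use find_elements(By.CSS_SELECTOR, ...) there.
for items in browser.find_elements_by_css_selector("table tbody tr"):
    data = [item.get_attribute("href") for item in items.find_elements_by_css_selector("a")]
    print(data)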
browser.quit()
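The point about waiting for the dynamically generated rows can also be made explicit with a polling loop instead of relying only on the implicit wait. The sketch below is a generic helper, not part of Selenium: `wait_for_rows`, its parameters, and the stand-in `fake_find` are hypothetical names used for illustration. Selenium's own `WebDriverWait` offers the same idea for real drivers.

```python
import time


def wait_for_rows(find_rows, timeout=30.0, poll=0.5):
    """Call find_rows() repeatedly until it returns a non-empty list.

    Raises TimeoutError if nothing is found within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        rows = find_rows()
        if rows:
            return rows
        if time.monotonic() >= deadline:
            raise TimeoutError("no table rows appeared within %.1f s" % timeout)
        time.sleep(poll)


# Stand-in for a page whose rows only appear on the third poll (hypothetical).
attempts = {"count": 0}


def fake_find():
    attempts["count"] += 1
    return ["row 1", "row 2"] if attempts["count"] >= 3 else []


print(wait_for_rows(fake_find, timeout=5, poll=0.1))
```

With a real browser, the callable could be something like `lambda: browser.find_elements_by_css_selector("table tbody tr")`, so the loop ends only once at least one body row exists.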
To shorten the execution time, you can combine BeautifulSoup with selenium, as follows:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.cbtnuggets.com/search'
driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(30)
# Look up a body row first so the implicit wait blocks until the dynamic content exists
driver.find_elements_by_css_selector("table tbody tr")
soup = BeautifulSoup(driver.page_source, "lxml")  # if you haven't installed "lxml" yet, try replacing it with "html.parser"
for items in soup.select("table tbody tr"):
    data = [item.get("href") for item in items.select("a")]
    print(data)
driver.quit()
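If neither lxml nor BeautifulSoup is available, the same per-row link extraction can be approximated with the standard library's `html.parser`. This is a minimal sketch under the assumption that the page source contains a plain `<table>` with a `<tbody>`; the class `RowLinkParser`, the function `links_per_row`, and the sample markup are hypothetical and are not taken from the site.

```python
from html.parser import HTMLParser


class RowLinkParser(HTMLParser):
    """Collect the href of every <a> inside each <tr> of a <tbody>."""

    def __init__(self):
        super().__init__()
        self.in_tbody = False
        self.rows = []       # one list of hrefs per body row
        self.current = None  # hrefs of the row being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.in_tbody = True
        elif tag == "tr" and self.in_tbody:
            self.current = []
        elif tag == "a" and self.current is not None:
            href = dict(attrs).get("href")
            if href:
                self.current.append(href)

    def handle_endtag(self, tag):
        if tag == "tr" and self.current is not None:
            self.rows.append(self.current)
            self.current = None
        elif tag == "tbody":
            self.in_tbody = False


def links_per_row(html):
    """Return a list with one list of hrefs for each table body row."""
    parser = RowLinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup standing in for driver.page_source
sample = """
<table>
  <thead><tr><th><a href="/sort">Course Title</a></th></tr></thead>
  <tbody>
    <tr><td><a href="/it-training/isc2-cissp-2015">CISSP</a></td></tr>
    <tr><td>No link here</td></tr>
  </tbody>
</table>
"""
for row in links_per_row(sample):
    print(row)
```

Header links are skipped because rows are only collected while the parser is inside `<tbody>`, which mirrors the `table tbody tr` selector used above.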
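Answer 1 (score: 0)
You can do this by recreating the XHR API request that the page makes to retrieve the catalog information, and then processing the JSON response. I welcome suggestions on how to remove the repeated loops over data. I considered unpacking everything in a single loop, but even if that worked it would be hard to follow. As it stands, it is still fast.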
import requests
import pandas as pd
base = 'https://www.cbtnuggets.com/it-training/'
response = requests.get('https://api.cbtnuggets.com/site-gateway/v1/all/courses/for/search?archive=false')
data = response.json()
titles = [item['title'] for item in data]
trainers = [item['trainers'][0]['name'] for item in data]
ratings = [item['rating'] for item in data]
vendors = [item['vendors'][0]['display'] if len(item['vendors']) != 0 else 'N/A' for item in data]
paths = [item['paths'][0]['path_label'] for item in data]
skillLevel = [item['difficulty']['display'] for item in data]
links = [base + item['seoslug'] for item in data]
df = pd.DataFrame({
    'Course Title': titles,
    'Trainer': trainers,
    'Rating': ratings,
    'Vendor': vendors,
    'IT Path': paths,
    'Skill Level': skillLevel,
    'Course URL': links,
})
print(df)
df.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf-8',index = False )
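On the question of removing the repeated loops over data: one possible approach is to build each output row in a single pass with a small helper. This is only a sketch. The names `first` and `course_row` are hypothetical, the sample record is made up to match the fields the code above reads, and unlike the original it also falls back to 'N/A' when the trainers or paths list is empty, which is an assumption about the data rather than something the API is known to return.

```python
BASE = 'https://www.cbtnuggets.com/it-training/'


def first(items, key, default='N/A'):
    """Return items[0][key], or `default` when the list is empty."""
    return items[0][key] if items else default


def course_row(item, base=BASE):
    """Build one output row (a dict keyed by column name) from one course record."""
    return {
        'Course Title': item['title'],
        'Trainer': first(item['trainers'], 'name'),
        'Rating': item['rating'],
        'Vendor': first(item['vendors'], 'display'),
        'IT Path': first(item['paths'], 'path_label'),
        'Skill Level': item['difficulty']['display'],
        'Course URL': base + item['seoslug'],
    }


# Hypothetical record shaped like the fields read above (not real API output)
sample_data = [{
    'title': 'Example Course',
    'trainers': [{'name': 'Example Trainer'}],
    'rating': 4.5,
    'vendors': [],
    'paths': [{'path_label': 'Security'}],
    'difficulty': {'display': 'Intermediate'},
    'seoslug': 'isc2-cissp-2015',
}]

# One loop over the records instead of one loop per column
rows = [course_row(item) for item in sample_data]
print(rows[0])
```

With the real response, `rows = [course_row(item) for item in data]` followed by `pd.DataFrame(rows)` should give the same columns, since `pd.DataFrame` accepts a list of dicts.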