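I am trying to extract articles (their titles and links) on a given topic, for example machine learning, from this website: https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance
The div tags I need to access are nested under several other div tags.
Here is what I have tried so far. I get empty lists. Any help is appreciated.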
import time
from selenium import webdriver

# Get all the paper URLs in the search result
def paper_crawler():
    driver = webdriver.Firefox('path')
    driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry')
    result_counts = driver.find_elements_by_xpath('//*[@class="result-count"]')
    print(result_counts)
    for item in result_counts:
        count = item.text
        print(count)
    #search_result_urls = driver.find_elements_by_xpath('.//div[contains(@class,"result-page")]/article/header/div/a')
    search_result_urls = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a')
    print(search_result_urls)
    for item in search_result_urls:
        paper_url = item.get_attribute('href')
        print(paper_url)
    search_result_titles = driver.find_elements_by_xpath('//*[@class="result-page"]/article/header/div/a/span')
    for item in search_result_titles:
        paper_title = item.text
        print(paper_title)
    time.sleep(2)

if __name__ == '__main__':
    paper_crawler()
Answer 0 (score: 1)
Better to use the API; it makes your life easier and spares you from parsing the page.
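As a rough illustration of the API route, here is a minimal sketch using only the standard library. It assumes the public Semantic Scholar Graph API search endpoint and its `query`, `fields` and `limit` parameters; the answer does not say which API it means, and the helper names (`build_search_url`, `extract_titles_and_links`, `search_papers`) are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Semantic Scholar Graph API paper search endpoint.
SEARCH_ENDPOINT = "https://api.semanticscholar.org/graph/v1/paper/search"


def build_search_url(query, limit=10):
    """Build a search URL that requests only each paper's title and URL."""
    params = {"query": query, "fields": "title,url", "limit": limit}
    return SEARCH_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_titles_and_links(payload):
    """Turn a decoded JSON response into a list of (title, url) pairs."""
    return [(paper.get("title"), paper.get("url"))
            for paper in payload.get("data", [])]


def search_papers(query, limit=10):
    """Send the request; needs network access and may be rate limited."""
    url = build_search_url(query, limit)
    with urllib.request.urlopen(url, timeout=30) as response:
        return extract_titles_and_links(json.load(response))


# Usage (not executed here because it hits the network):
#   for title, link in search_papers("machine learning"):
#       print("{} link is {}".format(title, link))
```

Since the response is JSON, there is no browser to drive and no rendering to wait for, which is the main appeal over Selenium for this task.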
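Answer 1 (score: 0)
By the time you start searching for the elements, the page has loaded but has not finished rendering.
Adding time.sleep(5) right after driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry') should work as a quick fix.
For a better, more robust solution, wait for up to a few seconds until result_counts contains more than 0 elements or the page turns out to be an error page (https://www.semanticscholar.org/search?q=learning333&sort=relevance&fos=chemistry).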
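The more robust approach could be sketched as a small polling helper. This is only an illustration: `wait_for_either` and its parameters are hypothetical names, and how to recognise the error page is left to a caller-supplied check because the answer does not say which element identifies it.

```python
import time


def wait_for_either(has_results, is_error_page, timeout=10.0, poll=0.5):
    """Poll until results show up or an error page is detected.

    has_results and is_error_page are zero-argument callables returning bool.
    Returns "results", "error", or "timeout".
    """
    deadline = time.monotonic() + timeout
    while True:
        if has_results():
            return "results"
        if is_error_page():
            return "error"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll)


# Usage with the driver from the question (Selenium 4 style; the error-page
# check is a placeholder you would have to write for the real page):
#   from selenium.webdriver.common.by import By
#   outcome = wait_for_either(
#       lambda: len(driver.find_elements(By.XPATH, '//*[@class="result-count"]')) > 0,
#       lambda: page_shows_error(driver),  # hypothetical helper
#   )
```

Unlike a fixed time.sleep(5), this returns as soon as either condition holds and gives up after the timeout instead of hanging.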
Answer 2 (score: 0)
To extract the title and the href attribute of each article, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategies:
Code block:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.semanticscholar.org/search?q=machine%20learning&sort=relevance&fos=chemistry')
my_titles = [my_elem.text for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-selenium-selector='title-link']>span")))]
my_hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-selenium-selector='title-link']")))]
for i, j in zip(my_titles, my_hrefs):
    print("{} link is {}".format(i, j))
driver.quit()
Console output:
UCI Repository of Machine Learning Databases link is https://www.semanticscholar.org/paper/UCI-Repository-of-Machine-Learning-Databases-Blake/e068be31ded63600aea068eacd12931efd2a1029
Energy landscapes for machine learning. link is https://www.semanticscholar.org/paper/Energy-landscapes-for-machine-learning.-Ballard-Das/735d4099d3be0d919ddedb054043e6763205e0f7
Finding Nature′s Missing Ternary Oxide Compounds Using Machine Learning and Density Functional Theory. link is https://www.semanticscholar.org/paper/Finding-Nature%E2%80%B2s-Missing-Ternary-Oxide-Compounds-Hautier-Fischer/e3ab9e1162fc8f63d215dfdb21801ef5e1fde7b5
Distributed secure quantum machine learning link is https://www.semanticscholar.org/paper/Distributed-secure-quantum-machine-learning-Sheng-Zhou/ef944614bfc82b1dedfea19ff249a97ceea5ad90
Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. link is https://www.semanticscholar.org/paper/Neural-Symbolic-Machine-Learning-for-Retrosynthesis-Segler-Waller/71cc9eefb17d7c4d1062162523b5fdad7ca66a2a
Transferable Machine-Learning Model of the Electron Density link is https://www.semanticscholar.org/paper/Transferable-Machine-Learning-Model-of-the-Electron-Grisafi-Fabrizio/f809258b65a00a06f9584e76620e6c6395cf81eb
Crystal structure representations for machine learning models of formation energies link is https://www.semanticscholar.org/paper/Crystal-structure-representations-for-machine-of-Faber-Lindmaa/1bdca98dc8c730ee92d5b19d2973a5bf461a500a
Machine learning for quantum mechanics in a nutshell link is https://www.semanticscholar.org/paper/Machine-learning-for-quantum-mechanics-in-a-Rupp/29b9ff8f4a26acc90e6182e1e749f15f688bc7cf
Machine-Learning-Augmented Chemisorption Model for CO2 Electroreduction Catalyst Screening. link is https://www.semanticscholar.org/paper/Machine-Learning-Augmented-Chemisorption-Model-for-Ma-Li/d6f30032c8fac43a8eabf2b67d2e84db6d3d0409
Adaptive machine learning framework to accelerate ab initio molecular dynamics link is https://www.semanticscholar.org/paper/Adaptive-machine-learning-framework-to-accelerate-Botu-Ramprasad/c9934d684fcc0b8ac6ed25b34d96e726cf2d7b99