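My code opens a web page that lists several entries, collects their URLs, and puts them into a list.
It then goes through that list of URLs one by one and scrapes each presentation.
Right now I scrape the title of every presentation (you can see this if you run the code), but each title also contains another URL/href that I want.
Is there a way to scrape that as well?
Thanks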
from selenium import webdriver
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

val = []
driver = webdriver.Chrome()
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/sessions/@sessiontype=Advances%20in%20Diagnostics%20and%20Therapeutics/{x}')
    time.sleep(9)
    page_source = driver.page_source
    eachrow = ["https://www.abstractsonline.com/pp8/#!/9325/session/" + x.get_attribute('data-id') for x in driver.find_elements_by_xpath('//*[@id="results"]/li//h1[@class="name"]')]
    for row in eachrow:
        val.append(row)
        print(row)

for b in val:
    driver.get(b)
    time.sleep(3)
    page_source1 = driver.page_source
    soup = BeautifulSoup(page_source1, 'html.parser')
    productlist = soup.find_all('a', class_='title color-primary')
    for item in productlist:
        presentationTitle = item.text.strip()
        print(presentationTitle)
Answer 0 (score: 1)
I think you need some wait conditions, and then you can extract the href attribute of each presentation on the page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
base = 'https://www.abstractsonline.com/pp8/#!/9325/session/'
for x in range(1, 3):
    driver.get(f'https://www.abstractsonline.com/pp8/#!/9325/sessions/@sessiontype=Advances%20in%20Diagnostics%20and%20Therapeutics/{x}')
    # Wait up to 10 s for the session entries, then build each session URL from its data-id.
    links = [base + i.get_attribute('data-id') for i in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "li .name")))]
    for link in links:
        driver.get(link)
        # Wait for the session title to be present before reading the presentations.
        print(WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "spnSessionTitle"))).text)
        # find_elements(By..., ...) replaces find_elements_by_css_selector, which newer Selenium releases no longer provide.
        for presentation in driver.find_elements(By.CSS_SELECTOR, '.title'):
            print(presentation.text.strip())
            print('https://www.abstractsonline.com/pp8' + presentation.get_attribute('href'))
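Depending on how the page writes its links and how the driver reports them, the href you get back may be relative (for example a bare `#!/...` fragment) or already absolute. A minimal sketch of normalizing both cases with the standard library's `urljoin` is below. The helper name `absolute_href` and the presentation id `1234` are made up for illustration; only the base URL comes from the code above.

```python
from urllib.parse import urljoin

# Base page that relative links are resolved against (taken from the question's URLs).
PAGE_URL = "https://www.abstractsonline.com/pp8/"


def absolute_href(href, page_url=PAGE_URL):
    """Return an absolute URL whether href is relative or already absolute."""
    return urljoin(page_url, href)


# Hypothetical hrefs, for illustration only.
print(absolute_href("#!/9325/presentation/1234"))
print(absolute_href("https://www.abstractsonline.com/pp8/#!/9325/presentation/1234"))
```

Both calls print the same absolute URL, so this could stand in for the string concatenation in the last `print` if the concatenated result ever comes out with the base repeated.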
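Answer 1 (score: 0)
from selenium.webdriver.common.by import By
# Assumes `driver` is an existing WebDriver. Partial link text matches the visible text of <a> elements, not their href.
links = driver.find_elements(By.PARTIAL_LINK_TEXT, 'https://yourlinks.com/?action=')
for link in links:
    print(link.get_attribute("href"))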
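Since a partial-link-text lookup compares against the text shown on the page, it only finds these links when the URL itself is the visible text. If the goal is to match on the href value instead, a CSS attribute selector is one alternative. The sketch below is an assumption about what this answer intends, not a tested solution for the site in the question. `hrefs_containing` is a hypothetical helper, and it passes the locator strategy as the plain string `"css selector"`, which is the value of Selenium's `By.CSS_SELECTOR`.

```python
def hrefs_containing(driver, fragment):
    """Return the href of every <a> element whose href attribute contains `fragment`.

    `driver` is expected to be a Selenium WebDriver (or anything exposing
    the same find_elements(by, value) method). `fragment` must not contain
    double quotes, because it is placed inside a quoted CSS attribute value.
    """
    # "css selector" is the string behind selenium's By.CSS_SELECTOR.
    anchors = driver.find_elements("css selector", f'a[href*="{fragment}"]')
    hrefs = (anchor.get_attribute("href") for anchor in anchors)
    # Skip anchors for which the driver reports no href.
    return [href for href in hrefs if href]
```

With a live driver this would be called as `hrefs_containing(driver, '?action=')`, using the placeholder fragment from the snippet above.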