Web scraping this field

Date: 2021-04-29 01:22:49

Tags: javascript python selenium web-scraping beautifulsoup

My code navigates to a web page and identifies each block on the page.

Each block contains the same information in the same style and format.

However, when I try to get the title, nothing is pulled out.

Ideally, I want the title, abstract, and authors.

Here is the code I am currently using to try to get the title via XPath.


from selenium import webdriver
from bs4 import BeautifulSoup
import time
driver = webdriver.Chrome()

driver.get('https://meetinglibrary.asco.org/results?filters=JTVCJTdCJTIyZmllbGQlMjIlM0ElMjJmY3RNZWV0aW5nTmFtZSUyMiUyQyUyMnZhbHVlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclMjIlMkMlMjJxdWVyeVZhbHVlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclMjIlMkMlMjJjaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMCUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIwJTIyJTdEJTJDJTdCJTIyZmllbGQlMjIlM0ElMjJZZWFyJTIyJTJDJTIydmFsdWUlMjIlM0ElMjIyMDIxJTIyJTJDJTIycXVlcnlWYWx1ZSUyMiUzQSUyMjIwMjElMjIlMkMlMjJjaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMSUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIxJTIyJTdEJTVE')
time.sleep(4)
page_source = driver.page_source
soup=BeautifulSoup(page_source,'html.parser')
productlist=soup.find_all('div',class_='ng-star-inserted')
for item in productlist:
    title=item.find_element_by_xpath("//span[@class='ng-star-inserted']").text
    print(title)

2 Answers:

Answer 0 (score: 1)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 40)
driver.get('https://meetinglibrary.asco.org/results?filters=JTVCJTdCJTIyZmllbGQlMjIlM0ElMjJmY3RNZWV0aW5nTmFtZSUyMiUyQyUyMnZhbHVlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclMjIlMkMlMjJxdWVyeVZhbHVlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclMjIlMkMlMjJjaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMCUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIwJTIyJTdEJTJDJTdCJTIyZmllbGQlMjIlM0ElMjJZZWFyJTIyJTJDJTIydmFsdWUlMjIlM0ElMjIyMDIxJTIyJTJDJTIycXVlcnlWYWx1ZSUyMiUzQSUyMjIwMjElMjIlMkMlMjJjaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMSUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIxJTIyJTdEJTVE')

# Wait for the result records to be present before reading them.
productList = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='record']")))
for product in productList:
    title = product.find_element_by_xpath(".//span[@class='ng-star-inserted']").text
    print(title)

Use `.//` (a relative XPath) and wait for the elements to be present. The div class you were using was also wrong.
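To see why the leading `//` matters: in Selenium, an XPath that begins with `//` searches from the document root even when called on an individual element, so every loop iteration returns the first matching span on the page. The relative form `.//` stays inside the element. A minimal sketch of the relative-path behavior using the standard library's ElementTree (the markup below is a made-up stand-in for the real page):

```python
import xml.etree.ElementTree as ET

# Illustrative markup only -- a simplified stand-in for the real page.
html = """<root>
  <div class="record"><span class="ng-star-inserted">Title A</span></div>
  <div class="record"><span class="ng-star-inserted">Title B</span></div>
</root>"""

root = ET.fromstring(html)

# A search relative to each record (leading ".") stays inside that record,
# so every iteration yields that record's own span.
titles = [rec.find(".//span[@class='ng-star-inserted']").text
          for rec in root.findall(".//div[@class='record']")]
print(titles)  # ['Title A', 'Title B']
```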

Output:

A post-COVID survey of current and future parents among faculty, trainees, and research staff at an...
Novel approach to improve the diagnosis of pediatric cancer in Kenya via telehealth education.
Sexual harassment of oncologists.
Overall survival with circulating tumor DNA-guided therapy in advanced non-small cell lung cancer.

The XPaths for the other two fields are:

.//div[@class='record__ellipsis']
.//span[.=' Abstract ']/following::span
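Putting the three XPaths together, one loop over the records might look like the sketch below. This is untested against the live site; it assumes the `wait`/`productList` session from the code above, and it assumes (which you should verify against the page) that `record__ellipsis` holds the authors and the span following the ' Abstract ' label holds the abstract text:

```python
for product in productList:
    # Title: first inserted span inside this record.
    title = product.find_element_by_xpath(".//span[@class='ng-star-inserted']").text
    # Assumed to be the authors line -- check against the page.
    authors = product.find_element_by_xpath(".//div[@class='record__ellipsis']").text
    # Assumed to be the abstract text following the ' Abstract ' label.
    abstract = product.find_element_by_xpath(".//span[.=' Abstract ']/following::span").text
    print(title, authors, abstract, sep='\n')
```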

Answer 1 (score: 1)

Try the code below, and let me know if you have any questions:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 60)
driver.get(
    'https://meetinglibrary.asco.org/results?filters=JTVCJTdCJTIyZmllbGQlMjIlM0ElMjJmY3RNZWV0aW5nTmFtZSUyMiUyQyUyMnZhbH'
    'VlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclMjIlMkMlMjJxdWVyeVZhbHVlJTIyJTNBJTIyQVNDTyUyMEFubnVhbCUyME1lZXRpbmclM'
    'jIlMkMlMjJjaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMCUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIwJTIyJTdEJTJDJTdCJTIy'
    'ZmllbGQlMjIlM0ElMjJZZWFyJTIyJTJDJTIydmFsdWUlMjIlM0ElMjIyMDIxJTIyJTJDJTIycXVlcnlWYWx1ZSUyMiUzQSUyMjIwMjElMjIlMkMlMjJ'
    'jaGlsZHJlbiUyMiUzQSU1QiU1RCUyQyUyMmluZGV4JTIyJTNBMSUyQyUyMm5lc3RlZFBhdGglMjIlM0ElMjIxJTIyJTdEJTVE')

AllRecords = wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class=\"record\"]")))

for SingleRecord in AllRecords:
    print("Title :- " + SingleRecord.find_element_by_xpath(
        "./descendant::div[contains(@class,\"record__title\")]/span").text)
    print("Author :- " + SingleRecord.find_element_by_xpath(
        "./descendant::div[contains(text(),\"Author\")]/following-sibling::div").text)
    print("Abstract :- " + SingleRecord.find_element_by_xpath(
        "./descendant::span[contains(text(),\"Abstract\")]/parent::div/following-sibling::span").text)
    print("-------------------------------------------------")

If this resolves your issue, please mark it as the answer.