I want to scrape the graph data on a webpage. For this I am using Selenium in Python (PyCharm). This is my code so far:
from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
Researcher=driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""") .click()
Graph=driver.find_elements_by_id("gsc_md_hist_b")
print(Graph.text)
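The code works fine up to the point where I have to get the information from the graph (the years and the number of citations per year); the response is that there is no text to scrape. Could you give me some ideas on how to collect the information I need?
Thanks a lot, Ivan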
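Answer 0: (score: 0)
You can try using XPath with the class attribute and then fetch the text of all matching spans as a list. Please check the following untested code: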
from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
# Click the small chart in the sidebar to open the full histogram
driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""").click()
#Graph=driver.find_elements_by_id("gsc_md_hist_b")
#Graph=driver.find_elements_by_xpath('//div[@class=".gsc_md_hist_b"]//span[@class=".gsc_g_t"]')
# Year labels under the bars
Graph=driver.find_elements_by_xpath("//span[@class='gsc_g_t']")
for spanText in Graph:
    print(spanText.text)
# Citation counts attached to the bars
BarValue=driver.find_elements_by_xpath("//span[@class='gsc_g_al']")
for barValueText in BarValue:
    print(barValueText.text)
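One possible reason the question's code reports no text: Selenium's `.text` only returns text that is currently rendered as visible, so spans inside a chart that is hidden or not yet opened come back as empty strings. A minimal sketch of a fallback, assuming the spans do exist in the DOM, is to read the `textContent` attribute when `.text` is empty. The helper name `element_text` and the `FakeElement` stand-in (used only so the snippet runs without a browser) are hypothetical and not part of Selenium.

```python
def element_text(elem):
    """Return the element's visible text, or its raw textContent if that is empty."""
    text = elem.text
    if not text:
        # textContent is read from the DOM regardless of visibility
        text = elem.get_attribute("textContent") or ""
    return text.strip()


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, text, text_content):
        self.text = text
        self._text_content = text_content

    def get_attribute(self, name):
        # Only the attribute used by element_text is modelled here
        return self._text_content if name == "textContent" else None


# A span that is present in the DOM but not visible: .text is empty
hidden_span = FakeElement(text="", text_content=" 2020 ")
print(element_text(hidden_span))
```

With the lists from the code above, this could be called as `[element_text(e) for e in Graph]` in place of reading `.text` directly.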
Answer 1: (score: 0)
To extract the years, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following Locator Strategy:
Using XPATH:
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='gsc_rsb_cit']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']"))).click()
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='gsc_md_hist_c']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']//span[@class='gsc_g_t']")))])
Console output:
['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
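The question also asks for the number of citations per year. Assuming the counts can be read in the same way from the `gsc_g_al` spans mentioned in the other answer, the two lists of strings can then be combined into a year-to-count mapping. This is a minimal sketch in plain Python: `pair_years_with_counts` is a hypothetical helper, and the sample values are made up for illustration rather than real output from the page.

```python
def pair_years_with_counts(year_texts, count_texts):
    """Combine scraped year labels and citation counts into {year: count}."""
    # Drop empty strings, which Selenium returns for elements with no visible text
    years = [t.strip() for t in year_texts if t.strip()]
    counts = [t.strip() for t in count_texts if t.strip()]
    if len(years) != len(counts):
        # A mismatch means the lists cannot be aligned safely by position
        raise ValueError(
            "got %d years but %d counts" % (len(years), len(counts))
        )
    return {int(year): int(count) for year, count in zip(years, counts)}


# Illustrative values only, not taken from the actual profile
print(pair_years_with_counts(["2019", "2020"], ["12", "30"]))
```

The length check is deliberate: if the page leaves out a label for some year, pairing by position would silently shift every later count, so the sketch fails loudly instead.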