Web scraping graph data with Python

Time: 2020-07-20 06:40:28

Tags: python selenium selenium-webdriver xpath webdriverwait

I want to scrape the data behind a graph on a webpage. For this I am using Selenium in Python (PyCharm). This is my code so far:

from selenium import webdriver
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()
Researcher=driver.find_element_by_xpath("""//*[@id="gsc_rsb_cit"]/div/div[3]/div""") .click()
Graph=driver.find_elements_by_id("gsc_md_hist_b")
print(Graph.text)

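The code works fine up to the point where it has to get the information from the graph (the years and the number of citations per year); there the response is that there is no text to scrape. Could you give me some ideas on how to collect the information I need?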

Thank you very much, Ivan

2 Answers:

Answer 0: (score: 0)

You can try using XPath with the class attribute and then collect the text of all the matching spans as a list. Please check the untested code below:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Not used below: webdriver.Firefox() looks for geckodriver on PATH.
mozilla_path = r"C:\Users\ivrav\Python38\geckodriver.exe"
driver = webdriver.Firefox()
driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
driver.maximize_window()

# Click the small citation histogram to open the full chart.
driver.find_element(By.XPATH, '//*[@id="gsc_rsb_cit"]/div/div[3]/div').click()

# Year labels along the x-axis of the chart.
years = driver.find_elements(By.XPATH, "//span[@class='gsc_g_t']")
for year in years:
    print(year.text)

# Citation counts attached to the bars.
bar_values = driver.find_elements(By.XPATH, "//span[@class='gsc_g_al']")
for bar_value in bar_values:
    print(bar_value.text)
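
Once both lists of span texts are available, they still have to be matched up to get "citations per year". Below is a minimal sketch of that step in plain Python. The helper name `pair_years_with_counts` is made up for illustration, and it assumes the page yields exactly one count per year label in the same order. If the chart omits the bar for a year with zero citations, the two lists will differ in length and the helper raises an error rather than pairing them wrongly.

```python
def pair_years_with_counts(year_texts, count_texts):
    """Pair year labels with citation counts and convert both to int.

    Assumes both lists are in the same order and have the same length.
    """
    if len(year_texts) != len(count_texts):
        raise ValueError(
            f"{len(year_texts)} year labels but {len(count_texts)} counts; "
            "the lists cannot be paired by position"
        )
    return {int(year): int(count) for year, count in zip(year_texts, count_texts)}


# Example with made-up numbers (not real data from the page):
citations = pair_years_with_counts(["2018", "2019"], ["12", "30"])
print(citations)  # {2018: 12, 2019: 30}
```

With Selenium, the inputs would be something like `[e.text for e in Graph]` and `[e.text for e in BarValue]`.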

Answer 1: (score: 0)

To extract the years, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:

  • Using XPATH:

    driver.get("https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en")
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@id='gsc_rsb_cit']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']"))).click()
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@id='gsc_md_hist_c']//div[@class='gsc_md_hist_w']/div[@class='gsc_md_hist_b']//span[@class='gsc_g_t']")))])
    
  • Console output:

    ['2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
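
Some background on why the wait matters: Selenium's `.text` returns only the text that is rendered as visible, so elements that are not yet displayed give an empty string. If waiting is not enough, one possible fallback is to read the DOM `textContent` attribute instead. The sketch below is only an illustration; the helper name `element_texts` is hypothetical, and it assumes the objects passed in behave like Selenium WebElements (a `.text` attribute and a `get_attribute()` method).

```python
def element_texts(elements):
    """Return the text of each element.

    Falls back to the DOM "textContent" attribute when .text is empty,
    which is what Selenium returns for elements that are not displayed.
    """
    texts = []
    for element in elements:
        text = element.text
        if not text:
            # textContent is read from the DOM regardless of visibility.
            text = element.get_attribute("textContent") or ""
        texts.append(text.strip())
    return texts


# Hypothetical usage with the locator from this answer:
# years = element_texts(driver.find_elements(By.XPATH, "//span[@class='gsc_g_t']"))
```

Waiting for visibility is still the closer match to what a user sees; the fallback just avoids silently getting empty strings.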