Trying to get text from a div class using selenium in python

Time: 2018-06-07 05:02:29

Tags: python selenium

The HTML div class containing the data I want to print:


<div class="gs_a">LR Binford&nbsp;- American antiquity, 1980 - cambridge.org </div>

This is my code so far:

from selenium import webdriver

def Author (SearchVar):
    driver = webdriver.Chrome("/Users/tutau/Downloads/chromedriver")
    driver.get ("https://scholar.google.com/")
    SearchBox = driver.find_element_by_id ("gs_hdr_tsi")
    SearchBox.send_keys(SearchVar)
    SearchBox.submit()
    At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
    print (At)

Author("dog")

All that shows up when I print is

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")

and not the text. I'm new to selenium, so any help is appreciated.

3 answers:

Answer 0: (score: 1)

Introduction

First, I suggest using a faster parser to select your target on selenium's page_source.

    import lxml
    import lxml.html

    # put this below SearchBox.submit()
    CSS_SELECTOR = '#gs_res_ccl_mid > :nth-child(1) > .gs_ri > .gs_a'  # Define css
    source = driver.page_source  # Get all html
    At_raw = lxml.html.document_fromstring(source)  # Convert
    At = At_raw.cssselect(CSS_SELECTOR)  # Select by CSS (returns a list of matches)

Solution 1

Then you need to extract text_content() from the web element and encode it properly.

    # cssselect() returns a list, so take the first match before reading its text.
    # The encode step matters on Python 2 only; on Python 3 it can be dropped.
    At = At[0].text_content().encode('utf-8')  # Get text and encode
    print(At)

Solution 2

If At contains multiple lines and unicode, you can also remove them.
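The cleanup mentioned in Solution 2 could look roughly like the sketch below. It is an illustration using only the standard library, not the answerer's original snippet; the helper name clean_text and the sample string (modelled on the div in the question, with a line break added) are assumptions.

```python
import unicodedata


def clean_text(raw):
    """Collapse whitespace (newlines, non-breaking spaces) and drop non-ASCII characters.

    Hypothetical helper, not part of selenium or lxml.
    """
    # NFKC maps compatibility characters such as U+00A0 (non-breaking space) to plain ones
    normalized = unicodedata.normalize("NFKC", raw)
    # Discard whatever still falls outside ASCII
    ascii_only = normalized.encode("ascii", "ignore").decode("ascii")
    # split() with no argument breaks on any whitespace run, including newlines
    return " ".join(ascii_only.split())


# Sample text modelled on the question's div; the newline is added for illustration
sample = "LR Binford\u00a0- American antiquity,\n 1980 - cambridge.org "
print(clean_text(sample))
```

Dropping non-ASCII characters is lossy (accented author names would be mangled), so it is probably only worth doing when the output really has to be plain ASCII.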

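Answer 1: (score: 1)

It looks like you were almost there. Judging by the HTML you shared, the locator you used does identify the desired element.

After the following line of code executes: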

At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')

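At refers to the desired element, held as the single item of a list, because find_elements_by_css_selector always returns a list. In the next step, when you call print (At), the WebElement object itself is printed rather than its text, as follows: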

selenium.webdriver.remote.webelement.WebElement (session="9aa956e2bd51f510dd626f6937b01c0e", element="0.6506218589189958-1")
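This is ordinary Python behaviour rather than anything specific to Selenium: printing a list shows the repr of each object in it, not the text the object carries. A minimal sketch, assuming a hypothetical stand-in class FakeElement (it is not part of Selenium and only mimics the .text attribute):

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; it only carries a .text attribute."""

    def __init__(self, text):
        self.text = text


# find_elements_* returns a list, so the result is modelled as a list here
found = [FakeElement("LR Binford - American antiquity, 1980 - cambridge.org")]

printed_list = str(found)                    # what print(found) would display
texts = [element.text for element in found]  # the strings the asker actually wants

print(printed_list)
print(texts)
```

The first print shows an object repr with a memory address, while the second shows the text itself.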

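Solution

Now, per your question, if you want to extract the text LR Binford - American antiquity, 1980 - cambridge.org, you have to call one of the following on the element itself, not on the list:

So you need to change the line of code:

print (At)

to either of the following:

  • Using text

    print(At[0].text)
    
  • Using get_attribute(attributeName)

    print(At[0].get_attribute("innerHTML"))
    
  • Your own code with minor tweaks: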

    # -*- coding: UTF-8 -*-
    from selenium import webdriver
    
    def Author (SearchVar):
    
        options = webdriver.ChromeOptions() 
        options.add_argument("start-maximized")
        options.add_argument('disable-infobars')
        driver=webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
        driver.get ("https://scholar.google.com/")
        SearchBox = driver.find_element_by_name("q")
        SearchBox.send_keys(SearchVar)
        SearchBox.submit()
        At = driver.find_elements_by_css_selector ('#gs_res_ccl_mid > div:nth-child(1) > div.gs_ri > div.gs_a')
        for item in At:
            print(item.text)
    
    Author("dog")
    
  • Console output:

    …, RJ Marles, LS Pellicore, GI Giancaspro, TL Dog - Drug Safety, 2008 - Springer
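The two options above do not return quite the same string: text gives the rendered text, while get_attribute("innerHTML") gives the raw markup, so entities such as the &nbsp; in the question's div can come back undecoded. A minimal sketch of decoding such a string with the standard library, assuming the markup has the form shown in the question:

```python
import html

# Raw markup in the form innerHTML would return it for the div in the question
inner_html = "LR Binford&nbsp;- American antiquity, 1980 - cambridge.org "

# html.unescape turns entities such as &nbsp; into the characters they stand for
decoded = html.unescape(inner_html)

# Swap the non-breaking space for a regular one and trim the trailing blank
plain = decoded.replace("\u00a0", " ").strip()

print(plain)
```

If the element contained nested tags, innerHTML would include those as well, which is one reason text is usually the simpler choice here.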
    

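Answer 2: (score: 0)

You are printing the element itself. Print its text instead, e.g. At[0].text rather than At, since find_elements_by_css_selector returns a list.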