Web scraping with Selenium and BeautifulSoup

Date: 2020-05-26 10:36:25

Tags: python selenium selenium-webdriver beautifulsoup webdriver

Below is the inspected element:

<div class="input-group ref-container "> <input id="sys_display.incident.assignment_group" name="sys_display.incident.assignment_group" aria-labelledby="label.incident.assignment_group" type="search" autocomplete="off" autocorrect="off" value="PeopleSoft Reporting ONLY" ac_columns="u_full_name" data-type="ac_reference_input" data-completer="AJAXTableCompleter" data-dependent="" data-dependent-value="" data-ref-qual="" data-ref="incident.assignment_group" data-ref-key="null" data-ref-dynamic="false" data-name="assignment_group" data-table="sys_user_group" class="form-control element_reference_input " style="; " spellcheck="false" onfocus="if (!this.ac) addLoadEvent(function() {var e = gel('sys_display.incident.assignment_group'); if (!e.ac) new AJAXTableCompleter(gel('sys_display.incident.assignment_group'), 'incident.assignment_group', '', ''); e.ac.onFocus();})" aria-required="true" role="combobox" aria-autocomplete="list" aria-owns="AC.incident.assignment_group"> <span class="ref_dynamic_placeholder">A new record with this value will be created automatically</span> <span class="input-group-btn"> <button id="lookup.incident.assignment_group" name="lookup.incident.assignment_group" type="button" class="btn btn-default" title="Lookup using list" aria-haspopup="true" data-for="sys_display.incident.assignment_group" data-type="ac_reference_input" tabindex="-1" role="button" aria-label="Look up value for field: Assignment group" data-original-title="Lookup using list"> <span class="icon icon-search" aria-hidden="true"> </span> </button> </span> </div>

How can I write the value "PeopleSoft Reporting ONLY" into a variable?

Thanks in advance.

2 answers:

Answer 0 (score: 2)

You can select the element by its id= attribute. For example:

txt = '''<div class="input-group ref-container "><input id="sys_display.incident.assignment_group" name="sys_display.incident.assignment_group" aria-labelledby="label.incident.assignment_group" type="search" autocomplete="off" autocorrect="off" value="PeopleSoft Reporting ONLY" ac_columns="u_full_name" data-type="ac_reference_input" data-completer="AJAXTableCompleter" data-dependent="" data-dependent-value="" data-ref-qual="" data-ref="incident.assignment_group" data-ref-key="null" data-ref-dynamic="false" data-name="assignment_group" data-table="sys_user_group" class="form-control element_reference_input  " style="; " spellcheck="false" onfocus="if (!this.ac) addLoadEvent(function() {var e = gel('sys_display.incident.assignment_group'); if (!e.ac) new AJAXTableCompleter(gel('sys_display.incident.assignment_group'), 'incident.assignment_group', '', ''); e.ac.onFocus();})" aria-required="true" role="combobox" aria-autocomplete="list" aria-owns="AC.incident.assignment_group"><span class="ref_dynamic_placeholder">A new record with this value will be created automatically</span><span class="input-group-btn"><button id="lookup.incident.assignment_group" name="lookup.incident.assignment_group" type="button" class="btn btn-default" title="Lookup using list" aria-haspopup="true" data-for="sys_display.incident.assignment_group" data-type="ac_reference_input" tabindex="-1" role="button" aria-label="Look up value for field: Assignment group" data-original-title="Lookup using list"><span class="icon icon-search" aria-hidden="true"></span></button></span></div>'''

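from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# The dots in the id must be escaped in a CSS selector; the raw string keeps the backslashes intact.
s = soup.select_one(r'#sys_display\.incident\.assignment_group')['value']
print(s)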

Prints:

PeopleSoft Reporting ONLY

This is equivalent to:

s = soup.find(id="sys_display.incident.assignment_group")['value']
print(s)
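
Since the question combines Selenium with BeautifulSoup, here is a minimal sketch of how the lookup above could be wrapped so that it does not fail when the element or the attribute is missing. The helper name read_value and the shortened sample_html are assumptions made for illustration; with Selenium, the HTML string would typically come from driver.page_source.

```python
from bs4 import BeautifulSoup


def read_value(html, element_id):
    """Return the value attribute of the element with this id, or None."""
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find(id=element_id)
    if element is None:
        # find() returns None when nothing matches, so indexing would fail.
        return None
    # .get() returns None instead of raising KeyError if the attribute is absent.
    return element.get("value")


# Shortened, hypothetical stand-in for driver.page_source.
sample_html = (
    '<div class="input-group ref-container">'
    '<input id="sys_display.incident.assignment_group" '
    'value="PeopleSoft Reporting ONLY">'
    "</div>"
)

assignment_group = read_value(sample_html, "sys_display.incident.assignment_group")
print(assignment_group)
```

Note that the page source usually reflects the value attribute as it appears in the markup, which may differ from what a user has typed into the field since the page loaded; in that case reading the value through Selenium's get_attribute("value") is likely the safer route.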

Answer 1 (score: 0)

To extract the text PeopleSoft Reporting ONLY using Selenium, you can use either of the following solutions:

  • Using css_selector:

    print(driver.find_element_by_css_selector("input[id^='sys_display'][name*='incident'][aria-labelledby$='assignment_group']").get_attribute("value"))
    
  • Using xpath:

    print(driver.find_element_by_xpath("//input[starts-with(@id, 'sys_display')][contains(@name, 'incident')][contains(@aria-labelledby, 'assignment_group')]").get_attribute("value"))
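
To make the attribute operators in the CSS selector above easier to follow, here is a small sketch that runs the same selector offline with BeautifulSoup, whose select() understands the same syntax. The shortened HTML and the second, non-matching input are assumptions added for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened version of the element from the question, plus a decoy input.
html = (
    '<input id="sys_display.incident.assignment_group" '
    'name="sys_display.incident.assignment_group" '
    'aria-labelledby="label.incident.assignment_group" '
    'value="PeopleSoft Reporting ONLY">'
    '<input id="other" name="incident.state" value="ignored">'
)
soup = BeautifulSoup(html, "html.parser")

# ^= matches the start of an attribute value,
# *= matches a substring anywhere in it,
# $= matches the end of it.
selector = (
    "input[id^='sys_display']"
    "[name*='incident']"
    "[aria-labelledby$='assignment_group']"
)

matches = soup.select(selector)
print(len(matches), matches[0]["value"])
```

All three conditions must hold at once, so the decoy input is rejected because its id does not start with sys_display even though its name contains incident.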
    

However, as a best practice, to extract or print the desired text you should use WebDriverWait with visibility_of_element_located(), together with either of the following Locator Strategies:

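  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "input[id^='sys_display'][name*='incident'][aria-labelledby$='assignment_group']"))).get_attribute("value"))
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//input[starts-with(@id, 'sys_display')][contains(@name, 'incident')][contains(@aria-labelledby, 'assignment_group')]"))).get_attribute("value"))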
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

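Update

If you get:

  • a NoSuchElementException when using find_element_by_*
  • a TimeoutException when using WebDriverWait

then the element may not be in the top-level content and may instead be inside an <iframe>. NoSuchElementException is covered in more detail in the two discussions linked from the original answer.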