Unable to scrape data using BeautifulSoup

Time: 2018-02-21 05:00:46

Tags: python selenium web-scraping beautifulsoup

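I use Selenium to log in to a web page and fetch the page for scraping. I am able to get the page. I searched the HTML for the table I want to scrape. Here it is: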

<table cellspacing="0" class=" tablehasmenu table hoverable sensors" id="table_devicesensortable">

Here is the script:

rawpage = driver.page_source  # store the web page in a variable
souppage = BeautifulSoup(rawpage, 'html.parser')  # parse the web page
tbody = souppage.find('table', attrs={'id': 'table_devicesensortable'})  # look up the table to scrape

I am able to get the parsed web page in the souppage variable, but I cannot scrape the table and store it in the tbody variable.

3 Answers:

Answer 0: (score: 2)

The required table might be generated dynamically, so you need to wait until it appears on the page:

from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait

tbody = wait(driver, 10).until(EC.presence_of_element_located((By.ID, "table_devicesensortable")))

Also note that there is no need to use BeautifulSoup, since Selenium has enough built-in methods and properties to do the same job for you, e.g.

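# find_elements(By..., ...) works in Selenium 3 and 4; the find_elements_by_* helpers were removed in Selenium 4.3
headers = tbody.find_elements(By.TAG_NAME, "th")
rows = tbody.find_elements(By.TAG_NAME, "tr")
cells = tbody.find_elements(By.TAG_NAME, "td")
cell_values = [cell.text for cell in cells]
etc...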

Answer 1: (score: 0)

Based on the HTML you shared, to scrape the <table> you need to induce WebDriverWait with the expected_conditions clause set to presence_of_element_located. To achieve that, you can use either of the following code blocks:

  • Using id

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # wait up to 20 seconds for the table to be present in the DOM
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
    rawpage=driver.page_source #storing the webpage in variable
    souppage=BeautifulSoup(rawpage,"html.parser") #parsing the webpage
    tbody=souppage.find("table",{"id":"table_devicesensortable"}) #scraping
    
  • Using class

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # wait up to 20 seconds for the table to be present in the DOM
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//table[@class=' tablehasmenu table hoverable sensors' and @id='table_devicesensortable']")))
    rawpage=driver.page_source #storing the webpage in variable
    souppage=BeautifulSoup(rawpage,"html.parser") #parsing the webpage
    # BeautifulSoup splits class values on whitespace, so the leading space is dropped here
    tbody=souppage.find("table",{"class":"tablehasmenu table hoverable sensors"}) #scraping
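
Once `find` returns the table, the cell text still has to be pulled out of it. A minimal sketch, assuming the table uses ordinary `<tr>`, `<th>` and `<td>` markup; the helper name `soup_table_to_rows` and the sample HTML are made up for illustration and are not taken from the real page:

```python
from bs4 import BeautifulSoup


def soup_table_to_rows(table):
    """Return a list of rows, each a list of stripped cell texts."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["th", "td"])  # header and data cells, in document order
        if cells:
            rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Made-up sample standing in for driver.page_source.
sample_html = """
<table id="table_devicesensortable">
  <tr><th>Sensor</th><th>Status</th></tr>
  <tr><td> Ping </td><td>Up</td></tr>
</table>
"""
table = BeautifulSoup(sample_html, "html.parser").find(
    "table", attrs={"id": "table_devicesensortable"}
)
rows = soup_table_to_rows(table)
```

With the real page, `table` would be the `tbody` variable from the code above rather than the sample.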

Answer 2: (score: 0)

I searched Stack Overflow for this problem and found this post:

BeautifulSoup returning none when element definitely exists

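From reading the answer provided by luiyezheng, I got the hint that the data may be loaded dynamically. So the table is probably created dynamically, which is why I could not find it.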
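
The effect described here can be reproduced without a browser. A small sketch, assuming two made-up HTML snapshots (one taken before the page's script has inserted the table, one after); only the table id comes from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical snapshots of driver.page_source at two points in time.
html_before = '<html><body><div id="content"></div></body></html>'
html_after = (
    '<html><body><div id="content">'
    '<table id="table_devicesensortable"><tr><td>Ping</td></tr></table>'
    '</div></body></html>'
)


def find_sensor_table(html):
    """Parse the HTML and return the sensor table, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("table", attrs={"id": "table_devicesensortable"})


early = find_sensor_table(html_before)  # table not in the markup yet
late = find_sensor_table(html_after)    # table present in the markup
```

If `page_source` is read too early, `find` behaves like the first call and returns `None` even though the table shows up in the browser a moment later.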

So the workaround is:

I added a delay before storing the web page.

So the code looks like this:

[code snippet missing from the source]
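
A minimal sketch of the kind of delay described above, assuming a fixed pause with `time.sleep`; the answer does not say how the delay was implemented, so the 5-second default and the helper name `parse_after_delay` are assumptions:

```python
import time

from bs4 import BeautifulSoup


def parse_after_delay(driver, delay_seconds=5):
    """Pause, then parse the page source and look up the sensor table.

    `driver` is assumed to be a Selenium WebDriver that has already
    loaded the page; the delay length is a guess, not a known value.
    """
    time.sleep(delay_seconds)  # give the page's scripts time to build the table
    rawpage = driver.page_source  # store the web page only after the pause
    souppage = BeautifulSoup(rawpage, "html.parser")
    return souppage.find("table", attrs={"id": "table_devicesensortable"})
```

A fixed sleep either waits longer than needed or not long enough; the explicit wait shown in the other answers returns as soon as the table is present.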

I hope it helps someone.