Scraping a table with BeautifulSoup in Python

Asked: 2019-12-23 20:14:30

Tags: python html selenium web-scraping beautifulsoup

I want to extract a table from this website after clicking "National Data". I know I could just download the data directly, but I am very interested in web scraping and want to learn. Previously, I used the Selenium package in Python to automate selecting all the checkboxes containing years; now I want to extract the table as a dataframe, paging through to the last page so that every entry ends up in my dataframe. I believe the ID of the table I want to scrape is "DataTables_Table_26".

I tried the following to extract the first table, but it does not work.

html = driver.page_source
soup = BeautifulSoup(html, "lxml")

table = soup.find_all('table', {'id':'DataTables_Table_26'})

Let me know if you need to see the first few lines of my code. Thanks in advance.

This is the part of the HTML code where the table object appears:

2 answers:

Answer 0: (score: 3)

You can use pandas' read_html() to parse each page into a dataframe, then append the per-page frames together as you iterate through the pages.
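As a minimal, self-contained sketch of that approach (the two HTML snippets below are made-up stand-ins for the per-page source), read_html() parses every <table> in a string into a list of DataFrames, and the per-page frames can then be combined:

```python
from io import StringIO

import pandas as pd

# Two hypothetical pages, each containing one small HTML table
page1 = "<table><tr><th>Country</th><th>Rate</th></tr><tr><td>A</td><td>1.0</td></tr></table>"
page2 = "<table><tr><th>Country</th><th>Rate</th></tr><tr><td>B</td><td>2.5</td></tr></table>"

# read_html() returns one DataFrame per <table>; take the first on each page
frames = [pd.read_html(StringIO(html))[0] for html in (page1, page2)]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (2, 2)
```

pd.concat is used to combine the frames because DataFrame.append was removed in pandas 2.0.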

You need to install pandas using pip:

pip install pandas

Use an infinite loop to iterate through the pages while the Next button is available; once it is gone, break out of the loop.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
import time
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://dataunodc.un.org/GSH_app")
driver.maximize_window()

# Switch to the iframe in order to access the app content
WebDriverWait(driver, 15).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, "//iframe[@src='https://unodc.shinyapps.io/GSH_App/']")))
# Click on the National Data link
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, '//ul[@class="nav navbar-nav"]//a[text()="National Data"]'))).click()

# Get all checkboxes that are not yet selected
all_checkboxes = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//input[@name='YearVar' and not(@checked='checked')]")))

for item in all_checkboxes:
    item.click()

WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.display.dataTable.no-footer")))
time.sleep(2)
html = driver.page_source
df = pd.read_html(html)[0]

while True:
    try:
        # On the last page the "Next" button no longer matches this class,
        # so the wait times out and we leave the loop.
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)  # slow down the loop so the next page can load
        html = driver.page_source
        df1 = pd.read_html(html)[0]
        # Append this page's data to the dataframe
        # (DataFrame.append was removed in pandas 2.0, so use pd.concat)
        df = pd.concat([df, df1], ignore_index=True)

    except TimeoutException:
        break

# Write the data to a CSV file to verify all data was collected
df.to_csv("testdata.csv")
print(df)

A snapshot of the CSV file after loading:


Answer 1: (score: 2)

If you can get a direct link to the data, here is a handy table scraper: you just pass in which table you want to scrape.

import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Main function
def getContent(link, filename, whichtable=0):
    result1 = requests.get(link)
    src1 = result1.content
    soup = BeautifulSoup(src1, 'lxml')
    table = soup.find_all('table')[whichtable]
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        for tr in table('tr'):
            row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
            writer.writerow(row)


getContent('https://teams.technion.ac.il/residency-placements/', 'what.csv', whichtable=0)
df2 = pd.read_csv('what.csv')
# or
getContent('https://en.wikipedia.org/wiki/List_of_highest-grossing_films', 'what.csv', whichtable=0)


You can see it's a very handy table scraper, and you should be able to use it on your site if you can get to the page that contains the actual table.

This scraper will read any table on a website as long as you specify which one, since some sites have multiple tables.
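Since the right whichtable index depends on how many tables the page contains, a quick way to inspect them first is to count the <table> elements with BeautifulSoup. This small sketch runs on an inline, made-up two-table page rather than a live site:

```python
from bs4 import BeautifulSoup

# Hypothetical page containing two tables
html = ("<html><body>"
        "<table id='summary'><tr><td>x</td></tr></table>"
        "<table id='detail'><tr><td>y</td></tr></table>"
        "</body></html>")

soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all("table")
print(len(tables))          # 2  -> valid whichtable values are 0 and 1
print(tables[1].get("id"))  # 'detail'
```

Printing each table's id (or its first row) this way makes it easy to decide which index to pass to getContent.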