I want to extract a table from this website after clicking on "National Data". I know I can download it directly, but I am very interested in web scraping and want to learn. Previously I used the Selenium package in Python to automate checking all of the year checkboxes; now I want to extract the table into a dataframe, paging through to the last page so that every entry ends up in my dataframe. I believe the id of the table I want to scrape is "DataTables_Table_26".
I tried the following to extract the first table, but it doesn't work:
from bs4 import BeautifulSoup

html = driver.page_source
soup = BeautifulSoup(html, "lxml")
table = soup.find_all('table', {'id': 'DataTables_Table_26'})
Let me know if you need to see the first few lines of my code. Thanks in advance.
Here is the part of the HTML where the table element appears:
Answer 0 (score: 3)
You can use pandas' read_html() to parse the table, store it in a dataframe, and then append each page's rows as you iterate through the pages.
You need to install pandas with pip:
pip install pandas
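For reference, read_html() parses every table element in the HTML it is given and returns a list of DataFrames, one per table (a minimal sketch):

from io import StringIO
import pandas as pd

html = "<table><tr><th>year</th></tr><tr><td>2017</td></tr></table>"
tables = pd.read_html(StringIO(html))  # one DataFrame per <table> found
print(tables[0])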
Use an infinite loop to step through the pages while the Next button is available; once it is gone, break out of the loop.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from io import StringIO  # newer pandas expects file-like input to read_html for literal HTML
driver = webdriver.Chrome()
driver.get("https://dataunodc.un.org/GSH_app")
driver.maximize_window()
# Switch to the iframe in order to access the link
WebDriverWait(driver,15).until(EC.frame_to_be_available_and_switch_to_it((By.XPATH,"//iframe[@src='https://unodc.shinyapps.io/GSH_App/']")))
# Click on the National Data link
WebDriverWait(driver,5).until(EC.element_to_be_clickable((By.XPATH,'//ul[@class="nav navbar-nav"]//a[text()="National Data"]'))).click()
# Get all checkboxes that are not yet selected
all_checkboxes = WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.XPATH,"//input[@name='YearVar' and not(@checked='checked')]")))
for item in all_checkboxes:
    item.click()

# Wait for the table to render after selecting the years
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.display.dataTable.no-footer")))
time.sleep(2)
html = driver.page_source
df = pd.read_html(StringIO(html))[0]

while True:
    try:
        # Click the Next button; on the last page it disappears and the wait times out
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)  # slow the loop down so the next page's data can load
        html = driver.page_source
        df1 = pd.read_html(StringIO(html))[0]
        # append this page's rows to the dataframe
        # (pd.concat replaces df.append, which was removed in pandas 2.0)
        df = pd.concat([df, df1], ignore_index=True)
    except:
        break

# write the data to a CSV file to verify all rows were captured
df.to_csv("testdata.csv")
print(df)
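If you prefer not to rely on a bare except, a variant of the loop above can catch the timeout explicitly (a sketch reusing driver and df from the script above; Selenium's explicit waits raise TimeoutException when the Next button is no longer present):

from selenium.common.exceptions import TimeoutException

while True:
    try:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
        df = pd.concat([df, pd.read_html(StringIO(driver.page_source))[0]], ignore_index=True)
    except TimeoutException:  # raised once the Next button is gone on the last page
        break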
Snapshot of the CSV file after loading.
Answer 1 (score: 2)
If you can get a link to the data, here is a good table scraper; you just pass in which table you want scraped:
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Main function: fetch the page and write the chosen table to a CSV file
def getContent(link, filename, whichtable=0):
    result1 = requests.get(link)
    src1 = result1.content
    soup = BeautifulSoup(src1, 'lxml')
    table = soup.find_all('table')[whichtable]
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        for tr in table('tr'):
            # pull the text of every cell (td or th) in the row
            row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
            writer.writerow(row)
getContent('https://teams.technion.ac.il/residency-placements/', 'what.csv', whichtable=0)
df2 = pd.read_csv('what.csv')
or
getContent('https://en.wikipedia.org/wiki/List_of_highest-grossing_films', 'what.csv', whichtable=0)
You can see it's a very good table scraper, and you should be able to use it on your site if you can get to the page with the actual table.
As long as you specify which table to read, this scraper will handle any table on a site, since some sites have more than one table.
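As an aside, for static pages like the Wikipedia example, pandas can fetch the same table in one call (a minimal sketch; read_html downloads the page, parses every table on it, and returns a list of DataFrames):

import pandas as pd

# parse every table on the page; index into the list to pick one
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_highest-grossing_films")
tables[0].to_csv('what.csv', index=False)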