I am learning web scraping. Scraping site.
Using Selenium, I can select from a dropdown. I can choose a village from मौजा का नाम चुने.
Then I can click the खाता खोजें button.
This fills in a table at the bottom of the page via JavaScript.
The HTML for the button:
<input type="submit" name="ctl00$ContentPlaceHolder1$BtnSearch" value="खाता खोजें" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$ContentPlaceHolder1$BtnSearch", "", true, "S", "", false, false))" id="ctl00_ContentPlaceHolder1_BtnSearch" style="width:146px;">
Pagination is done via:
javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','Page$11')
I am unable to scrape this table.
Things I have tried:
ctl00_ContentPlaceHolder1_GridView1 is not in the HTML source. I tried a few things, with no luck so far:
#p_element = driver.find_element_by_id(id_='ctl00_ContentPlaceHolder1_GridView1')
p_element = driver.find_element_by_xpath('//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]')
print(p_element.text)
path_for_table='//*[@id="aspnetForm"]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table/tbody/tr/td/table/tbody/tr[4]'
table_list = WebDriverWait(driver, 2).until(lambda driver: driver.find_element_by_xpath(path_for_table))
print(table_list)
Pages I have looked at:
Answer (score 0):
First, let's load the site. I am using BeautifulSoup together with Selenium for scraping.
import bs4 as Bs
from selenium import webdriver
DRIVER_PATH = r'D:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')
Then click on a village name (change this to suit your needs):
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()
Click the खाता खोजें button:
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()
Get the page source into BeautifulSoup:
page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')
Find the element with id ctl00_ContentPlaceHolder1_UpdatePanel2 and find all td elements inside it:
table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")
Get the column cells and extract the text from them:
columns = table_elements[:6]
column_names = [e.text for e in columns]
columns:
[<td>क्रम</td>,
<td>रैयतधारी का नाम</td>,
<td style="white-space:nowrap;">पिता/पति का नाम</td>,
<td>खाता संख्या</td>,
<td>खेसरा संख्या</td>,
<td>अधिकार<br/>अभिलेख</td>]
column_names:
['क्रम',
'रैयतधारी का नाम',
'पिता/पति का नाम',
'खाता संख्या',
'खेसरा संख्या',
'अधिकारअभिलेख']
Next, get the body of the table:
body_of_table = table_elements[6:-4]
Then create chunks of 6 columns for each entry and extract the text:
chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]
data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]
data:
[['1', 'अरूण कुमार', 'शिवलाल पासवान', '55', '406', 'देखें'],
['2', 'इन्द्रदेव प्रसाद', '\xa0', '98', '789', 'देखें'],
['3', 'ईश्वर मांझी', 'चमारी मांझी', '78', '42', 'देखें'],
['4', 'कवलसिया देवी', 'तुलसी मांझी', '120', '41', 'देखें'],
['5', 'कामदेव पांडे', 'शिवदानी पांडे', '210', '457, 459, 461, 474', 'देखें'],
['6', 'कामेश्वर मांझी', 'उत्ती मांझी', '78', '43', 'देखें'],
['7', 'कारू मांझी', 'राधे मांझी', '78', '42', 'देखें'],
['8', 'कारू मांझी', 'मेघन मांझी', '78', '42', 'देखें'],
['9', 'कौशल्या देवी', 'केदार महतो', '253', '757', 'देखें'],
['10', 'गणेश साव', 'छेदी साव', '156', '236', 'देखें'],
....
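The chunking step works on any flat list of cells whose length is a multiple of the row width. A standalone sketch with dummy values (the names here are illustrative, not taken from the site):

```python
# Split a flat list of cells into rows of 6 columns each,
# mirroring the `chunks` list comprehension above.
flat_cells = ['1', 'name_a', 'father_a', '55', '406', 'link',
              '2', 'name_b', 'father_b', '98', '789', 'link']
rows = [flat_cells[x:x + 6] for x in range(0, len(flat_cells), 6)]
print(rows[0])  # → ['1', 'name_a', 'father_a', '55', '406', 'link']
```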
Now import pandas and use it to create a DataFrame from the list of lists:
import pandas as pd
df = pd.DataFrame(data, columns = column_names)
# set क्रम as index
df = df.set_index(df.columns[0])
Final result:
import time # using time.sleep for illustration only. You should use explicit wait
import bs4 as Bs
import pandas as pd
from selenium import webdriver
DRIVER_PATH = r'D:\chromedriver.exe'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('http://lrc.bih.nic.in/ViewRor.aspx?DistCode=36&SubDivCode=2&CircleCode=9')
time.sleep(4)
#click on a village name
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView2"]/tbody/tr[3]/td[1]').click()
time.sleep(2)
# click on खाता खोजें
driver.find_element_by_xpath('//*[@id="ctl00_ContentPlaceHolder1_BtnSearch"]').click()
time.sleep(2)
# ----------- table extracting part ------------------
# get page source
page_src = Bs.BeautifulSoup(driver.page_source, 'html.parser')
# find the id: ctl00_ContentPlaceHolder1_UpdatePanel2 and find all tds in it
table_elements = page_src.find("div",{"id":"ctl00_ContentPlaceHolder1_UpdatePanel2"}).find_all("td")
# get columns and get the text out of them
columns = table_elements[:6]
column_names = [e.text for e in columns]
# get the body of the table
body_of_table = table_elements[6:-4]
# create chunks of 6 columns for each entry
chunks = [body_of_table[x:x+6] for x in range(0, len(body_of_table), 6)]
# get the text out
data = [[e.text.strip('\n') for e in chunk] for chunk in chunks]
df = pd.DataFrame(data, columns = column_names)
# set क्रम as index
df = df.set_index(df.columns[0])
print(df)
...
To go to the next page:
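The answer ends here, but the pagination link shown in the question (`javascript:__doPostBack('ctl00$ContentPlaceHolder1$GridView1','Page$11')`) can be triggered from Selenium with `driver.execute_script`. A minimal sketch, assuming the same GridView target; the helper name `dopostback_js` is hypothetical:

```python
def dopostback_js(target, argument):
    # Build the ASP.NET __doPostBack(...) call that the GridView
    # pagination links invoke (see the javascript: link in the question).
    return "__doPostBack('{0}','{1}')".format(target, argument)

# Usage with the `driver` from the code above (not run here):
# driver.execute_script(dopostback_js('ctl00$ContentPlaceHolder1$GridView1', 'Page$2'))
# time.sleep(2)  # or an explicit wait, then re-parse driver.page_source
```

After executing the postback, the table in the UpdatePanel is replaced, so the same extraction code can be re-run on the fresh `driver.page_source` for each page.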