Website to scrape: https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm?pmju_kod=8898&proj_kod_Fasa=1
Items to scrape are marked in bold - Part 1 (HTML below)
<form onsubmit="return lucee_form_c9u.check();" name="myForm" enctype="multipart/form-data" action="mPPTProjek3.cfm?mn=BPPT" method="post">
<div align="center" style="background-color: white; border: 1px solid grey;">
<br />
<table class="MainContent" width="100%" align="center">
<tbody>
<tr style="font-weight: bold;">
<td class="column" width="30%">Nama Pemaju</td>
<td>
:
<a style="color: blue;" href="maklumatPemaju.cfm?pmju_Kod=8877">**RAPID UNITY SDN. BHD.**</a>
<font color="red">* Klik Untuk Melihat Maklumat</font>
</td>
</tr>
<tr>
<td class="column">Kod Pemaju</td>
<td>: **8877**</td>
</tr>
<tr>
<td class="column">Kod Fasa</td>
<td>: **1**</td>
</tr>
<tr>
<td class="column">Nama Pemajuan</td>
<td>: **TAMAN UNITY**</td>
</tr>
</tbody>
</table>
</div>
</form>
Items to scrape are marked in bold - Part 2
This part needs the Selenium driver, because a row has to be clicked first:
<tr align="center" onclick="change3('15536',this)" style="cursor:pointer" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'">
Only after that click does the 95% next to the "name:myForm" comment appear.
<tr align="center" onclick="change3('15536',this)" style="cursor:pointer" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'">
Clicking this row changes the 95% to a different value.
(HTML below)
<fieldset title="Maklumat Pemajuan Projek" style="border: 1px solid grey; font-weight: bold; color: black;">
<legend>Maklumat Pemajuan Projek</legend>
<table class="MainContent" width="100%" align="center">
<thead>
<tr class="column">
<th>Bil</th>
<th>Bil Unit</th>
<th>
Jenis<br />
Rumah
</th>
<th>
Kategori<br />
Rumah
</th>
<th>Tingkat</th>
<th>
Harga<br />
Min (RM)
</th>
<th>
Harga<br />
Max (RM)
</th>
</tr>
</thead>
<tbody>
<tr align="center" onclick="change2('15535',this)" style="cursor: pointer;" bgcolor="white" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='white'">
<td>**1**</td>
<td>**2**</td>
<td align="left">
**RUMAH BERKEMBAR**
</td>
<td>
**HARGA TINGGI**
</td>
<td>**1**</td>
<td align="right">**370,000.00**</td>
<td align="right">**394,900.00**</td>
</tr>
<tr align="center" onclick="change3('15536',this)" style="cursor: pointer;" bgcolor="DAEEF3" onmouseover="this.bgColor='#FF9900'" onmouseout="this.bgColor='DAEEF3'">
<td>**2**</td>
<td>**18**</td>
<td align="left">
**RUMAH TERES**
</td>
<td>
**HARGA TINGGI**
</td>
<td>**1**</td>
<td align="right">**190,000.00**</td>
<td align="right">**290,550.00**</td>
</tr>
</tbody>
</table>
<br />
<input name="rekid3" id="rekid3" type="hidden" value="15535" />
<div id="pemajuan">
<script language="JavaScript" type="text/javascript" src="/lucee/formtag-form.cfm"></script>
<script language="JavaScript" type="text/javascript">
function _CF_checkmyForm() {
return lucee_form_czz.check();
}
</script>
<table class="MainContent" width="100%" align="center">
<tbody>
<tr>
<td class="column" width="30%">Jenis Rumah</td>
<td>: RUMAH BERKEMBAR</td>
</tr>
<tr>
<td class="column">Kategori Rumah</td>
<td>: HARGA TINGGI</td>
</tr>
<tr>
<td class="column">Bil Tingkat</td>
<td>: 1</td>
</tr>
<tr>
<td class="column">Bil Unit</td>
<td>: 2</td>
</tr>
<tr>
<td class="column">Harga Minimum</td>
<td>: 370,000.00</td>
</tr>
<tr>
<td class="column">Harga Maximum</td>
<td>: 394,900.00</td>
</tr>
<tr>
<td class="column">Peratusan Pemajuan</td>
<td>: **95%**</td>
</tr>
</tbody>
</table>
<!-- name:myForm -->
<script>
lucee_form_czz = new LuceeForms("myForm", null);
</script>
</div>
</fieldset>
Below is my code. Believe me, this is all I have managed to write after several weeks. Please help, because I don't know how to continue.
from selenium import webdriver
from selenium.webdriver.common.by import By
url = "https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm?pmju_kod=8898&proj_kod_Fasa=1"
driver = webdriver.Chrome(executable_path='/Users/freddielee/Downloads/chromedriver')
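driver.get(url)  # load the page before looking anything up
driver.find_element(By.NAME, "need help here")  # placeholder: this is where I am stuck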
Answer 0 (score: 1)
I think you can get what you want with just requests and beautifulsoup, like this:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
params = {"pmju_Kod" : 8877, "proj_Kod_Fasa" : 1}
r = s.get("https://idaman.kpkt.gov.my/idv5xe/98_eHome/maklumatProjek.cfm", params=params)
soup = BeautifulSoup(r.content, "html.parser")
tables = soup.find_all('table', class_="MainContent")
items = []
# First table: developer name (link text), then developer code, phase code and project name
items.append(tables[0].a.text)
data = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
items.append(data[1][1].strip(': '))
items.append(data[2][1].strip(': '))
items.append(data[3][1].strip(': '))
data = [[td.text for td in tr.find_all('td')] for tr in tables[3].find_all('tr')]
# Rows 1 and 2 are the two house types; columns 2-6 are type, category, floors, min price, max price
for row in data[1:3]:
    items.extend([row[2].strip(), row[3].strip(), row[4], row[5], row[6]])
# Pemajuan table
params['rekid'] = 419975503
r2 = s.get('https://idaman.kpkt.gov.my/idv5xe/98_eHome/template/pemajuan.cfm', params=params)
soup2 = BeautifulSoup(r2.content, "html.parser")
table = soup2.find('table', class_="MainContent")
data = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]
items.append(data[-1][1].strip(': '))
print(items)
This gives you the following items:
['RAPID UNITY SDN. BHD.', '8877', '1', 'TAMAN UNITY', 'RUMAH BERKEMBAR', 'HARGA TINGGI', '1', '370,000.00', '394,900.00', 'RUMAH TERES', 'HARGA TINGGI', '1', '190,000.00', '290,550.00', '0%']
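The code above hard-codes `rekid`, while the rows in the question carry ids in their click handlers (`change2('15535',this)`, `change3('15536',this)`). Here is a minimal sketch of pulling those ids out of the page and of turning the price and percentage strings into numbers. The helper names `extract_row_ids` and `to_number` are made up for illustration, and the sample HTML is trimmed from the question. Whether `pemajuan.cfm` accepts these ids as its `rekid` parameter is an assumption I have not verified.

```python
import re

# Two rows trimmed from the question's HTML (mouseover attributes dropped).
SAMPLE_ROWS = """
<tr align="center" onclick="change2('15535',this)" style="cursor: pointer;" bgcolor="white">
<td>1</td><td>2</td>
</tr>
<tr align="center" onclick="change3('15536',this)" style="cursor: pointer;" bgcolor="DAEEF3">
<td>2</td><td>18</td>
</tr>
"""

# Matches onclick="change2('15535',this)", onclick="change3('15536',this)", ...
# and captures the numeric id passed to the handler.
ONCLICK_ID = re.compile(r'''onclick="change\d+\('(\d+)',\s*this\)"''')


def extract_row_ids(html):
    """Return the ids passed to the changeN(...) click handlers, in page order."""
    return ONCLICK_ID.findall(html)


def to_number(text):
    """Convert cell text such as ': 370,000.00' or ': 95%' to a float."""
    cleaned = text.strip().lstrip(":").strip().rstrip("%").replace(",", "")
    return float(cleaned)


row_ids = extract_row_ids(SAMPLE_ROWS)
print(row_ids)
print(to_number(": 370,000.00"), to_number(": 95%"))
```

If the assumption holds, each id in `row_ids` could be tried as `params['rekid']` in the second request instead of a fixed number, which would give one percentage per row.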