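I want to scrape a table from a particular web page. The problem is that some of the table's td cells contain a nested span tag, which holds another nested table.
The web page I am scraping is the following: Click here.
I have provided a small sample of the table to be scraped, along with the nested tables contained in the span tags with the class tooltip-icon. How can I exclude the contents of these particular span tags when scraping the entire table?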
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
<table>
<tbody>
<tr>
<td>DHANENDRA SAHU</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Assembly Election Result 2013</h3>
<table>
<tbody>
<tr>
<td>Party</td>
<td>:</td>
<td>Indian National Congress</td>
</tr>
<tr>
<td>Result</td>
<td>:</td>
<td>WON</td>
</tr>
<tr>
<td>Margin</td>
<td>:</td>
<td>8354</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Indian National Congress</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>68</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="left">CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA</td>
<td align="left">
<table>
<tbody>
<tr>
<td>Bharatiya Janata Party</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Current Assembly Election Result</h3>
<table>
<tbody>
<tr>
<td>Leading In</td>
<td>:</td>
<td>0</td>
</tr>
<tr>
<td>Won In</td>
<td>:</td>
<td>15</td>
</tr>
<tr>
<td>Trailing In</td>
<td>:</td>
<td>0</td>
</tr>
</tbody>
</table>
</div>
</td>
</tr>
</tbody>
</table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
<td align="center" style="background-color: lightgray;">DHANENDRA SAHU</td>
<td align="center" style="background-color: lightgray;">Indian National Congress</td>
<td align="center" style="background-color: lightgray;">8354</td>
I have also included the full Python script I am currently using to scrape the table. I have managed to scrape the whole table, but I cannot exclude the nested span and table contents.
The output I currently get in CSV format is shown below (a single sample row from the full output). As the "iAssembly Election Result" text in column 3 shows, the contents of the span tag are scraped as well.
Abhanpur,53,DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,DHANENDRA SAHU,iAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,Party,:,Indian National Congress,Result,:,WON,Margin,:,8354,Indian National CongressiCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Indian National Congress,iCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Leading In,:,0,Won In,:,68,Trailing In,:,0,CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA,Bharatiya Janata PartyiCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Bharatiya Janata Party,iCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Leading In,:,0,Won In,:,15,Trailing In,:,0,23471 ,Result Declared,DHANENDRA SAHU,Indian National Congress,8354,
The expected result is the table scraped without the span tags and their nested tables. For example:
Abhanpur, 53 , DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party , 23471, Result Declared
Any help with this would be greatly appreciated. Thank you.
Answer 0 (score: 1)
You can do this with pandas:
import pandas as pd
page = pd.read_html('http://eciresults.nic.in/Statewises26.htm')
my_table = page[5]
I believe this gives you a pandas DataFrame containing the table you are interested in. If you try:
my_table.iloc[[7]]
the output is:
7 Abhanpur 53 DHANENDRA SAHUiAssembly Election Result 2013Pa... Indian National CongressiCurrent Assembly Elec... CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA Bharatiya Janata PartyiCurrent Assembly Electi... 23471 Result Declared DHANENDRA SAHU Indian National Congress 8354 NaN NaN
If this is what you are after, you can then clean up the table with standard pandas methods.
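One way the cleanup mentioned above could look is sketched below. It is not part of the original answer. It assumes every tooltip begins with the icon letter "i" followed by a heading such as "Assembly Election Result 2013" or "Current Assembly Election Result", as in the sample row from the question. The names TOOLTIP_TEXT and strip_tooltip are made up for this illustration, and the pattern is shown on plain strings so it can be checked without fetching the page.

```python
import re

# Assumption: the tooltip text always starts with the icon letter "i"
# followed by one of the two headings seen in the question's sample row.
TOOLTIP_TEXT = re.compile(r"i(?:Current )?Assembly Election Result.*", re.DOTALL)

def strip_tooltip(cell):
    """Return the cell text with the trailing tooltip text removed."""
    return TOOLTIP_TEXT.sub("", cell).strip()

# Cell values copied from the CSV row shown in the question.
cells = [
    "DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354",
    "Indian National CongressiCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0",
    "CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA",
]
cleaned = [strip_tooltip(c) for c in cells]
print(cleaned)
```

On a DataFrame, the same compiled pattern could be passed to a text column's `str.replace(TOOLTIP_TEXT, "", regex=True)`. Cells without a tooltip pass through unchanged.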
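Answer 1 (score: 1)
This is just my preference, but whenever I see <table> tags I parse them with pandas and then manipulate the DataFrame as needed. It also lets you write to a file in one line: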
import pandas as pd

url_list = [1, 2, 3, 4, 5, 6, 7, 8]

def clean(df, cols):
    # Name the columns, keep only constituency rows, and cut the tooltip text off each cell.
    df.columns = cols
    df = df[df['Const. No.'].notnull()]
    df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
    df = df.dropna(axis=1, how='all')
    # Candidate names are upper case, so the first lowercase 'i' is the tooltip icon text.
    df['Leading Candidate'] = df['Leading Candidate'].str.split('i', expand=True)[0]
    df['Leading Party'] = df['Leading Party'].str.split('iCurrent', expand=True)[0]
    df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent', expand=True)[0]
    df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly', expand=True)[0]
    return df

# First page: the header row sits directly above the pagination row.
url = 'http://eciresults.nic.in/Statewises26.htm'
dfs = pd.read_html(url)
df = dfs[0]
idx = df[df[0] == '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'].index[0]
cols = list(df.iloc[idx-1, :])
frames = [clean(df, cols)]

# The remaining pages reuse the column names taken from the first page.
for x in url_list:
    url = 'http://eciresults.nic.in/Statewises26%s.htm' % x
    print('Processed %s' % url)
    dfs = pd.read_html(url)
    df = clean(dfs[0], cols)
    frames.append(df)

# DataFrame.append was removed in pandas 2.0, so concatenate the pages instead.
results_df = pd.concat(frames, ignore_index=True)
results_df.to_csv('Chhattisgarh_cand.csv', index=False)
Output:
print (df.to_string())
   Constituency  Const. No.  Leading Candidate       Leading Party                    Trailing Candidate                    Trailing Party            Margin  Status           Winning Candidate          Winning Party             Margin
0  Abhanpur      53          DHANENDRA SAHU          Indian National Congress         CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA  Bharatiya Janata Party    23471   Result Declared  DHANENDRA SAHU             Indian National Congress  8354
1  Ahiwara       67          GURU RUDRA KUMAR        Indian National Congress         RAJMAHANT SANWLA RAM DAHRE            Bharatiya Janata Party    31687   Result Declared  RAJMAHNT SANWLA RAM DAHRE  Bharatiya Janata Party    31676
2  Akaltara      33          SAURABH SINGH           Bharatiya Janata Party           RICHA JOGI                            Bahujan Samaj Party       1854    Result Declared  CHUNNILAL SAHU             Indian National Congress  21693
3  Ambikapur     10          T.S. BABA               Indian National Congress         ANURAG SINGH DEO                      Bharatiya Janata Party    39624   Result Declared  T.S.BABA                   Indian National Congress  19558
4  Antagarh      79          ANOOP NAG               Indian National Congress         VIKRAM USENDI                         Bharatiya Janata Party    13414   Result Declared  VIKRAM USENDI              Bharatiya Janata Party    5171
5  Arang         52          DR. SHIVKUMAR DAHARIYA  Indian National Congress         SANJAY DHIDHI                         Bharatiya Janata Party    25077   Result Declared  NAVEEN MARKANDEY           Bharatiya Janata Party    13774
6  Baikunthpur   3           AMBICA SINGH DEO        Indian National Congress         BHAIYALAL RAJWADE                     Bharatiya Janata Party    5339    Result Declared  BHAIYALAL RAJWADE          Bharatiya Janata Party    1069
7  Balodabazar   45          PRAMOD KUMAR SHARMA     Janta Congress Chhattisgarh (J)  JANAK RAM VERMA                       Indian National Congress  2129    Result Declared  JANAK RAM VERMA            Indian National Congress  9977
8  Basna         40          DEVENDRA BAHADUR SINGH  Indian National Congress         SAMPAT AGRAWAL                        Independent               17508   Result Declared  RUPKUMARI CHOUDHARY        Bharatiya Janata Party    6239
9  Bastar        85          BAGHEL LAKHESHWAR       Indian National Congress         DR. SUBHAU KASHYAP                    Bharatiya Janata Party    33471   Result Declared  BAGHEL LAKHESHWAR          Indian National Congress  19168
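The string splitting above depends on the wording of the tooltip headings. An alternative that may be worth considering is to drop the tooltip elements while parsing the HTML, so their text never reaches the cells. The following is only a minimal sketch using the standard library's html.parser, not the code from this answer. It assumes the markup matches the sample row in the question (a span with class tooltip-icon followed by a div with class tooltip) and that the row is fed in as a fragment whose outermost tr is the data row. The names TooltipFreeRowParser and scrape_rows are made up for this illustration.

```python
from html.parser import HTMLParser

class TooltipFreeRowParser(HTMLParser):
    """Collect the text of each top-level <td>, ignoring tooltip markup."""

    # (tag, class) pairs whose whole subtree is dropped; taken from the sample row.
    SKIP = {("span", "tooltip-icon"), ("div", "tooltip")}

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per top-level <tr>
        self._tr_depth = 0
        self._td_depth = 0
        self._skip_tag = None   # tag name of the subtree currently being skipped
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if self._skip_tag:
            if tag == self._skip_tag:
                self._skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if any((tag, c) in self.SKIP for c in classes):
            self._skip_tag, self._skip_depth = tag, 1
            return
        if tag == "tr":
            self._tr_depth += 1
            if self._tr_depth == 1:
                self.rows.append([])
        elif tag == "td":
            self._td_depth += 1
            if self._td_depth == 1 and self.rows:
                self.rows[-1].append("")

    def handle_endtag(self, tag):
        if self._skip_tag:
            if tag == self._skip_tag:
                self._skip_depth -= 1
                if self._skip_depth == 0:
                    self._skip_tag = None
            return
        if tag == "tr":
            self._tr_depth -= 1
        elif tag == "td":
            self._td_depth -= 1

    def handle_data(self, data):
        # Text inside nested tables is added to the enclosing top-level cell.
        if self._skip_tag is None and self._td_depth >= 1 and self.rows and self.rows[-1]:
            self.rows[-1][-1] += data

def scrape_rows(html):
    parser = TooltipFreeRowParser()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by the nested markup.
    return [[" ".join(cell.split()) for cell in row] for row in parser.rows]

# Shortened copy of the sample row from the question.
sample = """
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
<table><tbody><tr>
<td>DHANENDRA SAHU</td>
<td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
<div class="tooltip">
<h3>Assembly Election Result 2013</h3>
<table><tbody>
<tr><td>Party</td><td>:</td><td>Indian National Congress</td></tr>
</tbody></table>
</div>
</td>
</tr></tbody></table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
</tr>
"""
print(scrape_rows(sample))
```

Because the tooltip subtree is skipped as a whole, this does not rely on candidate names being upper case or on the exact heading text. On the real page the data rows may sit inside further layout tables, in which case the depth checks would need adjusting.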