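How to scrape a table from a web page while excluding specific tables nested inside table tags

Time: 2019-05-02 10:07:36

Tags: html python-3.x web-scraping beautifulsoup

I want to scrape a table from a particular web page. The problem is that some of the table's td cells contain a nested span tag, which comes with another nested table.

The web page I want to scrape is the following: Click here

Below is a small sample of the table to be scraped, including the nested tables that come with the span tags of class tooltip-icon. When scraping the whole table, how can I exclude the content inside these specific span tags?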

<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>DHANENDRA SAHU</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Assembly Election Result 2013</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Party</td>
                                    <td>:</td>
                                    <td>Indian National Congress</td>
                                </tr>
                                <tr>
                                    <td>Result</td>
                                    <td>:</td>
                                    <td>WON</td>
                                </tr>
                                <tr>
                                    <td>Margin</td>
                                    <td>:</td>
                                    <td>8354</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>Indian National Congress</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Current Assembly Election Result</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Leading In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                                <tr>
                                    <td>Won In</td>
                                    <td>:</td>
                                    <td>68</td>
                                </tr>
                                <tr>
                                    <td>Trailing In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="left">CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA</td>
<td align="left">
    <table>
        <tbody>
            <tr>
                <td>Bharatiya Janata Party</td>
                <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
                    <div class="tooltip">
                        <h3>Current Assembly Election Result</h3>
                        <table>
                            <tbody>
                                <tr>
                                    <td>Leading In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                                <tr>
                                    <td>Won In</td>
                                    <td>:</td>
                                    <td>15</td>
                                </tr>
                                <tr>
                                    <td>Trailing In</td>
                                    <td>:</td>
                                    <td>0</td>
                                </tr>
                            </tbody>
                        </table>
                    </div>
                </td>
            </tr>
        </tbody>
    </table>
</td>
<td align="right">23471 </td>
<td align="center">Result Declared</td>
<td align="center" style="background-color: lightgray;">DHANENDRA SAHU</td>
<td align="center" style="background-color: lightgray;">Indian National Congress</td>
<td align="center" style="background-color: lightgray;">8354</td>
</tr>

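I have also included the complete Python script I am currently using to scrape the table. I can scrape the whole table successfully, but I cannot exclude the nested span and table content.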

full scrapper code here

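The output I currently get in CSV format is shown below (one sample row from the full output). As column 3 shows, for example in "iAssembly Election Result", the content of the span tags is scraped as well.
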
Abhanpur,53,DHANENDRA SAHUiAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,DHANENDRA SAHU,iAssembly Election Result 2013Party:Indian National CongressResult:WONMargin:8354,Party,:,Indian National Congress,Result,:,WON,Margin,:,8354,Indian National CongressiCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Indian National Congress,iCurrent Assembly Election ResultLeading In:0Won In:68Trailing In:0,Leading In,:,0,Won In,:,68,Trailing In,:,0,CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA,Bharatiya Janata PartyiCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Bharatiya Janata Party,iCurrent Assembly Election ResultLeading In:0Won In:15Trailing In:0,Leading In,:,0,Won In,:,15,Trailing In,:,0,23471                                             ,Result Declared,DHANENDRA SAHU,Indian National Congress,8354,

The expected result is the scraped table without the span tags and their nested tables. For example:

Abhanpur, 53 , DHANENDRA SAHU, Indian National Congress, CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA, Bharatiya Janata Party , 23471, Result Declared 

Any help with this would be greatly appreciated. Thanks.

2 answers:

Answer 0: (score: 1)

You can do this with pandas:

import pandas as pd
page = pd.read_html('http://eciresults.nic.in/Statewises26.htm')
my_table = page[5]

I believe this gives you a pandas DataFrame containing the table you are interested in. If you try:

my_table.iloc[[7]]

The output is:

7   Abhanpur    53  DHANENDRA SAHUiAssembly Election Result 2013Pa...   Indian National CongressiCurrent Assembly Elec...   CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA    Bharatiya Janata PartyiCurrent Assembly Electi...   23471   Result Declared     DHANENDRA SAHU  Indian National Congress    8354    NaN     NaN

If you go this route, you can clean up the table with standard pandas methods.
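
The cleanup step could look like the following. This is a minimal sketch, not the answerer's code: it uses a toy frame in place of my_table, the column names and the helper strip_tooltips are hypothetical, and it assumes every tooltip starts with the "i" icon followed by one of the two headings seen in the sample HTML.

```python
import pandas as pd

# Toy frame standing in for my_table; the column names are assumed for illustration.
my_table = pd.DataFrame({
    'Constituency': ['Abhanpur'],
    'Leading Candidate': [
        'DHANENDRA SAHUiAssembly Election Result 2013'
        'Party:Indian National CongressResult:WONMargin:8354'
    ],
    'Leading Party': [
        'Indian National CongressiCurrent Assembly Election Result'
        'Leading In:0Won In:68Trailing In:0'
    ],
})

# Assumption: tooltip text always begins with the "i" icon followed by
# "Assembly Election Result" or "Current Assembly Election Result".
TOOLTIP_PATTERN = r'i(?:Current )?Assembly Election Result.*$'


def strip_tooltips(df, columns):
    """Return a copy of df with the tooltip text removed from the given columns."""
    cleaned = df.copy()
    for col in columns:
        cleaned[col] = (
            cleaned[col]
            .str.replace(TOOLTIP_PATTERN, '', regex=True)
            .str.strip()
        )
    return cleaned


cleaned = strip_tooltips(my_table, ['Leading Candidate', 'Leading Party'])
print(cleaned.to_string(index=False))
```

Anchoring the pattern on the full heading rather than on a bare "i" keeps names that contain a lowercase "i" intact.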

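Answer 1: (score: 1)

This is just my preference, but whenever I see <table> tags, I parse them with pandas and then manipulate the DataFrame as needed. It also lets you write the result to a file in a single line: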

import pandas as pd

url_list = [1,2,3,4,5,6,7,8]
url = 'http://eciresults.nic.in/Statewises26.htm'


def clean_page(df, cols):
    # Apply the header names, then keep only rows whose 'Const. No.' is a number
    df.columns = cols
    df = df[df['Const. No.'].notnull()]
    df = df.loc[df['Const. No.'].str.isdigit()].reset_index(drop=True)
    df = df.dropna(axis=1,how='all')

    # Cut each cell where the tooltip text (the "i" icon plus heading) begins.
    # Splitting 'Leading Candidate' on a bare 'i' relies on names being upper case.
    df['Leading Candidate'] = df['Leading Candidate'].str.split('i',expand=True)[0]
    df['Leading Party'] = df['Leading Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Party'] = df['Trailing Party'].str.split('iCurrent',expand=True)[0]
    df['Trailing Candidate'] = df['Trailing Candidate'].str.split('iAssembly',expand=True)[0]
    return df


dfs = pd.read_html(url)
df = dfs[0]

# The header row sits directly above the pager row ("1 2 3 ... Next >>")
idx = df[df[0] == '1\xa02\xa03\xa04\xa05\xa06\xa07\xa08\xa09\xa0Next >>'].index[0]
cols = list(df.iloc[idx-1,:])

df = clean_page(df, cols)

# Collect each page's frame and concatenate once at the end
frames = [df]

for x in url_list:
    url = 'http://eciresults.nic.in/Statewises26%s.htm' %x
    print ('Processed %s' %url)
    dfs = pd.read_html(url)
    df = clean_page(dfs[0], cols)
    frames.append(df)

results_df = pd.concat(frames).reset_index(drop=True)
results_df.to_csv('Chhattisgarh_cand.csv', index=False)

Output:

print (df.to_string())
  Constituency Const. No.       Leading Candidate                    Leading Party                    Trailing Candidate            Trailing Party Margin           Status          Winning Candidate             Winning Party Margin
0     Abhanpur         53          DHANENDRA SAHU         Indian National Congress  CHANDRASHEKHAR SAHU - CHAMPU BHAIYYA    Bharatiya Janata Party  23471  Result Declared             DHANENDRA SAHU  Indian National Congress   8354
1      Ahiwara         67        GURU RUDRA KUMAR         Indian National Congress            RAJMAHANT SANWLA RAM DAHRE    Bharatiya Janata Party  31687  Result Declared  RAJMAHNT SANWLA RAM DAHRE    Bharatiya Janata Party  31676
2     Akaltara         33           SAURABH SINGH           Bharatiya Janata Party                            RICHA JOGI       Bahujan Samaj Party   1854  Result Declared             CHUNNILAL SAHU  Indian National Congress  21693
3    Ambikapur         10               T.S. BABA         Indian National Congress                      ANURAG SINGH DEO    Bharatiya Janata Party  39624  Result Declared                   T.S.BABA  Indian National Congress  19558
4     Antagarh         79               ANOOP NAG         Indian National Congress                         VIKRAM USENDI    Bharatiya Janata Party  13414  Result Declared              VIKRAM USENDI    Bharatiya Janata Party   5171
5        Arang         52  DR. SHIVKUMAR DAHARIYA         Indian National Congress                         SANJAY DHIDHI    Bharatiya Janata Party  25077  Result Declared           NAVEEN MARKANDEY    Bharatiya Janata Party  13774
6  Baikunthpur          3        AMBICA SINGH DEO         Indian National Congress                     BHAIYALAL RAJWADE    Bharatiya Janata Party   5339  Result Declared          BHAIYALAL RAJWADE    Bharatiya Janata Party   1069
7  Balodabazar         45     PRAMOD KUMAR SHARMA  Janta Congress Chhattisgarh (J)                       JANAK RAM VERMA  Indian National Congress   2129  Result Declared            JANAK RAM VERMA  Indian National Congress   9977
8        Basna         40  DEVENDRA BAHADUR SINGH         Indian National Congress                        SAMPAT AGRAWAL               Independent  17508  Result Declared        RUPKUMARI CHOUDHARY    Bharatiya Janata Party   6239
9       Bastar         85       BAGHEL LAKHESHWAR         Indian National Congress                    DR. SUBHAU KASHYAP    Bharatiya Janata Party  33471  Result Declared          BAGHEL LAKHESHWAR  Indian National Congress  19168
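
As an alternative to cleaning the text after pandas has parsed it, the tooltip markup can be removed from the HTML before any text is extracted. The following is a minimal sketch with BeautifulSoup, not the asker's script: the HTML is a shortened copy of the sample row, the enclosing table element is an assumption added so the fragment parses as a table, and the variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample row; the outer <table> wrapper is assumed.
html = """
<table>
<tr style="font-size:12px;">
<td align="left">Abhanpur</td>
<td align="center">53</td>
<td align="left">
    <table><tbody><tr>
        <td>DHANENDRA SAHU</td>
        <td style="vertical-align:top"><span class="tooltip-icon" style="display:block">i</span>
            <div class="tooltip">
                <h3>Assembly Election Result 2013</h3>
                <table><tbody>
                    <tr><td>Party</td><td>:</td><td>Indian National Congress</td></tr>
                    <tr><td>Result</td><td>:</td><td>WON</td></tr>
                </tbody></table>
            </div>
        </td>
    </tr></tbody></table>
</td>
<td align="right">23471 </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

# Drop the tooltip icon and the tooltip body (including its nested table).
for node in soup.select('span.tooltip-icon, div.tooltip'):
    node.decompose()

rows = []
for tr in soup.find_all('tr'):
    # Rows of the remaining wrapper tables live inside a td; skip them.
    if tr.find_parent('td') is not None:
        continue
    # Only the row's own cells, not the cells of tables nested inside them.
    cells = tr.find_all('td', recursive=False)
    rows.append([td.get_text(strip=True) for td in cells])

print(rows)
```

Each inner list holds one outer row and can be passed to csv.writer. Removing the nodes first means the cell text no longer has to be split on tooltip headings afterwards.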