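How to extract values from a nested table with BeautifulSoup

Time: 2019-10-22 02:43:12

Tags: python html beautifulsoup

I want to extract some values from a nested table in an HTML page. I tried BeautifulSoup, but without success. Any help would be appreciated.

The HTML looks like this: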

<table style="border-width:0px;width:100%;">
                <tr valign="middle">
                    <td style="width:400px;"><span><span style='font-size: 12px;'>Race 1</span><br /><br /></span><span><span style='font-size: 12px;'><strong>Grade:</strong>&nbsp;&nbsp;M&nbsp;&nbsp;&nbsp;400 metres</span>
                        <br /></span>
                        <span><span style='font-size: 12px;'><strong>Prize Money:</strong> $1180</span>&nbsp;&nbsp;&nbsp;$825 - $235 - $120<br /><br /></span>
                        <table>
                        <tr valign="middle">
                            <td style="width:105px;"><span>Race Time:</span></td><td align="left" style="width:50px;"><span>(8.44)</span></td><td align="left" style="width:50px;"><span>(0.00)</span></td><td align="left" style="width:50px;"><span>(22.95)</span></td><td></td>
                        </tr><tr valign="middle">
                            <td style="width:105px;"><span>Sectional Time:</span></td><td align="left" style="width:50px;"><span>8.44</span></td><td align="left" style="width:50px;"><span>0.00</span></td><td align="left" style="width:50px;"><span>14.51</span></td><td></td>
                        </tr><tr valign="middle">
                            <td style="width:150px;"><span>1<sup>st</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' />&nbsp;<img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' />&nbsp;<img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' />&nbsp;<img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' />&nbsp;<img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' />&nbsp;</span></td>
                        </tr><tr valign="middle">
                            <td><span>2<sup>nd</sup> In-Running Position:</span></td><td colspan="4"><span><img src='/Images/BoxNumber1_s.gif' width='20px' alt='1' />&nbsp;<img src='/Images/BoxNumber5_s.gif' width='20px' alt='5' />&nbsp;<img src='/Images/BoxNumber2_s.gif' width='20px' alt='2' />&nbsp;<img src='/Images/BoxNumber7_s.gif' width='20px' alt='7' />&nbsp;<img src='/Images/BoxNumber4_s.gif' width='20px' alt='4' />&nbsp;</span></td>
                        </tr>
                        </table>                        </td>
                    <td class="ResultsPageRightColumn" valign="bottom"></td>
                </tr>
            </table>

My expected result was shown in an image (not reproduced here).

Many thanks!

1 answer:

Answer 0 (score: 0)

I came up with the code below. It works, though it may not be perfect in terms of elegance.

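import re

import numpy as np
import pandas as pd


def getRaceInfo(rawString):
    # First layout: the race header ends with digits followed directly by <br />.
    pattern = re.compile(r"(<span style='font-size: 12px;'>Race).*?\d+(<br />)", re.IGNORECASE)
    matches = pattern.finditer(rawString)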
    spanlist = []
    for m in matches: 
        spanlist.append(m.span())

    raceI= []
    replacements = [("<span style='font-size: 12px;'>", ''), 
                    ('&nbsp;&nbsp;&nbsp;', '|'), ('&nbsp;&nbsp;', ''),
                    ('<span>', ''), ('<br />', ''), ('<strong>', ''), ('</strong>',''),
                    ('</span>', '|'), ('||', '|')
                   ]

    for (i, j) in spanlist:
        nString = rawString[i:j]
        for t, r in replacements: 
            nString = nString.replace(t, r)
        raceI.append(nString)

    if len(raceI) != 0: 
        df = pd.DataFrame(raceI,columns=['temp'])
        df[['RACE_NUM','Race_Grade','Race_Distance','Race_PrizeMoney1','Race_PrizeMoney2']] = df['temp'].str.split('|',expand=True)
        df['Race_Other'] = np.nan

        try:
            df[['RACE_NUM','RACE_Detail']] = df['RACE_NUM'].str.split(' :: ',expand=True)
        except ValueError:  # no ' :: ' separator, so the split yields one column
            df['RACE_Detail']=np.nan

        df = df.drop(['temp'], axis=1)
    else: 
        # Fallback layout: the header runs up to the opening tag of the nested <table>.
        pattern = re.compile(r"(font-size: 12px;\'>Race).*?(</span><br /><br /></span><table>)", re.IGNORECASE)
        matches = pattern.finditer(rawString)

        spanlist = []
        for m in matches: 
            spanlist.append(m.span())

        raceI= []
        replacements = [("font-size: 12px;\'>", ''), ("<span style='", ''), ('<table>', ''), 
                        ('&nbsp;&nbsp;&nbsp;', '|'), ('&nbsp;&nbsp;', ''),
                        ('<span>', ''), ('<br />', ''), ('<strong>', ''), ('</strong>',''),
                        ('</span>', '|'), ('||', '|'), ('><span style=\'font-size: 12px;\'>', '|')
                       ]

        for (i, j) in spanlist:
            nString = rawString[i:j]
            for t, r in replacements: 
                nString = nString.replace(t, r)
            raceI.append(nString)

        if len(raceI) != 0:
            df = pd.DataFrame(raceI,columns=['temp'])
            df[['RACE_NUM','Race_Grade','Race_Distance','Race_Other', 'Race_PrizeMoney1']] = df['temp'].str.split('|',expand=True)
            df['Race_PrizeMoney2'] = np.nan

            try:
                df[['RACE_NUM','RACE_Detail']] = df['RACE_NUM'].str.split(' :: ',expand=True)
            except ValueError:  # no ' :: ' separator, so the split yields one column
                df['RACE_Detail']=np.nan

            df = df.drop(['temp'], axis=1)
        else: 
            df = pd.DataFrame(columns=['RACE_NUM','Race_Grade','Race_Distance','Race_Other', 'RACE_Detail', 'Race_PrizeMoney1', 'Race_PrizeMoney2'])

    return df
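
Since the question asked about BeautifulSoup specifically, here is a minimal sketch of how the same values might be read from the parsed tree instead of with regular expressions. It is not the answer author's method. The names `parse_race`, `clean` and `SAMPLE_HTML` are hypothetical, and `SAMPLE_HTML` is a trimmed copy of the markup from the question (style attributes and image paths dropped), so the real page may need adjustments.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; an assumption, not the full page.
SAMPLE_HTML = """
<table>
  <tr valign="middle">
    <td><span><span>Race 1</span><br /><br /></span><span><span><strong>Grade:</strong>&nbsp;&nbsp;M&nbsp;&nbsp;&nbsp;400 metres</span>
      <br /></span>
      <span><span><strong>Prize Money:</strong> $1180</span>&nbsp;&nbsp;&nbsp;$825 - $235 - $120<br /><br /></span>
      <table>
        <tr><td><span>Race Time:</span></td><td><span>(8.44)</span></td><td><span>(0.00)</span></td><td><span>(22.95)</span></td><td></td></tr>
        <tr><td><span>Sectional Time:</span></td><td><span>8.44</span></td><td><span>0.00</span></td><td><span>14.51</span></td><td></td></tr>
        <tr><td><span>1<sup>st</sup> In-Running Position:</span></td><td colspan="4"><span><img alt='1' />&nbsp;<img alt='5' />&nbsp;<img alt='2' />&nbsp;<img alt='4' />&nbsp;<img alt='7' />&nbsp;</span></td></tr>
        <tr><td><span>2<sup>nd</sup> In-Running Position:</span></td><td colspan="4"><span><img alt='1' />&nbsp;<img alt='5' />&nbsp;<img alt='2' />&nbsp;<img alt='7' />&nbsp;<img alt='4' />&nbsp;</span></td></tr>
      </table></td>
    <td class="ResultsPageRightColumn"></td>
  </tr>
</table>
"""


def clean(text):
    """Collapse runs of whitespace (including &nbsp;) into single spaces."""
    return " ".join(text.split())


def parse_race(html):
    """Return the header lines and the inner-table rows of one race block."""
    soup = BeautifulSoup(html, "html.parser")
    outer_cell = soup.find("table").find("td")

    # Header lines are the <span> elements that are direct children of the
    # outer cell; recursive=False keeps the nested table's spans out.
    header = [clean(s.get_text()) for s in outer_cell.find_all("span", recursive=False)]
    result = {"header": header}

    # The nested table is the first <table> inside the outer cell.
    inner = outer_cell.find("table")
    for row in inner.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue
        label = clean(cells[0].get_text()).rstrip(":")
        imgs = row.find_all("img")
        if imgs:
            # In-running positions are stored in the alt text of the images.
            result[label] = [img.get("alt") for img in imgs]
        else:
            texts = [clean(c.get_text()).strip("()") for c in cells[1:]]
            result[label] = [t for t in texts if t]
    return result


if __name__ == "__main__":
    for key, value in parse_race(SAMPLE_HTML).items():
        print(key, value)
```

Working on the parsed tree means the nesting is handled by `find` and `find_all` rather than by string offsets, so line breaks inside the markup do not matter. The values come back as strings; converting times to `float` or loading the dictionary into a DataFrame is left to the caller.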