How to parse multiple tr tags in Python

Time: 2016-02-16 17:55:28

Tags: python html beautifulsoup

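I'm currently having trouble parsing all the tr tags that appear in a table. I can parse the first tr tag, but I can't figure out how to parse all the subsequent ones. I thought of using a for loop, but it didn't work. I've included only the part of the code that deals with the tr tags I want to store in a JSON file.

Here is what I tried: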

def parseFacultyPage(br, facultyID):
    if br is None:
        return None

    br.open('https://academics.vit.ac.in/student/stud_home.asp')
    response = br.open('https://academics.vit.ac.in/student/class_message_view.asp?sem=' + facultyID)
    html = response.read()
    soup = BeautifulSoup(html)
    tables = soup.findAll('table')

    # Extracting basic information of the faculty
    infoTable = tables[0].findAll('tr')
    name = infoTable[2].findAll('td')[0].text
    if len(name) == 0:
        return None
    subject = infoTable[2].findAll('td')[1].text
    msg = infoTable[2].findAll('td')[2].text
    sent = infoTable[2].findAll('td')[3].text
    emailmsg = 'Subject: New VIT Email' + msg

Here is sample HTML for the case where there is more than one tr tag.

<table width="79%" border="0" cellpadding="0" cellspacing="0" height="350">
  <tr>
    <td valign="top" width="1%" bgcolor=#FFFFFF>
        &nbsp;
    </td>
    <td valign="top" width="78%" bgcolor=#FFFFFF>



    <center><b><u>VIEW CLASS MESSAGE - Winter Semester 2015~16</u></b></center>
    <br><br>


        <br>
        <table cellpadding=4 cellspacing=2 border=0 bordercolor='black' width="100%">

        <tr bgcolor=#5A768D>
            <td width="25%"><font color=#FFFFFF>From</font></td>
            <td width="25%"><font color=#FFFFFF>Course</font></td>
            <td><font color=#FFFFFF>Message</font></td>
            <td width="10%"><font color=#FFFFFF>Posted On</font></td>
        </tr>

            <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
                <td valign="top">RAGHAVAN R (SITE)</td>
                <td valign="top">ITE308 - Distributed Systems - TH</td>
                <td valign="top">Dear students,

As informed in the class, this is to remind you Today special class from 6 to 6.50 pm at same venue SJT 126.

regards

R. Raghavan
SITE</td>
                <td valign="top">11/02/2016 11:42:57</td>
            </tr>

            <tr bgcolor="#EDEADE" onMouseOut="this.bgColor='#EDEADE'" onMouseOver="this.bgColor='#FFF9EA'">
                <td valign="top">SMART (APT) (ACAD)</td>
                <td valign="top">STS302 - Soft Skills - SS</td>
                <td valign="top">Dear Students,

As  04 Feb 16 to 08 Feb 16 were announced as “No Instruction days”, the first assessment that was supposed to happen from 08 Feb 16 to 12 Feb 16 is being postponed to 7th week (15 Feb 16 to 19 Feb 16)
</td>
                <td valign="top">10/02/2016 21:48:14</td>
            </tr>

        <tr bgcolor=#5A768D>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
            <td>&nbsp;</td>
        </tr>

        </table>


    <br><br>
    </td>
  </tr>
</table>

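1 Answer:

Answer 0 (score: 3)

You should iterate through the rows as shown below and, at the start of each iteration, query that row's cells into a columns variable.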

for index, row in enumerate(tables[1].findAll('tr')):
    # Skip the header row
    if index == 0:
        continue

    # Query this row's cells once, then read each field from them
    columns = row.findAll('td')
    name = columns[0].text
    if not name:
        return None
    subject = columns[1].text
    msg = columns[2].text
    sent = columns[3].text

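Edit: It looks like your HTML has two table structures, one nested inside the other. You need the inner one, so use index 1, that is, tables[1] rather than tables[0].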

I also wrapped the iterator in enumerate, so you have the row index as well. With it you can skip the header row when index==0.
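To tie this back to the goal of storing the rows in a JSON file, here is a minimal, self-contained sketch of the same loop that collects every row into a list of dicts instead of overwriting the variables on each pass. It is only one possible way to do it and makes several assumptions not in the original: it uses the bs4 package with the built-in html.parser backend, the function name parse_messages and the dict keys are made up for illustration, SAMPLE_HTML is a trimmed-down version of the HTML in the question with shortened placeholder message text, and rows whose cells are all blank (such as the trailing &nbsp; row) are skipped.

```python
import json

from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (message text is
# shortened placeholder text; layout and attributes are simplified).
SAMPLE_HTML = """
<table>
  <tr><td>
    <table>
      <tr bgcolor="#5A768D"><td>From</td><td>Course</td><td>Message</td><td>Posted On</td></tr>
      <tr><td>RAGHAVAN R (SITE)</td><td>ITE308 - Distributed Systems - TH</td><td>Special class today</td><td>11/02/2016 11:42:57</td></tr>
      <tr><td>SMART (APT) (ACAD)</td><td>STS302 - Soft Skills - SS</td><td>Assessment postponed</td><td>10/02/2016 21:48:14</td></tr>
      <tr bgcolor="#5A768D"><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td><td>&nbsp;</td></tr>
    </table>
  </td></tr>
</table>
"""


def parse_messages(html):
    """Return one dict per message row of the inner table (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table")
    if len(tables) < 2:
        # The inner table is missing, so there is nothing to parse.
        return []

    messages = []
    for index, row in enumerate(tables[1].find_all("tr")):
        if index == 0:
            continue  # skip the header row

        # strip=True also removes the non-breaking space left by &nbsp;
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) < 4 or not any(cells):
            continue  # skip malformed rows and the blank spacer row

        messages.append({
            "from": cells[0],
            "course": cells[1],
            "message": cells[2],
            "posted_on": cells[3],
        })
    return messages


messages = parse_messages(SAMPLE_HTML)
print(json.dumps(messages, indent=2))
```

To write the result to a file rather than print it, json.dump(messages, f) inside a with open(...) block should work the same way; the file name is up to you.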