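How to parse an HTML table with rowspans in Python?

Time: 2016-09-01 18:16:46

Tags: python html python-3.x beautifulsoup html-table

Question

I am trying to parse an HTML table that contains rowspans; specifically, I am trying to parse my college timetable.

The problem I am running into is that if the previous row contains a rowspan, the next row is missing a TD at the position that the rowspan now occupies.

I have no idea how to account for this, and I would like to be able to parse this timetable.

What I tried

Pretty much everything I can think of.

The result I get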

[
    {
        'blok_eind': 4,
        'blok_start': 3,
        'dag': 4, # Should be 5
        'leraar': 'DOODF000',
        'lokaal': 'ALK C212',
        'vak': 'PROJ-T',
    },
]

As you can see, the entry with the vak value PROJ-T in the output snippet above has dag 4, whereas dag should be 5 (a.k.a. Friday / Vrijdag), as seen here:

Table

The result I want

A Python dict() that looks like the one posted above, but with the right values

Where:

  • day / dag is an int from 1~5, representing Monday~Friday
  • block_start / blok_start is an int that represents the block in which the course starts (the time block, left side of the table)
  • block_end / blok_eind is an int that represents the block in which the course ends
  • classroom / lokaal is the code of the classroom the course is in
  • teacher / leraar is the teacher's ID
  • course / vak is the ID of the course

Basic HTML structure for the data above

<center>
  <table>
    <tr>
      <td>
        <table>
          <tbody>
            <tr>
              <td> <font> TEACHER-ID </font> </td>
              <td> <font> <b> CLASSROOM ID </b> </font> </td>
            </tr>
            <tr>
              <td> <font> COURSE ID </font> </td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </table>
</center>

Code

HTML

<CENTER><font size="3" face="Arial" color="#000000"> <BR></font> <font size="6" face="Arial" color="#0000FF"> 16AO4EIO1B &nbsp;</font> <font size="4" face="Arial"> IO1B </font> <BR> <TABLE border="3" rules="all" cellpadding="1" cellspacing="1">
<TR> <TD align="center"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial" color="#000000"> Maandag 29-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Dinsdag 30-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Woensdag 31-08 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Donderdag 01-09 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> Vrijdag 02-09 </font> </TD> </TR> </TABLE> </TD> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>1</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 8:30 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 9:20 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> WEBD </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>2</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 9:20 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 10:10 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021B</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> WEBD </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>3</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 10:25 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 11:15 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> DOODF000 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK C212</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> PROJ-T </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>4</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 11:15 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:05 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> BLEEJ002 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B021B</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> MENT </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>5</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:05 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:55 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>6</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 12:55 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 13:45 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> JONGJ003 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B008</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> BURG </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>7</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 13:45 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 14:35 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> FLUIP000 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B004</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> ICT algemeen Prakti </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>8</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 14:50 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 15:40 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=4 align="center" nowrap="1"> <TABLE> <TR> <TD width="50%" nowrap=1><font size="2" face="Arial"> KOOLE000 </font> </TD> <TD width="50%" nowrap=1><font size="2" face="Arial"> <B>ALK B008</B> </font> </TD> </TR> <TR> <TD colspan="2" width="50%" nowrap=1><font size="2" face="Arial"> NED </font> </TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>9</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 15:40 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 16:30 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
<TR> <TD rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD align="center" rowspan="2" nowrap=1><font size="3" face="Arial"> <B>10</B> </font> </TD> <TD align="center" nowrap=1><font size="2" face="Arial"> 16:30 </font> </TD> </TR> <TR> <TD align="center" nowrap=1><font size="2" face="Arial"> 17:20 </font> </TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> <TD colspan=12 rowspan=2 align="center" nowrap="1"> <TABLE> <TR> <TD></TD> </TR> </TABLE> </TD> </TR> <TR> </TR>
</TABLE> <TABLE cellspacing="1" cellpadding="1"> <TR> <TD valign=bottom> <font size="4" face="Arial" color="#0000FF"></TR></TABLE><font size="3" face="Arial"> Periode1 29-08-2016 (35) - 04-09-2016 (35) G r u b e r &amp; P e t t e r s S o f t w a r e </font></CENTER>

Python

from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")
daytable = {
    1: "Maandag",
    2: "Dinsdag",
    3: "Woensdag",
    4: "Donderdag",
    5: "Vrijdag"
}
timetable = {
    1: ("8:30", "9:20"),
    2: ("9:20", "10:10"),
    3: ("10:25", "11:15"),
    4: ("11:15", "12:05"),
    5: ("12:05", "12:55"),
    6: ("12:55", "13:45"),
    7: ("13:45", "14:35"),
    8: ("14:50", "15:40"),
    9: ("15:40", "16:30"),
    10: ("16:30", "17:20"),
}

page = BeautifulSoup(r.content, "lxml")

roster = []
big_rows = 2
last_row_big = False
# There are 10 blocks, each made up out of 2 TR's, run through them
for block_count in range(2, 22, 2):
    # There are 5 days, first column is not data we want
    for day in range(2, 7):
        dayroster = {
            "dag": 0,
            "blok_start": 0,
            "blok_eind": 0,
            "lokaal": "",
            "leraar": "",
            "vak": ""
        }
        # This selector provides the classroom
        table_bold = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font > b")

        # This selector provides the teacher's code and the course ID
        table = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ") > table > tr > td > font")

        # This gets the rowspan on the current row and column
        rowspan = page.select(
            "html > body > center > table > tr:nth-of-type(" + str(block_count) + ") > td:nth-of-type(" + str(
                day) + ")")

        try:
            if table or table_bold and rowspan[0].attrs.get("rowspan") == "4":
                last_row_big = True
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2) + 1
            else:
                last_row_big = False
                # Setting end of class
                dayroster["blok_eind"] = (block_count // 2)
        except IndexError:
            pass

        if table_bold:
            x = table_bold[0]
            # Classroom ID
            dayroster["lokaal"] = x.contents[0]

        if table:
            iter = 0
            for x in table:
                content = x.contents[0].lstrip("\r\n").rstrip("\r\n")
                # Cell has data
                if content != "":
                    # Set start of class
                    dayroster["blok_start"] = block_count // 2
                    # Set day of class
                    dayroster["dag"] = day - 1
                    if iter == 0:
                        # Teacher ID
                        dayroster["leraar"] = content
                    elif iter == 1:
                        # Course ID
                        dayroster["vak"] = content
                    iter += 1

        if table or table_bold:
            # Store the data
            roster.append(dayroster)

# Remove duplicates
seen = set()
new_l = []
for d in roster:
    t = tuple(d.items())
    if t not in seen:
        seen.add(t)
        new_l.append(d)
pprint(new_l)


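2 Answers:

Answer 0 (score: 12)

You have to track the rowspans of preceding rows, one per column.

You can do this by copying the integer value of a rowspan into a dictionary; subsequent rows decrement the rowspan value until it drops to 1 (or we could store the integer value minus 1 and let it drop to 0, for ease of coding). Then you can adjust the column counts of subsequent rows based on the preceding rowspans.

Your table complicates this a little by using a default span of size 2 and incrementing in steps of two, but that can easily be brought back to manageable numbers by dividing by 2.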
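As a rough illustration of that division by 2 (this is not taken from the answer itself): assuming rows are counted from 0 starting at the first timetable row below the header, and that every block is exactly 2 table rows high, a cell's row index and rowspan can be converted into block numbers as below. The function name block_range is made up for this sketch.

```python
def block_range(row_index, rowspan):
    """Return (first_block, last_block) for a timetable cell.

    Assumes every block occupies 2 table rows, so a cell with rowspan=2
    covers one block and rowspan=4 covers two. row_index is 0-based and
    counts from the first row below the header row.
    """
    first_block = row_index // 2 + 1
    last_block = first_block + rowspan // 2 - 1
    return first_block, last_block


# The PROJ-T cell starts in the fifth data row (index 4) with rowspan=4.
print(block_range(4, 4))  # (3, 4)
```

This matches the blok_start 3 and blok_eind 4 shown for PROJ-T in the question.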

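Rather than using massive CSS selectors, select just the table rows and iterate over those:

[code block not preserved in this copy]

This produces the correct output:

[output not preserved in this copy]

Moreover, this code will continue to work even if a course spans more than 2 blocks, or just one block; any rowspan size is supported.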
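The bookkeeping described in this answer can be sketched roughly as follows. This is not the answerer's own code. It assumes each row has already been reduced to a list of (rowspan, value) pairs (for example by reading the rowspan attribute of the direct td children of each tr), and it ignores colspan. The names place_cells and ROWS, and the sample data, are hypothetical.

```python
def place_cells(rows):
    """Assign a real column index to every cell, honouring rowspans.

    rows: list of table rows, each a list of (rowspan, value) tuples in
    document order. Returns a list of (row_index, column, rowspan, value).
    """
    remaining = {}  # column -> how many further rows an earlier cell still covers
    placed = []
    for row_index, cells in enumerate(rows):
        # Columns taken by a cell that started in a previous row.
        occupied = {col for col, left in remaining.items() if left > 0}
        for col in occupied:
            remaining[col] -= 1  # this row uses up one row of that span
        column = 0
        for rowspan, value in cells:
            while column in occupied:
                column += 1  # skip columns filled from above
            placed.append((row_index, column, rowspan, value))
            remaining[column] = rowspan - 1  # store "minus 1" so it drops to 0
            column += 1
    return placed


# Hypothetical mini timetable: column 0 is the block label, columns 1-3 are days.
# Every block is 2 rows high, and the second row of each block is empty.
ROWS = [
    [(2, "1"), (2, ""), (2, ""), (4, "WEBD")],
    [],
    [(2, "2"), (4, "MENT"), (2, "")],
    [],
    [(2, "3"), (2, ""), (2, "PROJ-T")],
    [],
]

for row_index, column, rowspan, value in place_cells(ROWS):
    if column > 0 and value:
        print(row_index, column, rowspan, value)
# 0 3 4 WEBD
# 2 1 4 MENT
# 4 3 2 PROJ-T
```

PROJ-T is only the third td in its row, so counting the td elements alone would put it in column 2. Because column 1 is still covered by the MENT cell from the block above, the tracker shifts it to column 3, which is the same kind of correction the dag value in the question needs.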

Answer 1 (score: 2)

Maybe it is better to use bs4's built-in functions, such as "findAll", to parse your table.

You can use the following code:

from pprint import pprint
from bs4 import BeautifulSoup
import requests

r = requests.get("http://rooster.horizoncollege.nl/rstr/ECO/AMR/400-ECO/Roosters/36"
                 "/c/c00025.htm")

content=r.content
page = BeautifulSoup(content, "html")
table=page.find('table')
trs=table.findAll("tr", {},recursive=False)
tr_count=0
trs.pop(0)
final_table={}

for tr in trs:
    tds=tr.findAll("td", {},recursive=False)
    if tds:
        td_count=0
        tds.pop(0)
        for td in tds:
            if td.has_attr('rowspan'):                              
                final_table[str(tr_count)+"-"+str(td_count)]=td.text.strip()
                if int(td.attrs['rowspan'])==4:
                    final_table[str(tr_count+1)+"-"+str(td_count)]=td.text.strip()
                if (str(tr_count)+"-"+str(td_count+1)) in final_table:  # dict.has_key() was removed in Python 3
                    td_count=td_count+1         
            td_count=td_count+1
        tr_count=tr_count+1

roster=[]
for i in range(0,10): #iterate over time
    for j in range(0,5): #iterate over day
        item=final_table[str(i)+"-"+str(j)]
        if len(item)!=0:    
            block_eind=i+1          

            try:
                if final_table[str(i+1)+"-"+str(j)]==final_table[str(i)+"-"+str(j)]:
                        block_eind=i+2
            except KeyError:
                pass

            try:
                lokaal=item.split('\r\n \n\n')[0]
                leraar=item.split('\r\n \n\n')[1].split('\n \n\r\n')[0]
                vak=item.split('\n \n\r\n')[1]
            except IndexError:
                lokaal=leraar=vak="---"

            dayroster = {
                "dag": j+1,
                "blok_start": i+1,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }


            dayroster_double = {
                "dag": j+1,
                "blok_start": i,
                "blok_eind": block_eind,
                "lokaal": lokaal,
                "leraar": leraar,
                "vak": vak
            }

            #use to prevent double dict for same event
            if dayroster_double not in roster:
                roster.append(dayroster)

print (roster)