Python web scraping table with sub headings

Date: 2017-07-03 13:08:20

Tags: python html python-2.7 web-scraping

I am trying to extract some information from a table that appears on various webpages (my apologies for not disclosing the webpage).

<table class="toccolours" style="font-size: 85%;">
 <tbody><tr>
 <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>First sub-class</b></th>
 </tr>
  <tr>
 <td style="padding-right:5px">Info1</td>
 <td style="padding-right:5px;"><a title="Object 1">Object 1</a></td>
 <td style="text-align:center;padding-right:5px">Info 2</td>
 <td style="padding-right:5px"><a title="Object 2">Object 2</a></td>
 <td style="padding-right:5px">Info 3</td>
 <td style="text-align:center;">Info 4</td>
 <td style="text-align:center;">Info 5</td>
 <td></td>
 </tr>
<tr>
 <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Second sub-class</b></th>
 </tr>
 <tr>
 <td style="padding-right:5px">Info11</td>
 <td style="padding-right:5px;"><a title="Object 11">Object 11</a></td>
 <td style="text-align:center;padding-right:5px">Info 22</td>
 <td style="padding-right:5px"><a title="Object 22">Object 22</a></td>
 <td style="padding-right:5px">Info 33</td>
 <td style="text-align:center;">Info 44</td>
 <td style="text-align:center;">Info 55</td>
 <td></td>
 </tr>
 <tr>
 <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Third sub-class</b></th>
 </tr>
 <tr>
 <td style="padding-right:5px">Info 111</td>
 <td style="padding-right:5px;"><a title="Object 111">Object 111</a></td>
 <td style="text-align:center;padding-right:5px">Info 222</td>
 <td style="padding-right:5px">Object 222</td>
 <td style="padding-right:5px">Info 333</td>
 <td style="text-align:center;">Info 444</td>
 <td style="text-align:center;">Info 555</td>
 <td></td>
 </tr>
 </tbody></table>

The table essentially looks like the following:

[Image 1: Original Table]

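The problem is that the sub-classes, and the number of rows in each sub-class, can vary. For example, the First sub-class may in some cases have 1 item, the Second sub-class 3 items, and the Third sub-class 2 items. I may also get a table that has only the first and second sub-classes.

For example: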

[Image 2: Other example 1]

OR

[Image 3: Other example 2]

are also possible.

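I would like to get the data in a format where the sub-class value appears next to the relevant information row, as follows (for Image 1):

[Image 4: desired solution]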

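However, I am confused about how to do this in Python, because the table headers are not a separate class that appears with each row item. I can load the webpage with a web driver and extract the page source with Beautiful Soup. What I cannot figure out is how to assign the sub-class to the rows (particularly because the information rows are not elements of the sub-class row; they are simply new rows of the table).

As of now, I can use .find_all('tr') to get all the rows of the table. However, because the sub-classes and row counts are inconsistent across the tables (about 500 of them), I cannot make sense of the data. Any help would be appreciated.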

2 Answers:

Answer 0: (score: 1)

Just process your HTML row by row:

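import bs4

b = bs4.BeautifulSoup(html, 'html.parser')  # html holds the page source
data = {}
current = None
for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header row: remember its text as the current sub-class
        current = row.find_all('th')[0].text
    else:
        # not a header, so this row is data under the last header seen;
        # append rather than assign, so a sub-class can hold several rows
        data.setdefault(current, []).append(row.find_all('td'))  # do whatever processing you need here; you didn't specify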
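
For the layout the question asks for (the sub-class repeated next to each information row), the same row-by-row idea can be sketched as below. The function name `rows_with_subclass` and the shortened `sample` table are illustrative assumptions, not taken from the original page. The sketch also assumes that every header row contains a `<th>` and that the tables are not nested.

```python
from bs4 import BeautifulSoup


def rows_with_subclass(html):
    """Return one list per data row: [sub-class, cell 1, cell 2, ...]."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    current = None  # text of the most recent header row
    for row in soup.find_all("tr"):
        header = row.find("th")
        if header is not None:
            # Header row: remember its text for the rows that follow.
            current = header.get_text(strip=True)
            continue
        # Data row: take the visible text of each cell (link text included).
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        records.append([current] + cells)
    return records


# Shortened, hypothetical table in the same shape as the one in the question.
sample = """
<table class="toccolours">
<tr><th colspan="3"><b>First sub-class</b></th></tr>
<tr><td>Info1</td><td><a title="Object 1">Object 1</a></td><td></td></tr>
<tr><th colspan="3"><b>Second sub-class</b></th></tr>
<tr><td>Info11</td><td><a title="Object 11">Object 11</a></td><td></td></tr>
<tr><td>Info12</td><td>Object 12</td><td></td></tr>
</table>
"""

for record in rows_with_subclass(sample):
    print(record)
```

Each record starts with the sub-class it belongs to, so a sub-class with three rows simply yields three records. Empty cells come through as empty strings.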

If you need to preserve the order of the headers, use a list of lists instead of a dictionary:

data = []
headers = []

for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header
        headers.append(row.find_all('th')[0].text)
        data.append([])
    else:
        # this is not a header, therefore is data under the last header seen
        data[-1].append(row.find_all('td'))
print(list(zip(headers, data)))
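
To get from the `headers` and `data` lists to the flat layout requested in the question, each header can be repeated in front of every row in its group. This is a small sketch that assumes the cells were already reduced to plain strings. The sample values are shortened and partly made up for illustration, and `flat` is a hypothetical name.

```python
# Hypothetical sample standing in for the headers/data lists built above,
# with each <td> already converted to its text.
headers = ["First sub-class", "Second sub-class"]
data = [
    [["Info1", "Object 1"]],
    [["Info11", "Object 11"], ["Info12", "Object 12"]],
]

# Repeat the header in front of every row that belongs to it.
flat = [[header] + row for header, rows in zip(headers, data) for row in rows]

for record in flat:
    print(record)
```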

Answer 1: (score: 1)

I used lxml; I hope it fits your problem.

from lxml import etree

html_body = """
<table class="toccolours" style="font-size: 85%;">
 <tbody>
 <tr>
    <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>First sub-class</b></th>
 </tr>
 <tr>
 <td style="padding-right:5px">Info1</td>
 <td style="padding-right:5px;"><a title="Object 1">Object 1</a></td>
 <td style="text-align:center;padding-right:5px">Info 2</td>
 <td style="padding-right:5px"><a title="Object 2">Object 2</a></td>
 <td style="padding-right:5px">Info 3</td>
 <td style="text-align:center;">Info 4</td>
 <td style="text-align:center;">Info 5</td>
 <td></td>
 </tr>
<tr>
 <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Second sub-class</b></th>
 </tr>
 <tr>
 <td style="padding-right:5px">Info11</td>
 <td style="padding-right:5px;"><a title="Object 11">Object 11</a></td>
 <td style="text-align:center;padding-right:5px">Info 22</td>
 <td style="padding-right:5px"><a title="Object 22">Object 22</a></td>
 <td style="padding-right:5px">Info 33</td>
 <td style="text-align:center;">Info 44</td>
 <td style="text-align:center;">Info 55</td>
 <td></td>
 </tr>
 <tr>
 <th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Third sub-class</b></th>
 </tr>
 <tr>
 <td style="padding-right:5px">Info 111</td>
 <td style="padding-right:5px;"><a title="Object 111">Object 111</a></td>
 <td style="text-align:center;padding-right:5px">Info 222</td>
 <td style="padding-right:5px">Object 222</td>
 <td style="padding-right:5px">Info 333</td>
 <td style="text-align:center;">Info 444</td>
 <td style="text-align:center;">Info 555</td>
 <td></td>
 </tr>
 </tbody></table>
"""

tableData = {}

tree = etree.fromstring(html_body, parser=etree.HTMLParser())
for i in tree.xpath("//tr/th[@colspan]"):
    className = i.getchildren()[0].text
    tableData[className] = []

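    parentTag = i.getparent()
    # note: getnext() reads only the first row after the header;
    # later rows of the same sub-class are not collected
    tableBody = parentTag.getnext().xpath('td')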
    for cell in tableBody:
        if cell.text:
            tableData[className].append(cell.text)
        else:
            child_tag = cell.getchildren()
            if child_tag:
                tableData[className].append(child_tag[0].text)

print(tableData)

Output:

>>> {'Second sub-class': ['Info11', 'Object 11', 'Info 22', 'Object 22', 'Info 33', 'Info 44', 'Info 55'], 'First sub-class': ['Info1', 'Object 1', 'Info 2', 'Object 2', 'Info 3', 'Info 4', 'Info 5'], 'Third sub-class': ['Info 111', 'Object 111', 'Info 222', 'Object 222', 'Info 333', 'Info 444', 'Info 555']}
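
Since the question says a sub-class may span several rows, here is a minimal lxml sketch that walks the sibling rows after each header until it reaches the next header. The function name `group_rows` and the shortened `sample` table are assumptions made for illustration. The sketch also assumes that all `<tr>` elements of a table are siblings, as in the HTML above.

```python
from lxml import etree


def group_rows(html):
    """Return [(sub-class, [row, row, ...]), ...] in page order."""
    tree = etree.fromstring(html, parser=etree.HTMLParser())
    grouped = []
    # Select every row that contains a header cell with a colspan attribute.
    for header_row in tree.xpath("//tr[th[@colspan]]"):
        name = "".join(header_row.itertext()).strip()
        rows = []
        sibling = header_row.getnext()
        # Collect rows until the next header row or the end of the table.
        while sibling is not None and not sibling.xpath("th"):
            rows.append(
                ["".join(td.itertext()).strip() for td in sibling.xpath("td")]
            )
            sibling = sibling.getnext()
        grouped.append((name, rows))
    return grouped


# Shortened, hypothetical table in the same shape as the one in the question.
sample = """
<table class="toccolours">
<tr><th colspan="3"><b>First sub-class</b></th></tr>
<tr><td>Info1</td><td><a title="Object 1">Object 1</a></td><td></td></tr>
<tr><th colspan="3"><b>Second sub-class</b></th></tr>
<tr><td>Info11</td><td><a title="Object 11">Object 11</a></td><td></td></tr>
<tr><td>Info12</td><td>Object 12</td><td></td></tr>
</table>
"""

for name, rows in group_rows(sample):
    for row in rows:
        print([name] + row)
```

A list of tuples is used here instead of a dictionary so that the sub-classes keep the order in which they appear on the page.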