Parsing an HTML table with BeautifulSoup

Time: 2017-01-17 05:57:10

Tags: python html beautifulsoup

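This is my first time using BeautifulSoup, and I am trying to parse an HTML table. So far, working from other examples, I have been able to write some simple code that gets very close to what I need. However, by using ele.text.strip() I end up losing some of the information I want to keep.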

This is what my code looks like right now:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data_table.htm"), "html.parser")

table = soup.find("div", id="CT_Main_1_divResults")
table_body = table.find('tbody')
rows = table_body.find_all('tr')

data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)

Result:

[u'$4,090,000,000',
 u'13.61%',
 u'4,550,000',
 u'100 Grainger Pkwy.',
 u'',
 u'',
 u'']

I thought I could simply drop the ele.text.strip() line and keep the rest of the code the same, like this:

data = []
for row in rows:
    cols = row.find_all('td')
    data.append(cols)

This gives the following result:

[<td><span style="text-align: right; height: 36px;">$4,090,000,000</span></td>,
 <td><span style="text-align: right; height: 36px;">13.61%</span></td>,
 <td><span style="text-align: right; height: 36px;">4,550,000</span></td>,
 <td class=""><span style="text-align: right; height: 36px;">100 Grainger Pkwy.</span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/cancel.gif"/></span></td>,
 <td class="tdbrdrright"><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>]

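One way to solve this might be to use the second option and do some fancy string parsing to get what I need, but I am hoping there is a better way. In the end, I would like the data to look like the following. How can I adjust the code to achieve this?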

[u'$4,090,000,000',
 u'13.61%',
 u'4,550,000',
 u'100 Grainger Pkwy.',
 u'Inside%20the%20Databases.com_files/True.gif',
 u'Inside%20the%20Databases.com_files/cancel.gif',
 u'Inside%20the%20Databases.com_files/True.gif']

2 answers:

Answer 0 (score: 1)

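Give this a try. If a cell contains multiple img tags, or text together with an img tag, and so on, you will need to adapt it to your needs, but this should get you started on the right path.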

from bs4 import BeautifulSoup

soup = BeautifulSoup(open("data-table.html"), 'html.parser')

table = soup.find("div", id="CT_Main_1_divResults")
table_body = table.find('tbody')
rows = table_body.find_all('tr')

data = []
for row in rows:
    cols = []
    for col in row.find_all('td'):
        t = col.text.strip()
        if not t:
            # No text in this cell: fall back to the src of the cell's own image.
            # Search within col, not row, so each cell gets its own image.
            img = col.find('img')
            if img is not None:
                t = img.attrs['src']

        cols.append(t)
    data.append(cols)

print(data)

Output:

[[u'$4,090,000,000', u'13.61%', u'4,550,000', u'100 Grainger Pkwy.', u'Inside%20the%20Databases.com_files/True.gif', u'Inside%20the%20Databases.com_files/cancel.gif', u'Inside%20the%20Databases.com_files/True.gif']]
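
For cells that hold several img tags, or text together with an image, the loop above needs the extra handling this answer mentions. Below is a minimal sketch of one possible approach. The helper name cell_values and the inline sample markup are hypothetical, made up for illustration, and are not from the original post. It returns a list per cell rather than a single string, which may or may not suit your data.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration only.
SAMPLE = """
<table><tbody><tr>
  <td><span>13.61%</span></td>
  <td><span><img src="a.gif"/><img src="b.gif"/></span></td>
  <td><span>note <img src="c.gif"/></span></td>
  <td><span></span></td>
</tr></tbody></table>
"""


def cell_values(td):
    """Return the cell's stripped text (if any) followed by every img src."""
    values = []
    text = td.get_text(strip=True)
    if text:
        values.append(text)
    # img.get('src') returns None when the attribute is missing; skip those.
    values.extend(img.get('src') for img in td.find_all('img') if img.get('src'))
    return values


soup = BeautifulSoup(SAMPLE, 'html.parser')
rows = [[cell_values(td) for td in tr.find_all('td')]
        for tr in soup.find_all('tr')]
print(rows)
```

An empty list marks a cell with neither text nor an image, so the column positions stay aligned across rows.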

Answer 1 (score: 1)

import bs4

html = '''<td><span style="text-align: right; height: 36px;">$4,090,000,000</span></td>,
 <td><span style="text-align: right; height: 36px;">13.61%</span></td>,
 <td><span style="text-align: right; height: 36px;">4,550,000</span></td>,
 <td class=""><span style="text-align: right; height: 36px;">100 Grainger Pkwy.</span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>,
 <td><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/cancel.gif"/></span></td>,
 <td class="tdbrdrright"><span style="text-align: right; height: 36px;"><img src="Inside%20the%20Databases.com_files/True.gif"/></span></td>'''
soup = bs4.BeautifulSoup(html, 'lxml')

for td in soup('td'):
    if td.text:
        print(td.text)
    else:
        print(td.img.get('src'))

Output:

$4,090,000,000
13.61%
4,550,000
100 Grainger Pkwy.
Inside%20the%20Databases.com_files/True.gif
Inside%20the%20Databases.com_files/cancel.gif
Inside%20the%20Databases.com_files/True.gif

Change print to append and you will get this output as a list.
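
A rough sketch of that list-building variant follows. It makes a few assumptions that are not in the answer: the sample is trimmed to two cells, 'html.parser' is swapped in for 'lxml' so that no extra package is needed, and a guard is added for cells that have neither text nor an image.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample: one text cell and one image cell.
html = '''<td><span>4,550,000</span></td>
<td><span><img src="Inside%20the%20Databases.com_files/cancel.gif"/></span></td>'''

# 'html.parser' is an assumption here; the answer itself uses 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

values = []
for td in soup.find_all('td'):
    text = td.get_text(strip=True)
    if text:
        values.append(text)
    elif td.img is not None:
        # No text: take the src attribute of the first image in the cell.
        values.append(td.img.get('src'))
    else:
        # Neither text nor image: keep an empty placeholder for the column.
        values.append('')

print(values)
```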

The missing information you want is in the img tag's attributes, not in its text.
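
To illustrate the difference between a tag's text and its attributes, here is a small sketch on a single cell. The alt attribute and the shortened file name are assumptions added for the example. Note that indexing with ['src'] raises KeyError when the attribute is absent, while .get() returns None.

```python
from bs4 import BeautifulSoup

# A single image-only cell; the alt attribute is made up for illustration.
cell = BeautifulSoup(
    '<td><span><img src="True.gif" alt="yes"/></span></td>', 'html.parser'
).td

text = cell.text              # an img element contributes no text
attrs = cell.img.attrs        # dict of the tag's attributes
src = cell.img['src']         # dictionary-style access
width = cell.img.get('width') # missing attribute, so .get() gives None

print(repr(text), attrs, src, width)
```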