Formatting HTML data extracted with Beautiful Soup

Posted: 2016-05-13 14:13:48

Tags: python python-2.7 beautifulsoup

I have a test report file in HTML format generated by Nose. I want to extract some parts of its text in Python so that I can send them in the body of an email.

Here is a sample of the report:

<table>
        <tr>
            <th>Class</th>
            <th class="failed">Fail</th>
            <th class="failed">Error</th>
            <th>Skip</th>
            <th>Success</th>
            <th>Total</th>
        </tr>
            <tr>
                <td>Regression_TestCase</td>
                <td class="failed">1</td>
                <td class="failed">9</td>
                <td>0</td>
                <td>219</td>
                <td>229</td>
            </tr>
        <tr>
            <td><strong>Total</strong></td>
            <td class="failed">1</td>
            <td class="failed">9</td>
            <td>0</td>
            <td>219</td>
            <td>229</td>
        </tr>
    </table>

When I open the file in a browser, the table is laid out as shown below. This is the text I want to extract from the HTML file:

    Class             Fail Error    Skip    Success     Total
Regression_TestCase     1    9       0      219         229

Using BeautifulSoup4 with Python 2.7, I have managed to extract the following:

[<th>Class</th>, <th class="failed">Fail</th>, <th class="failed">Error</th>, <th>Skip</th>, <th>Success</th>, <th>Total</th>]

[<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>, <td class="failed">1</td>, <td class="failed">9</td>, <td>0</td>, <td>219</td>, <td>229</td>, <td><strong>Total</strong></td>, <td class="failed">1</td>, <td class="failed">9</td>, <td>0</td>, <td>219</td>, <td>229</td>]

My code is as follows:

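from bs4 import BeautifulSoup

def extract_pass_summary_from_selenium_report():
    # Read the Nose HTML report and parse it
    with open(r"C:\test_runners\selenium_regression_test_5_1_1\ClearCore 501 - Regression Test\TestReport\SeleniumTestReport.html", 'r') as report_file:
        html_report = report_file.read()
    soup = BeautifulSoup(html_report, "html.parser")

    # Print the raw header and data cells (tags included)
    print(soup.find_all('th'))

    print(soup.find_all('td'))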

How can I extract just the text and keep the layout shown below?

    Class             Fail Error    Skip    Success     Total
Regression_TestCase     1    9       0      219         229

Thanks, Riaz

2 answers:

Answer 0: (score: 3)

You could solve this with BeautifulSoup alone, but I would use pandas and parse the HTML table into a convenient DataFrame with pandas.read_html():

from StringIO import StringIO

import pandas as pd

data = """
<table>
        <tr>
            <th>Class</th>
            <th class="failed">Fail</th>
            <th class="failed">Error</th>
            <th>Skip</th>
            <th>Success</th>
            <th>Total</th>
        </tr>
            <tr>
                <td>Regression_TestCase</td>
                <td class="failed">1</td>
                <td class="failed">9</td>
                <td>0</td>
                <td>219</td>
                <td>229</td>
            </tr>
        <tr>
            <td><strong>Total</strong></td>
            <td class="failed">1</td>
            <td class="failed">9</td>
            <td>0</td>
            <td>219</td>
            <td>229</td>
        </tr>
    </table>"""

df = pd.read_html(StringIO(data))
print(df)

Prints:

[                     0     1      2     3        4      5
0                Class  Fail  Error  Skip  Success  Total
1  Regression_TestCase     1      9     0      219    229
2                Total     1      9     0      219    229]
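
Note that pandas.read_html() returns a list of DataFrames, one per table found, which is why the output above is wrapped in square brackets. To get closer to the layout asked for in the question, one option is to take the first table, use its first row as the header, drop the "Total" row, and render the result without the index column. The following is a minimal sketch under some assumptions: it is written for Python 3 (hence io.StringIO), it needs an HTML parser that pandas can use (such as lxml) to be installed, and the names tables, summary and report_text are only illustrative.

```python
from io import StringIO  # Python 3; on Python 2 the StringIO module is used instead

import pandas as pd

data = """
<table>
  <tr>
    <th>Class</th><th class="failed">Fail</th><th class="failed">Error</th>
    <th>Skip</th><th>Success</th><th>Total</th>
  </tr>
  <tr>
    <td>Regression_TestCase</td><td class="failed">1</td>
    <td class="failed">9</td><td>0</td><td>219</td><td>229</td>
  </tr>
  <tr>
    <td><strong>Total</strong></td><td class="failed">1</td>
    <td class="failed">9</td><td>0</td><td>219</td><td>229</td>
  </tr>
</table>"""

# read_html returns a list with one DataFrame per <table> in the input
tables = pd.read_html(StringIO(data), header=0)
df = tables[0]

# Keep only the per-class rows; the last row of the report is the grand total
summary = df[df["Class"] != "Total"]

# Render as aligned plain text without the numeric index column
report_text = summary.to_string(index=False)
print(report_text)
```

The resulting string is plain text with whitespace-aligned columns, so it only lines up in an email if the message is shown in a fixed-width font.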

Answer 1: (score: 0)

Add a function that returns the text of each tag:

def html_to_text(html):
    # Collect the text of each tag without modifying the input list
    return [tag.text for tag in html]

Then call the function in your code:

ths = soup.find_all('th')
ths = html_to_text(ths)
print(ths)
tds = html_to_text(soup.find_all('td'))
print(tds)
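
The two calls above give flat lists of strings, not the aligned layout the question asks for. One possible way to get there is to split the td texts into rows of len(ths) cells and pad each column to its widest entry. This is only a sketch: format_table is a hypothetical helper, not part of BeautifulSoup, the ths and tds lists are typed in by hand to stand in for the html_to_text() results, and it assumes the number of cells is an exact multiple of the number of headers.

```python
def format_table(headers, cells):
    """Lay out a flat list of cell texts as aligned rows under the headers."""
    width = len(headers)
    # Split the flat td list into rows of len(headers) cells each
    rows = [cells[i:i + width] for i in range(0, len(cells), width)]
    table = [headers] + rows
    # Each column is as wide as its longest entry
    col_widths = [max(len(row[c]) for row in table) for c in range(width)]
    lines = []
    for row in table:
        padded = [text.ljust(col_widths[c]) for c, text in enumerate(row)]
        lines.append("  ".join(padded).rstrip())
    return "\n".join(lines)


# Sample values standing in for html_to_text(soup.find_all('th')) and
# html_to_text(soup.find_all('td')) on the report from the question
ths = ["Class", "Fail", "Error", "Skip", "Success", "Total"]
tds = ["Regression_TestCase", "1", "9", "0", "219", "229",
       "Total", "1", "9", "0", "219", "229"]

# Drop the trailing "Total" row (the last len(ths) cells) before formatting
report_text = format_table(ths, tds[:-len(ths)])
print(report_text)
```

Padding with spaces keeps the columns aligned only in a fixed-width font, which is worth keeping in mind if the text goes into an email body.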