BeautifulSoup: my for loop prints the data from every td tag. I want to exclude the last group of td tags

Date: 2016-05-14 21:53:26

Tags: python-2.7 beautifulsoup

I have the following sample HTML table from an HTML file:

<table>
    <tr>
        <th>Class</th>
        <th class="failed">Fail</th>
        <th class="failed">Error</th>
        <th>Skip</th>
        <th>Success</th>
        <th>Total</th>
    </tr>
    <tr>
        <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
        <td class="failed">1</td>
        <td class="failed">9</td>
        <td>0</td>
        <td>219</td>
        <td>229</td>
    </tr>
    <tr>
        <td><strong>Total</strong></td>
        <td class="failed">1</td>
        <td class="failed">9</td>
        <td>0</td>
        <td>219</td>
        <td>229</td>
    </tr>
</table>

I am trying to print the text from the <td> tags, starting from:

Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2

I do not want to include the text from the <td> tags of the row that starts with:

<td><strong>Total</strong></td>

My code, which currently prints the text from every <td> tag, including the Total row that I want left out of the output:

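from bs4 import BeautifulSoup


def extract_data_from_report():
    # Read the saved report and parse it with Python's built-in HTML parser.
    with open(r"E:\SeleniumTestReport.html", 'r') as report_file:
        html_report = report_file.read()
    soup = BeautifulSoup(html_report, "html.parser")
    th = soup.find_all('th')
    td = soup.find_all('td')

    # Print the header cells on one line (Python 2 print statement).
    for item in th:
        print item.text,
    print "\n"
    # Prints every <td> in the document, so the "Total" row is printed too.
    for item in td:
        print item.text,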

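1 answer:

Answer 0 (score: 1)

You can find all of the rows (tr elements) except the first one (to skip the header) and the last one, which is the "Total" row. An example implementation that produces a list of dictionaries: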

from pprint import pprint

from bs4 import BeautifulSoup


data = """
<table>
    <tr>
        <th>Class</th>
        <th class="failed">Fail</th>
        <th class="failed">Error</th>
        <th>Skip</th>
        <th>Success</th>
        <th>Total</th>
    </tr>
        <tr>
            <td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
            <td class="failed">1</td>
            <td class="failed">9</td>
            <td>0</td>
            <td>219</td>
            <td>229</td>
        </tr>
    <tr>
        <td><strong>Total</strong></td>
        <td class="failed">1</td>
        <td class="failed">9</td>
        <td>0</td>
        <td>219</td>
        <td>229</td>
    </tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

headers = [header.get_text(strip=True) for header in soup.find_all("th")]
rows = [dict(zip(headers, [td.get_text(strip=True) for td in row.find_all("td")]))
        for row in soup.find_all("tr")[1:-1]]

pprint(rows)

Prints:

[{u'Class': u'Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2',
  u'Error': u'9',
  u'Fail': u'1',
  u'Skip': u'0',
  u'Success': u'219',
  u'Total': u'229'}]
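
The [1:-1] slice above relies on the "Total" row always being the last tr. If that position is not guaranteed, one possible variation is to skip the row by the text of its first cell and print the remaining rows as plain text, which is closer to what the question asks for. This is only a sketch: the function name extract_rows and the skip_label parameter are hypothetical names made up for illustration, and the table is a shortened copy of the one in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (fewer columns, for brevity).
html = """
<table>
    <tr><th>Class</th><th class="failed">Fail</th><th>Total</th></tr>
    <tr><td>Regression_TestCase.RegressionProject_TestCase2</td><td class="failed">1</td><td>229</td></tr>
    <tr><td><strong>Total</strong></td><td class="failed">1</td><td>229</td></tr>
</table>"""


def extract_rows(html_text, skip_label="Total"):
    """Hypothetical helper: return (headers, rows), leaving out any row
    whose first cell text equals skip_label."""
    soup = BeautifulSoup(html_text, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find_all("th")]
    rows = []
    for tr in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # The header row has no <td> cells; the summary row starts with skip_label.
        if not cells or cells[0] == skip_label:
            continue
        rows.append(cells)
    return headers, rows


headers, rows = extract_rows(html)
print(" ".join(headers))
for cells in rows:
    print(" ".join(cells))
```

With the shortened table this prints the header line "Class Fail Total" followed by the single data row, and the Total row is dropped wherever it appears in the table. Note that a data row whose class name happened to be exactly "Total" would also be skipped, so the positional slice may still be the better choice when the layout is fixed.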