I have the following sample HTML table, taken from an HTML file:
<table>
<tr>
<th>Class</th>
<th class="failed">Fail</th>
<th class="failed">Error</th>
<th>Skip</th>
<th>Success</th>
<th>Total</th>
</tr>
<tr>
<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
</table>
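I am trying to print the text of the <td> tags in the row that starts with:
<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
I do not want the text of the <td> tags in the row that starts with:
<td><strong>Total</strong></td>
My code prints the text of every <td> tag, including the ones in the Total row, which is not the output I want: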
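from bs4 import BeautifulSoup

def extract_data_from_report():
    html_report = open(r"E:\SeleniumTestReport.html", 'r').read()
    soup = BeautifulSoup(html_report, "html.parser")
    th = soup.find_all('th')
    td = soup.find_all('td')
    # Print every header cell on one line (Python 2 print statement).
    for item in th:
        print item.text,
    print "\n"
    # Prints every <td>, including those in the "Total" row.
    for item in td:
        print item.text,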
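Answer 0 (score: 1)
You can find all rows (tr elements) except the first one (to skip the header) and the last one, the "Total" row. A sample implementation that produces a list of dictionaries: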
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<th>Class</th>
<th class="failed">Fail</th>
<th class="failed">Error</th>
<th>Skip</th>
<th>Success</th>
<th>Total</th>
</tr>
<tr>
<td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td class="failed">1</td>
<td class="failed">9</td>
<td>0</td>
<td>219</td>
<td>229</td>
</tr>
</table>"""
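soup = BeautifulSoup(data, "html.parser")
# Column names come from the <th> cells of the header row.
headers = [header.get_text(strip=True) for header in soup.find_all("th")]
# The slice [1:-1] drops the header row and the trailing "Total" row.
rows = [dict(zip(headers, [td.get_text(strip=True) for td in row.find_all("td")]))
        for row in soup.find_all("tr")[1:-1]]
pprint(rows)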
Prints:
[{u'Class': u'Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2',
u'Error': u'9',
u'Fail': u'1',
u'Skip': u'0',
u'Success': u'219',
u'Total': u'229'}]
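The positional slice only works when the "Total" row is always last. As a possible variation, here is a minimal sketch that skips summary rows by content instead and prints the plain text the question asks for. It is written for Python 3, and it rests on an assumption that does not come from the original answer: a summary row is any row whose first <td> wraps its label in <strong>. The names `SAMPLE_HTML` and `extract_rows` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample table from the question.
SAMPLE_HTML = """
<table>
<tr><th>Class</th><th class="failed">Fail</th><th class="failed">Error</th>
<th>Skip</th><th>Success</th><th>Total</th></tr>
<tr><td>Regression_TestCase.RegressionProject_TestCase2.RegressionProject_TestCase2</td>
<td class="failed">1</td><td class="failed">9</td><td>0</td><td>219</td><td>229</td></tr>
<tr><td><strong>Total</strong></td>
<td class="failed">1</td><td class="failed">9</td><td>0</td><td>219</td><td>229</td></tr>
</table>
"""


def extract_rows(html):
    """Return (headers, data_rows), leaving out any summary row.

    Assumption: a summary row is one whose first <td> wraps its label
    in <strong>, as the "Total" row does in the sample table.
    """
    soup = BeautifulSoup(html, "html.parser")
    headers = [th.get_text(strip=True) for th in soup.find_all("th")]
    data_rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if not cells:
            continue  # header row: it holds <th> cells only
        if cells[0].find("strong") is not None:
            continue  # summary row such as "Total"
        data_rows.append([td.get_text(strip=True) for td in cells])
    return headers, data_rows


if __name__ == "__main__":
    headers, data_rows = extract_rows(SAMPLE_HTML)
    print(" ".join(headers))
    for row in data_rows:
        print(" ".join(row))
```

If a report marks its summary row differently, only the `cells[0].find("strong")` check would need to change.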