Question

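How do I extract a table and its values with BeautifulSoup? I tried following the bs4 documentation, but I am having trouble looking elements up by class or by th value. How can I get specifically the {underReplicatedBlocks} value out of the whole HTML page?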

<div class="page-header"><h1><small>Decommissioning</small></h1></div>
<small>
<table class="table">
  <thead>
    <tr>
      <th>Node</th>
      <th>Last contact</th>
      <th>Under replicated blocks</th>
      <th>Blocks with no live replicas</th>
      <th>Under Replicated Blocks <br/>In files under construction</th>
    </tr>
  </thead>
  {#DecomNodes}
  <tr>
    <td>{name} ({xferaddr})</td>
    <td>{lastContact}</td>
    <td>{underReplicatedBlocks}</td>
    <td>{decommissionOnlyReplicas}</td>
    <td>{underReplicateInOpenFiles}</td>
  </tr>
  {/DecomNodes}
</table>
</small>

Answer 1

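If the tr elements you want occur at every third row of the document you are scraping, you can use this, which takes the third tr and then every third one after it: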

rows = soup.find_all('tr')[2::3]
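
To make the slice concrete, here is a minimal sketch on a made-up six-row table. The markup and the r0..r5 labels are assumptions for illustration only and are not taken from the page in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: six rows labelled r0..r5.
html = "<table>" + "".join(f"<tr><td>r{i}</td></tr>" for i in range(6)) + "</table>"
soup = BeautifulSoup(html, "html.parser")

# Start at index 2 (the third tr) and step by 3.
rows = soup.find_all("tr")[2::3]
picked = [row.td.get_text() for row in rows]
print(picked)
# ['r2', 'r5']
```

Note that this only selects rows; you would still need to pick the right td inside each row.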

Answer 2

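Since the tag you want has no distinguishing class, you will have to inspect the HTML and hard-code the index. Look at the table and work out which row (<tr>) holds the text you need, then do the same for the column.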

Since it sits in the second row and the third column, you would use this:

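from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', class_='table')
rows = table.find_all('tr')            # rows[0] is the header row
required_row = rows[1]                 # second row, i.e. the first data row
columns = required_row.find_all('td')
required_column = columns[2]           # third column
required_text = required_column.text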

Or, more concisely:

required_text = table.find_all('tr')[1].find_all('td')[2].text
print(required_text)
# {underReplicatedBlocks}
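
If you would rather not hard-code the column position, one possible variation is to find the column by its header text and then read that cell from every data row. The sketch below assumes the page has already been rendered, so the {…} placeholders have been replaced by real values; the node names and numbers in the sample markup are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered markup; the node names and values are made up.
html = """
<table class="table">
  <thead>
    <tr>
      <th>Node</th>
      <th>Last contact</th>
      <th>Under replicated blocks</th>
    </tr>
  </thead>
  <tr>
    <td>node1 (10.0.0.1:50010)</td>
    <td>2</td>
    <td>17</td>
  </tr>
  <tr>
    <td>node2 (10.0.0.2:50010)</td>
    <td>0</td>
    <td>4</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="table")

# Map the header text to a column index instead of hard-coding it.
headers = [th.get_text(" ", strip=True) for th in table.find_all("th")]
col = headers.index("Under replicated blocks")

values = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if cells:  # the header row contains only th cells, so it is skipped
        values.append(cells[col].get_text(strip=True))

print(values)
# ['17', '4']
```

The header match is exact and case-sensitive, so "Under replicated blocks" is not confused with the similarly named "Under Replicated Blocks In files under construction" column in the original table.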
