Grabbing all row data with Beautiful Soup

Time: 2017-07-11 14:25:26

Tags: python beautifulsoup

I'm relatively new to BS4.

I have the following HTML (truncated for brevity, with pseudo URLs):

    <tbody>
      <tr>
        <th>Part1</th>
        <td>
          <a href="http://somewebpage.com">87</a>
        </td>
        <td>
          <a href="http://somewebpage.com">7</a>
        </td>
        <th>Part2</th>
        <td>
          <a href="http://somewebpage.com">68</a>
        </td>........

Using the following:

    from pprint import pprint
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page['content'], "html.parser")
    table = soup.find("table")
    table_data = [[cell.text for cell in row("td")]
                  for row in table("tr")]
    pprint(table_data)

table_data looks like this:

    [[],
     [u'87', u'7'],
     [u'68'],

How can I get 'Part1' and 'Part2' to appear in the same lists as their row values?

Sorry for the bother ;-)

Expected output:

    [[],
     [u'Part1', u'87', u'7'],
     [u'Part2', u'68'],

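2 answers:

Answer 0: (score: 0)

Your table is not structured correctly. Follow this format to build the table properly: https://www.w3schools.com/tags/tag_thead.asp

Imagine your table were structured like this: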

    from bs4 import BeautifulSoup

    content = """<table>
     <thead>
      <tr>
         <th>Month</th>
         <th>Savings</th>
      </tr>
     </thead>
     <tfoot>
      <tr>
         <td>Sum</td>
         <td>$180</td>
      </tr>
     </tfoot>
     <tbody>
      <tr>
         <td>January</td>
         <td>$100</td>
      </tr>
      <tr>
         <td>February</td>
         <td>$80</td>
      </tr>
     </tbody>
    </table>"""

    soup = BeautifulSoup(content, "html.parser")
    table = soup.find("table")

    print([header.text for header in table.find("thead").find_all("th")])

    for row in table.find("tbody").find_all("tr"):
        print([data.text for data in row.find_all("td")])

    print([footer.text for footer in table.find("tfoot").find_all("td")])

Output:

    ['Month', 'Savings']
    ['January', '$100']
    ['February', '$80']
    ['Sum', '$180']
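
If the page's markup cannot be changed, another option may be to collect the `th` cells together with the `td` cells of each row, since `find_all` accepts a list of tag names. The sketch below is only an illustration: the `html` string is a hypothetical, well-formed stand-in for the truncated HTML in the question, and it assumes each `th` starts its own `tr`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the question's truncated HTML.
# Assumption: every <th> label begins a new, properly closed <tr>.
html = """
<table>
  <tbody>
    <tr>
      <th>Part1</th>
      <td><a href="http://somewebpage.com">87</a></td>
      <td><a href="http://somewebpage.com">7</a></td>
    </tr>
    <tr>
      <th>Part2</th>
      <td><a href="http://somewebpage.com">68</a></td>
    </tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Passing a list of names matches both tags; cells come back in document order,
# so the <th> label lands in front of the row's <td> values.
table_data = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]

print(table_data)  # [['Part1', '87', '7'], ['Part2', '68']]
```

Note that a header row made only of `th` cells would no longer come back as an empty list with this approach; it would contain the header texts instead.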

Answer 1: (score: 0)

If the values in your "table_data looks like this:" section are the ones you want, and you only want to "flatten" the list, try:

    # Python identifiers cannot start with a digit, hence list_2d rather than 2d_list.
    list_2d = [[], [u'87', u'7'], [u'68']]

    list_1d = [x for y in list_2d for x in y]

Result: [u'87', u'7', u'68']
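
As an alternative sketch of the same flattening step, the standard library's `itertools.chain.from_iterable` can be used instead of a nested comprehension. The names `nested` and `flat` are hypothetical and chosen only for this example.

```python
from itertools import chain

# Hypothetical sample data mirroring the shape of table_data in the question.
nested = [[], ['87', '7'], ['68']]

# chain.from_iterable lazily concatenates the inner lists; list() materialises the result.
flat = list(chain.from_iterable(nested))

print(flat)  # ['87', '7', '68']
```

Both forms flatten exactly one level of nesting, which is all that is needed here.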