I'm relatively new to BS4.
I have the following HTML (truncated for brevity, with pseudo URLs):
<tbody>
<tr>
<th>Part1</th>
<td>
<a href="http://somewebpage.com">87</a>
</td>
<td>
<a href="http://somewebpage.com">7</a>
</td>
<th>Part2</th>
<td>
<a href="http://somewebpage.com">68</a>
</td>........
Using the following:
`from pprint import pprint
from bs4 import BeautifulSoup
soup = BeautifulSoup(page['content'], "html.parser")
table = soup.find("table")
table_data = [[cell.text for cell in row("td")]
              for row in table("tr")]
pprint(table_data)`
table_data looks like this:
[[],
[u'87', u'7'],
[u'68'],
How can I get 'Part1' and 'Part2' to appear in the same lists as their values?
Sorry for the bother ;-)
Expected output:
[[],
[u'Part1',u'87', u'7'],
[u'Part2', u'68'],
Answer 0 (score: 0)
Your table is not structured correctly. Build it following this format: https://www.w3schools.com/tags/tag_thead.asp
Imagine your table were structured like this:
content = """<table>
<thead>
<tr>
<th>Month</th>
<th>Savings</th>
</tr>
</thead>
<tfoot>
<tr>
<td>Sum</td>
<td>$180</td>
</tr>
</tfoot>
<tbody>
<tr>
<td>January</td>
<td>$100</td>
</tr>
<tr>
<td>February</td>
<td>$80</td>
</tr>
</tbody>
</table>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, "html.parser")
table = soup.find("table")
print([header.text for header in soup.find("table").find("thead").find_all("th")])
for row in soup.find("table").find("tbody").find_all("tr"):
    print([data.text for data in row.find_all("td")])
print([footer.text for footer in soup.find("table").find("tfoot").find_all("td")])
Output:
['Month', 'Savings']
['January', '$100']
['February', '$80']
['Sum', '$180']
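If the markup cannot be changed, the row labels can probably be collected directly. Here is a minimal sketch, assuming each `<tr>` holds one `<th>` label followed by its `<td>` cells (the truncated HTML in the question does not show the row boundaries, so the sample string `html` below is hypothetical). `find_all` accepts a list of tag names and returns matches in document order:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: assumes each <tr> holds one <th> label followed by its <td> cells.
html = """
<table>
<tbody>
<tr>
<th>Part1</th>
<td><a href="http://somewebpage.com">87</a></td>
<td><a href="http://somewebpage.com">7</a></td>
</tr>
<tr>
<th>Part2</th>
<td><a href="http://somewebpage.com">68</a></td>
</tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Passing a list of tag names matches <th> and <td> alike, in document order.
table_data = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in table.find_all("tr")
]
print(table_data)  # [['Part1', '87', '7'], ['Part2', '68']]
```

Note that a header row made up only of `<th>` cells would then show its labels rather than an empty list.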
Answer 1 (score: 0)
If the values under 'table_data looks like this:' are what you want and you only want to "flatten" the list, try:
nested_list = [[], [u'87', u'7'], [u'68']]
flat_list = [x for y in nested_list for x in y]
Result: [u'87', u'7', u'68']
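The same flattening can also be sketched with `itertools.chain.from_iterable` from the standard library, which some find easier to read than a nested comprehension. The names `nested` and `flat` are illustrative only:

```python
from itertools import chain

# Sample data mirroring the table_data shown in the question.
nested = [[], ["87", "7"], ["68"]]

# chain.from_iterable walks each inner list in turn; empty lists contribute nothing.
flat = list(chain.from_iterable(nested))
print(flat)  # ['87', '7', '68']
```

Either way, flattening discards the row boundaries, so it only fits if a single list of values is really what is wanted.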