Here is the HTML I am trying to scrape:
<div class="data-point-container section-break">
# some other HTML div classes here which I don't need
<table class data-bind="showHidden: isData">
<!-- ko foreach : sections -->
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<thead>...</thead>
<tbody>...</tbody>
<!-- /ko -->
</table>
</div>
How can I use pandas.read_html to scrape all of this information, with each thead as the header and each tbody as the values?
Edit
This is the website I am trying to scrape, extracting the data into a pandas DataFrame. Link here
Answer 0 (score: 1)
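Strictly speaking, according to the table element specification, a table should not have more than one thead element.
If you nevertheless have this structure, where each thead is followed by its corresponding tbody, I would parse it iteratively, so that each such pair goes into its own DataFrame.
Working example: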
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">
        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>
        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>
    </table>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Pair every thead with the tbody that follows it.
for thead in soup.select(".data-point-container table thead"):
    tbody = thead.find_next_sibling("tbody")
    # Wrap the pair in its own <table> so read_html sees a single table.
    table = "<table>%s</table>" % (str(thead) + str(tbody))
    # Recent pandas versions deprecate passing literal HTML strings, hence StringIO.
    df = pd.read_html(StringIO(table))[0]
    print(df)
    print("-----")
This prints 2 DataFrames, one for each thead & tbody pair in the sample input HTML:
     Customer  Order    Month
0  Customer 1     #1  January
1  Customer 2     #2    April
2  Customer 3     #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----
Note that, for demonstration purposes, I deliberately made the number of header and data cells differ between the blocks.
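If you would rather keep the resulting DataFrames than print them, the same idea can be wrapped in a small helper. This is only a sketch: the function name read_thead_tbody_pairs and its table_selector parameter are made up for illustration, and it assumes each thead is directly followed by the tbody that belongs to it. A thead with no following tbody is skipped.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup


def read_thead_tbody_pairs(html, table_selector="table"):
    """Return a list with one DataFrame per thead/tbody pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(table_selector + " thead"):
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            # No body follows this header, so there is nothing to load.
            continue
        # Rebuild a standalone table from this single pair.
        fragment = "<table>%s</table>" % (str(thead) + str(tbody))
        frames.append(pd.read_html(StringIO(fragment))[0])
    return frames


sample = """
<div class="data-point-container section-break">
  <table>
    <thead><tr><th>Customer</th><th>Order</th><th>Month</th></tr></thead>
    <tbody>
      <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
      <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
      <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
    </tbody>
    <thead><tr><th>Customer</th></tr></thead>
    <tbody>
      <tr><td>Customer 4</td></tr>
      <tr><td>Customer 5</td></tr>
      <tr><td>Customer 6</td></tr>
    </tbody>
  </table>
</div>
"""

frames = read_thead_tbody_pairs(sample, ".data-point-container table")
for frame in frames:
    print(frame.shape)
```

Returning a list keeps blocks with different column sets apart. If all blocks on the real page happen to share the same columns, they could be combined afterwards with pd.concat(frames, ignore_index=True).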