python selenium scraping tbody

Time: 2016-07-17 02:50:32

Tags: python selenium pandas

Below is the HTML I am trying to scrape:

<div class="data-point-container section-break">
    <!-- some other HTML divs here which I don't need -->
    <table class data-bind="showHidden: isData">
          <!-- ko foreach : sections -->
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
        <thead>...</thead>
        <tbody>...</tbody>
          <!-- /ko -->
    </table>
</div>

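How can I use pandas.read_html to scrape all of this, with each thead as the header and each tbody as the values?

Edit

This is the website I am trying to scrape, extracting the data into a Pandas DataFrame. Link here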

1 Answer:

Answer 0: (score: 1)

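Strictly speaking, according to the table element specification, a table should not have more than one thead element.

If you nevertheless have this structure, where each thead is followed by its corresponding tbody, I would parse it iteratively, with each such pair going into its own DataFrame.

Working example: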

import pandas as pd
from bs4 import BeautifulSoup

data = """
<div class="data-point-container section-break">
    <table class data-bind="showHidden: isData">

        <thead>
            <tr><th>Customer</th><th>Order</th><th>Month</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 1</td><td>#1</td><td>January</td></tr>
            <tr><td>Customer 2</td><td>#2</td><td>April</td></tr>
            <tr><td>Customer 3</td><td>#3</td><td>March</td></tr>
        </tbody>

        <thead>
            <tr><th>Customer</th></tr>
        </thead>
        <tbody>
            <tr><td>Customer 4</td></tr>
            <tr><td>Customer 5</td></tr>
            <tr><td>Customer 6</td></tr>
        </tbody>

    </table>
</div>
"""

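soup = BeautifulSoup(data, "html.parser")

# Visit every thead inside the target table, in document order
for thead in soup.select(".data-point-container table thead"):
    # The matching body is the next tbody sibling after this thead
    tbody = thead.find_next_sibling("tbody")

    # Wrap the pair in its own table so read_html sees a single, valid table
    table = "<table>%s</table>" % (str(thead) + str(tbody))

    # read_html returns a list of DataFrames; this snippet contains exactly one
    df = pd.read_html(table)[0]
    print(df)
    print("-----")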

This prints 2 DataFrames, one for each thead & tbody pair in the sample input HTML:

     Customer Order    Month
0  Customer 1    #1  January
1  Customer 2    #2    April
2  Customer 3    #3    March
-----
     Customer
0  Customer 4
1  Customer 5
2  Customer 6
-----

Note that, for demonstration purposes, I intentionally made the number of header and data cells differ between the two blocks.
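
If the per-section frames need to be kept rather than printed, one option is to collect them in a list. The following is only a sketch under a few assumptions, not the method from the answer above: the helper name `sections_to_frames` and its `table_selector` parameter are hypothetical, and it builds each DataFrame from the cell text directly instead of calling `pd.read_html`. It assumes one header row per thead, no colspan/rowspan, and that every data row has as many cells as its header.

```python
import pandas as pd
from bs4 import BeautifulSoup


def sections_to_frames(html, table_selector="table"):
    """Return one DataFrame per thead/tbody pair (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for thead in soup.select(f"{table_selector} thead"):
        # Pair each thead with the next tbody sibling; skip a thead with none.
        # Caveat: a thead without its own tbody would pick up a later one.
        tbody = thead.find_next_sibling("tbody")
        if tbody is None:
            continue
        # Header cells become column names
        columns = [th.get_text(strip=True) for th in thead.find_all("th")]
        # Each tr in the body becomes one row of cell text
        rows = [
            [td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in tbody.find_all("tr")
        ]
        frames.append(pd.DataFrame(rows, columns=columns))
    return frames


# Small made-up sample with the same repeated thead/tbody layout
sample = """
<table>
  <thead><tr><th>Customer</th><th>Order</th></tr></thead>
  <tbody><tr><td>Customer 1</td><td>#1</td></tr></tbody>
  <thead><tr><th>Customer</th></tr></thead>
  <tbody><tr><td>Customer 4</td></tr><tr><td>Customer 5</td></tr></tbody>
</table>
"""

frames = sections_to_frames(sample)
for frame in frames:
    print(frame)
    print("-----")
```

When the page is rendered by Selenium, the `html` argument would presumably be `driver.page_source`, read after the table has finished loading. Unlike `pd.read_html`, this variant leaves every value as a string and does not infer numeric types.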