Question

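I'm a regular user of tabula-py, but this time I've run into a case I don't know how to handle. The table I want to import (from a PDF into a df) has a structure similar to the one below: https://jsfiddle.net/acunw6hm/2/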

    th, td {
        font-weight:normal;
        padding:5px;
        width:120px;
        vertical-align:top;
    }
    th {
        background:#00B0F0;
    }
    tr + tr th, tbody th {
        background:#DAEEF3;
    }
    tr + tr, tbody {
        text-align:left
    }
    table, th, td {
        border:solid 1px;
        border-collapse:collapse;
        table-layout:fixed;
    }

 

    <table>
        <thead>
            <tr>
                <th colspan="2">Main Header A </th>
                <th colspan="2">Main Header B</th>
                <th>Main Header C</th>
            </tr>
            <tr>
                <th>sub header A1</th>
                <th>sub header A2 </th>
                <th>sub header B1</th>
                <th>sub header B2</th>
                <th>sub header C</th>
     
            </tr>
        </thead>
        <tbody>
            <tr>
                <th>...</th>
                <th>...</th>
                <th>...</th>
                <th>...</th>
                <th>...</th>
            </tr>
        </tbody>
    </table>

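Basically, the first row of the header holds main headers, each spanning several sub-headers (2 in this case). I have tried both guess=False and guess=True, and Tabula messes things up in a different way each time. With guess=False, it actually merges sub header A1 and sub header A2 into a single column (as it also does with guess=True).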
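For illustration, here is a minimal sketch of one possible post-processing step, not a confirmed fix for the merged-column problem. It assumes the two header rows have already been extracted as plain rows with the columns split correctly, and that a spanning main header appears only in its first column with the remaining spanned cells empty. The function name `flatten_nested_headers` and the ` / ` separator are hypothetical choices made for this example.

```python
def _clean(cell):
    # Treat anything that is not a string (None, NaN, numbers) as an empty cell.
    return cell.strip() if isinstance(cell, str) else ""


def flatten_nested_headers(main_row, sub_row, sep=" / "):
    """Combine a main-header row and a sub-header row into one label per column."""
    labels = []
    current_main = ""
    for main, sub in zip(main_row, sub_row):
        main, sub = _clean(main), _clean(sub)
        # A spanning main header is assumed to appear only in its first column,
        # so carry the last seen value forward over the empty cells.
        if main:
            current_main = main
        labels.append(f"{current_main}{sep}{sub}" if sub else current_main)
    return labels


# Header rows shaped like the example table above.
main_row = ["Main Header A", "", "Main Header B", "", "Main Header C"]
sub_row = ["sub header A1", "sub header A2", "sub header B1", "sub header B2", "sub header C"]

columns = flatten_nested_headers(main_row, sub_row)
print(columns)
```

With tabula-py, such rows might be obtained by passing `pandas_options={"header": None}` to `tabula.read_pdf`, so that the header lines come back as ordinary data rows. This only helps once the sub-header columns are actually separated, which may require supplying explicit column boundaries through the `columns` option.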

Any ideas or configuration options for handling this case? Thanks! 亚历

Extracting a table with nested headers from a PDF using Tabula

0 answers: