How to skip the rows above a table in Python Pandas

Time: 2018-09-09 19:27:24

Tags: python pandas

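I am using Pandas read_html to read tables from several HTML files, and Pandas' ExcelWriter to stack them on top of one another in a single Excel file.

The problem is that each file has 14 rows of junk data above the table, which I want to remove. I found threads suggesting skiprows; it removes the data above the table, but it also removes the first 14 rows of the table itself.

  • Does anyone have suggestions for how to get rid of the rows above the table without losing any rows in the table?
  • Also, I have used index_col = 0 to get rid of the index along the rows, but I can't find the syntax to get rid of the index along the columns.

Any help or advice would be greatly appreciated.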
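On the second bullet: when a table has no header cells, pandas labels its columns 0, 1, 2, and so on. The following is a minimal sketch of two ways to deal with that, using an invented DataFrame rather than the real export (the column names and values are assumptions): promote the first row to column names, or leave the labels out when writing.

```python
import pandas as pd

# Hypothetical stand-in for a parsed table whose first row holds the real headers.
raw = pd.DataFrame([
    ["Warehouse", "Item Number", "Forecast"],
    ["01", "1001", "12"],
    ["02", "1002", "7"],
])

# Promote the first row to column names and drop it from the data.
table = raw.iloc[1:].reset_index(drop=True)
table.columns = raw.iloc[0].tolist()

# Omit both the row labels and the column labels when serialising.
text = table.to_csv(index=False, header=False)
```

DataFrame.to_excel accepts the same index=False and header=False arguments, and passing header=0 to read_html should do the promotion at parse time instead.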

Here is my read_html call:

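import os
import pandas as pd

df_list = []
for i in os.listdir(dl):
    if "Export" in i:
        # os.listdir returns bare file names, so join them with the directory
        for df in pd.read_html(os.path.join(dl, i), skiprows=14, index_col=0):
            df_list.append(df)
dfs = pd.concat(df_list)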
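The symptom described above appears to follow from how read_html treats skiprows: the value is applied to each table found in the file, not to the document as a whole. The sketch below reproduces this on a made-up two-table document (the HTML is invented for illustration, and a parser such as lxml is assumed to be installed).

```python
from io import StringIO

import pandas as pd

# Made-up miniature of the export: a 2-row junk table followed by a 4-row data table.
html = """
<table>
  <tr><td>Report title</td></tr>
  <tr><td>Generation Date: 2018-08-30</td></tr>
</table>
<table>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
  <tr><td>c</td><td>3</td></tr>
  <tr><td>d</td><td>4</td></tr>
</table>
"""

# Without skiprows, both tables come back in document order.
all_tables = pd.read_html(StringIO(html))

# With skiprows=2 the junk table ends up empty and is left out,
# but the data table also loses its first two rows.
skipped = pd.read_html(StringIO(html), skiprows=2)
```

Here all_tables holds two frames, while skipped holds only the data table, shortened to the rows starting at "c".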

This is the layout of my files, with several rows of junk data and the table below them:

================================================ ===========

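GPF Purchase Order Forecasts

Generation Date: 2018-08-30
Order Date: 2018-09-08
Delivery Date Reg

Vendor No.: ALL

Warehouse: ALL

================================================ ===========

Warehouse Item Number Item Description UPC Number Pack Size Forecast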

XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX

The first 100 lines of the HTML file:

<!-- For export to excel style needs to be written on the page-->

<style type="text/css">

    .Header

    {

        font-weight: bold;

    }

    .HeadUnderline

    {

        font-weight: bold;

        text-decoration: underline;

    }

</style>

</head>

<body id="portal">

<form name="frmMain" method="post" action="Export.aspx?DcNbr=0&amp;VendorNbr=0&amp;OrdDate=2018-09-01&amp;GenDate=2018-08-30&amp;DivNbr=0&amp;DelDate=0000-00-00" id="frmMain">

<div>

<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTg0NDMyMzg5OGQYAQUJZ3ZSZXN1bHRzDzwrAAwBCAIBZC77FhJcYYUB/Yk3jdfFNSAWWS9MSP5BghZFEKqOFLXh" />

         
<!-- c1under - to use this page as a popup window without the header change the id from rlHeader 

    to rlStyle. The rlFooter literal could be removed if you do not want the footer on the popup window.

     -->

<div id="main-content-area" style="vertical-align: top;">

    <table width="100%" border="0" bordercolor="#FFCC00" cellpadding="0" cellspacing="0" align="center" style="vertical-align: top">

        <tr style="vertical-align: top" align="center">

            <td style="vertical-align: top; border: solid 2 black;" align="center" colspan="8">

                <span id="lblAppTitle" class="HeadUnderline">GPF Purchase Order Forecasts</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align='right' colspan="8">

                <span id="lblGenDate" class="Header">Generation Date:</span>&nbsp;

                <span id="lblGenDateValue">2018-08-30</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                <span id="lblOrderDate" class="Header">Order Date:</span>&nbsp;

                <span id="lblOrderDateValue">2018-09-01</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                <span id="lblDeliveryDate" class="Header">Delivery Date</span>&nbsp;

                <span id="lblDeliveryDateValue">0000-00-00</span>

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align="right" colspan="7">

                <span id="lblVendorNumber" class="Header">Vendor No.:</span>&nbsp;

            </td>

            <td align="left">

                <span id="lblVendorNumberValue">ALL</span>

            </td>

        </tr>

        <tr>

            <td id="vendorAddress" align="right"></td>



            <td colspan="7">

            </td>

        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

        <tr style="height: 27px">

            <td align='right' colspan="7">

                <span id="lblWarehouse" class="Header">Warehouse:</span>&nbsp;

            </td>

            <td align="left">

                <span id="lblWarehouseValue">ALL</span>

            </td>

        </tr>

        <tr>

            <td id="depotAddress" align="left" colspan="8"></td>



        </tr>

        <tr>

            <td colspan="8">

                &nbsp;

            </td>

        </tr>

    </table>

    <table cellspacing="0" cellpadding="0" border="0">

1 answer:

Answer 0 (score: 0)

Try this:

import os
import pandas as pd

df_list = []
for i in os.listdir(dl):
    if "Export" in i:
        # Read all HTML tables in the file into a list of DataFrames
        data = pd.read_html(os.path.join(dl, i), index_col=0)

        # Drop the first table, which holds the junk header data
        data = data[1:]

        # Add the remaining tables to the running list
        df_list += data

dfs = pd.concat(df_list)
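To see the idea in isolation, here is a small self-contained sketch that runs the same "parse everything, drop the first table" step on an invented document, next to an alternative that selects the data table with the match argument of read_html. The HTML and the "Item Number" header are assumptions, since the question does not show the real data table, and a parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up miniature of one export file; the real column names are not shown in the question.
html = """
<table>
  <tr><td>GPF Purchase Order Forecasts</td></tr>
  <tr><td>Warehouse: ALL</td></tr>
</table>
<table>
  <tr><th>Item Number</th><th>Forecast</th></tr>
  <tr><td>1001</td><td>12</td></tr>
  <tr><td>1002</td><td>7</td></tr>
</table>
"""

# Approach from the answer: parse every table, then discard the first (junk) one.
tables = pd.read_html(StringIO(html))
data_tables = tables[1:]

# Alternative: keep only tables whose text matches a known header label.
matched = pd.read_html(StringIO(html), match="Item Number")

# Stack the remaining tables into one frame with a fresh row index.
dfs = pd.concat(data_tables, ignore_index=True)
```

Slicing by position relies on the junk table always coming first, whereas match relies on a label that only the data table contains; which is safer depends on the files. Note also that index_col=0 does not remove the row index but turns the first column into it, so passing index=False when writing to Excel may be closer to what the question asks for.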