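I am using Pandas read_html to read tables from multiple HTML files, and Pandas ExcelWriter to stack them on top of one another in a single Excel file.
The problem I am having is that each file has 14 rows of junk data above the table, which I want to remove; I found a thread suggesting skiprows, which removes the data above the table but also removes the first 14 rows of the table itself.
Any help or advice would be greatly appreciated.
Here is my read_html call: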
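import os
import pandas as pd

df_list = []
for i in os.listdir(dl):  # dl is the directory that holds the exported files
    if "Export" in i:
        # skiprows is applied to every table that read_html finds in the file
        for df in pd.read_html(os.path.join(dl, i), skiprows=14, index_col=0):
            df_list.append(df)
dfs = pd.concat(df_list)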
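Here is the format of my files, with several rows of junk data and the table below them:
===========================================================
GPF Purchase Order Forecasts
Generation Date: 2018-08-30
Order Date: 2018-09-08
Delivery Date Registration
Vendor No.: ALL
Warehouse: ALL
===========================================================
Warehouse  Item No.  Item Description  UPC No.  Pack Size  Forecast
XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX XXXX
The first 100 lines of the HTML file: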
<!-- For export to excel style needs to be written on the page-->
<style type="text/css">
.Header
{
font-weight: bold;
}
.HeadUnderline
{
font-weight: bold;
text-decoration: underline;
}
</style>
</head>
<body id="portal">
<form name="frmMain" method="post" action="Export.aspx?DcNbr=0&VendorNbr=0&OrdDate=2018-09-01&GenDate=2018-08-30&DivNbr=0&DelDate=0000-00-00" id="frmMain">
<div>
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTg0NDMyMzg5OGQYAQUJZ3ZSZXN1bHRzDzwrAAwBCAIBZC77FhJcYYUB/Yk3jdfFNSAWWS9MSP5BghZFEKqOFLXh" />
<!-- c1under - to use this page as a popup window without the header change the id from rlHeader
to rlStyle. The rlFooter literal could be removed if you do not want the footer on the popup window.
-->
<div id="main-content-area" style="vertical-align: top;">
<table width="100%" border="0" bordercolor="#FFCC00" cellpadding="0" cellspacing="0" align="center" style="vertical-align: top">
<tr style="vertical-align: top" align="center">
<td style="vertical-align: top; border: solid 2 black;" align="center" colspan="8">
<span id="lblAppTitle" class="HeadUnderline">GPF Purchase Order Forecasts</span>
</td>
</tr>
<tr>
<td colspan="8">
</td>
</tr>
<tr style="height: 27px">
<td align='right' colspan="8">
<span id="lblGenDate" class="Header">Generation Date:</span>
<span id="lblGenDateValue">2018-08-30</span>
</td>
</tr>
<tr>
<td colspan="8">
<span id="lblOrderDate" class="Header">Order Date:</span>
<span id="lblOrderDateValue">2018-09-01</span>
</td>
</tr>
<tr>
<td colspan="8">
<span id="lblDeliveryDate" class="Header">Delivery Date</span>
<span id="lblDeliveryDateValue">0000-00-00</span>
</td>
</tr>
<tr>
<td colspan="8">
</td>
</tr>
<tr style="height: 27px">
<td align="right" colspan="7">
<span id="lblVendorNumber" class="Header">Vendor No.:</span>
</td>
<td align="left">
<span id="lblVendorNumberValue">ALL</span>
</td>
</tr>
<tr>
<td id="vendorAddress" align="right"></td>
<td colspan="7">
</td>
</tr>
<tr>
<td colspan="8">
</td>
</tr>
<tr style="height: 27px">
<td align='right' colspan="7">
<span id="lblWarehouse" class="Header">Warehouse:</span>
</td>
<td align="left">
<span id="lblWarehouseValue">ALL</span>
</td>
</tr>
<tr>
<td id="depotAddress" align="left" colspan="8"></td>
</tr>
<tr>
<td colspan="8">
</td>
</tr>
</table>
<table cellspacing="0" cellpadding="0" border="0">
Answer 0 (score: 0)
Try this:
for i in os.listdir(dl):
    if "Export" in i:
        # Read all HTML tables in the file into a list of DataFrames
        data = pd.read_html(os.path.join(dl, i), index_col=0)
        # Drop the first table, which contains the junk header rows
        data = data[1:]
        # Append the remaining tables to the existing list of DataFrames
        df_list += data
dfs = pd.concat(df_list)
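To see why this works: read_html returns one DataFrame per table element, and as far as I can tell skiprows is applied to each of those tables separately, which is why it also eats into the data table. The junk rows sit in their own table in the HTML above, so dropping that table is enough. Below is a minimal sketch on a made-up miniature of the export file. SAMPLE_HTML, read_data_tables, junk_tables and the column names "Item No." and "Forecast" are all assumptions for illustration, not taken from the real files, and read_html needs lxml (or BeautifulSoup with html5lib) installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature of one export file: a layout table that holds the
# report header, followed by the table that holds the actual data.
SAMPLE_HTML = """
<table>
  <tr><td colspan="2">GPF Purchase Order Forecasts</td></tr>
  <tr><td>Generation Date:</td><td>2018-08-30</td></tr>
</table>
<table>
  <tr><th>Item No.</th><th>Forecast</th></tr>
  <tr><td>1001</td><td>5</td></tr>
  <tr><td>1002</td><td>7</td></tr>
  <tr><td>1003</td><td>9</td></tr>
</table>
"""


def read_data_tables(html, junk_tables=1):
    """Parse every table in the document and drop the leading junk tables."""
    tables = pd.read_html(StringIO(html), index_col=0)
    return tables[junk_tables:]


# Collect the data tables of several "files" and stack them, as in the answer.
frames = []
for html in (SAMPLE_HTML, SAMPLE_HTML):
    frames += read_data_tables(html)
combined = pd.concat(frames)

# Alternative: keep only the tables whose text matches a pattern, so the
# position of the junk table does not matter. "Item No" is an assumed header.
matched = pd.read_html(StringIO(SAMPLE_HTML), match="Item No", index_col=0)
```

Slicing with data[1:] relies on the junk table always coming first; the match variant relies on knowing some text that only appears in the data table. Which one is safer depends on how stable the export layout is.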