I have the following HTML:
</tr><tr>
<td>
<span id="Grid_exdate_43">2/15/2005</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_43">0.08</span>
</td><td>
<span id="Grid_DeclDate_43">--</span>
</td><td>
<span id="Grid_RecDate_43">2/17/2005</span>
</td><td>
<span id="Grid_PayDate_43">3/10/2005</span>
</td>
</tr><tr>
<td>
<span id="Grid_exdate_44">11/15/2004</span>
</td><td>Cash</td><td>
<span id="Grid_CashAmount_44">3.08</span>
</td><td>
<span id="Grid_DeclDate_44">--</span>
</td><td>
<span id="Grid_RecDate_44">11/17/2004</span>
</td><td>
<span id="Grid_PayDate_44">12/2/2004</span>
</td>
</tr><tr>
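Each section has the same 5 items: Grid_exdate, Grid_CashAmount, Grid_DeclDate, Grid_RecDate and Grid_PayDate. Every id in a section ends with an integer that increments from one section to the next. In the example above, we have sections 43 and 44.
I need to be able to save each section as one row in a pandas DataFrame. The DataFrame should look like this: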
Grid_exdate Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
2/15/2005 0.08 -- 2/17/2005 3/10/2005
11/15/2004 3.08 -- 11/17/2004 12/2/2004
I'm at a loss as to how to do this.
Edit
OK, I've managed to work out something that should do the job:
def get_exdate(self, id):
    return id and re.compile("Grid_exdate_").search(id)

df = pd.DataFrame()
exdate_list = []
for link in soup.find_all(id=self.get_exdate):
    exdate_list.append(link.string)
df['Grid_exdate'] = exdate_list
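So the code above uses a regular expression to find all the Grid_exdate_ values, appends the results to a list, and then adds that list to the DataFrame as a column.
I'll just create 5 of these, one for each field. If anyone has a better solution, please let me know (this is probably not a very efficient approach). Otherwise this should do the trick.
Answer 0 (score: 1)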
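You can use pandas read_html. From the docs:
This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for "table data".
So before using your file, you need to wrap it in a <table> tag:
<table>
your html
</table>
Then take the first element, because read_html reads the tables in the HTML into a list:
df = pd.read_html('file.html')
In [444]: df[0]
Out[444]:
0 1 2 3 4 5
0 2/15/2005 Cash 0.08 -- 2/17/2005 3/10/2005
1 11/15/2004 Cash 3.08 -- 11/17/2004 12/2/2004
Edit
If you want to rename the columns:
df1 = df[0]
df1.columns = ["Grid_exdate", "Cash", "Grid_CashAmount", "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]
You will have a 'Cash' column because it is a separate table cell in your HTML. You can then drop the 'Cash' column or edit the original table.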
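Dropping the 'Cash' column mentioned above might look like the following minimal sketch. The frame is built by hand here as a stand-in for df[0] from read_html (an assumption, so that the snippet runs without an HTML file), and DataFrame.drop is used to remove the column.

```python
import pandas as pd

# Hand-built stand-in for df[0] as returned by read_html above (assumption).
df1 = pd.DataFrame([
    ["2/15/2005", "Cash", 0.08, "--", "2/17/2005", "3/10/2005"],
    ["11/15/2004", "Cash", 3.08, "--", "11/17/2004", "12/2/2004"],
])
df1.columns = ["Grid_exdate", "Cash", "Grid_CashAmount",
               "Grid_DeclDate", "Grid_RecDate", "Grid_PayDate"]

# Remove the constant 'Cash' column, leaving the five Grid_* fields.
df1 = df1.drop(columns="Cash")
print(df1)
```

The remaining columns then match the five fields asked for in the question.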
Answer 1 (score: 1)
If you don't want to use pandas read_html, you can parse the HTML yourself, which is a bit more involved:
import pandas as pd
from bs4 import BeautifulSoup
table = BeautifulSoup(open('test.html', 'r').read(), 'html.parser')
#generate header from first tr
h = [[td.span.get('id') for td in row.select('td') if td.span is not None]
     for row in table.find_all('tr')]
#remove empty lists
h = [x for x in h if x != []]
header = h[0]
print(header)
['Grid_exdate_43', 'Grid_CashAmount_43', 'Grid_DeclDate_43', 'Grid_RecDate_43', 'Grid_PayDate_43']
#if generating header is problematic, you can specify them
#header = ['Grid_exdate', 'Grid_CashAmount', 'Grid_DeclDate', 'Grid_RecDate', 'Grid_PayDate' ]
#get content of table, remove td with text Cash
body = [[td.text.strip() for td in row.select('td') if td.text.strip() != 'Cash']
        for row in table.find_all('tr')]
#remove empty lists
body = [x for x in body if x != []]
#transpose rows into columns
cols = zip(*body)
tbl_d = {name: list(col) for name, col in zip(header, cols)}
df = pd.DataFrame(tbl_d, columns=header)
print(df)
Grid_exdate_43 Grid_CashAmount_43 Grid_DeclDate_43 Grid_RecDate_43 \
0 2/15/2005 0.08 -- 2/17/2005
1 11/15/2004 3.08 -- 11/17/2004
Grid_PayDate_43
0 3/10/2005
1 12/2/2004
#remove last 3 chars of column name
#more rename info:
#http://stackoverflow.com/questions/11346283/renaming-columns-in-pandas
df.rename(columns=lambda x: x[:-3], inplace=True)
#convert columns to datetime columns
df['Grid_exdate'] = pd.to_datetime(df['Grid_exdate'])
df['Grid_RecDate'] = pd.to_datetime(df['Grid_RecDate'])
df['Grid_PayDate'] = pd.to_datetime(df['Grid_PayDate'])
print(df)
Grid_exdate Grid_CashAmount Grid_DeclDate Grid_RecDate Grid_PayDate
0 2005-02-15 0.08 -- 2005-02-17 2005-03-10
1 2004-11-15 3.08 -- 2004-11-17 2004-12-02
Answer 2 (score: 0)
Thanks to everyone for the suggested solutions. In the end I went with the following, which seemed to be the least complicated:
def get_exdate(self, id):
    return id and re.compile("Grid_exdate_").search(id)

df = pd.DataFrame()
exdate_list = []
for link in soup.find_all(id=self.get_exdate):
    exdate_list.append(link.string)
df['Grid_exdate'] = exdate_list
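This uses re.compile to search the HTML/soup for every id beginning with Grid_exdate_, then adds the results to the DataFrame. So I just created one re.compile search for each required field and added them all to the DataFrame under the correct column headers.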
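The per-field searches described above could also be folded into a single pass. The following is only a sketch under some assumptions: the sample HTML is copied from the question, the names FIELDS, ID_PATTERN and sections are hypothetical, and the built-in html.parser is chosen so that no extra parser needs to be installed. One regular expression covers all five fields, and the numeric suffix of each id is used to group the values into rows.

```python
import re
from collections import defaultdict

import pandas as pd
from bs4 import BeautifulSoup

# Sample input copied from the question (sections 43 and 44).
html = """
<tr>
<td><span id="Grid_exdate_43">2/15/2005</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_43">0.08</span></td>
<td><span id="Grid_DeclDate_43">--</span></td>
<td><span id="Grid_RecDate_43">2/17/2005</span></td>
<td><span id="Grid_PayDate_43">3/10/2005</span></td>
</tr><tr>
<td><span id="Grid_exdate_44">11/15/2004</span></td><td>Cash</td>
<td><span id="Grid_CashAmount_44">3.08</span></td>
<td><span id="Grid_DeclDate_44">--</span></td>
<td><span id="Grid_RecDate_44">11/17/2004</span></td>
<td><span id="Grid_PayDate_44">12/2/2004</span></td>
</tr>
"""

# Hypothetical helper names; the field list comes from the question.
FIELDS = ["Grid_exdate", "Grid_CashAmount", "Grid_DeclDate",
          "Grid_RecDate", "Grid_PayDate"]
# Matches ids such as "Grid_exdate_43": group 1 is the field name,
# group 2 is the section number.
ID_PATTERN = re.compile(r"^(%s)_(\d+)$" % "|".join(FIELDS))

soup = BeautifulSoup(html, "html.parser")

# Collect the values per section number so each section becomes one row.
sections = defaultdict(dict)
for span in soup.find_all("span", id=ID_PATTERN):
    field, number = ID_PATTERN.match(span["id"]).groups()
    sections[int(number)][field] = span.get_text(strip=True)

# Build the rows in section order and fix the column order explicitly.
rows = [sections[n] for n in sorted(sections)]
df = pd.DataFrame(rows, columns=FIELDS)
print(df)
```

Grouping by the section number keeps the values of one section together even if a span is missing somewhere; in that case the missing cell is left as NaN instead of shifting the rest of the column.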