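(Update) I am trying to parse some HTML tables and I am having trouble splitting them into rows and columns. I am trying to extract the tables from HTML files such as this one: (http://www.sec.gov/Archives/edgar/data/5094/000095012313004020/h30303def14a.htm)
So I fetch the HTML and then use Beautiful Soup to give me the table: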
soup=BeautifulSoup(table)
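Then I have a function that separates the rows and columns: data=collapsetable(soup)
I use the background color to tell rows apart, but I am not sure how to split rows in tables that have no background color to act as a row separator.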
def collapsetable(soup,combine_rows=True):
    rows=[]
    lastcolor=None
    for tr in soup('tr'):
        try:
            color=tr['bgcolor']
        except:
            color=''
        row=[]
        for td in tr('th')+tr('td'):
            try:
                span=int(td['colspan'])
            except:
                span=1
            try:
                color=td['bgcolor']
            except:
                pass
            datum=''.join([getdeepcontent(t) for t in td.contents])
            row+=[datum]+['']*(span-1)
        # Use Colors to find the row split
        if color==lastcolor and combine_rows:
            for i in range(len(row)):
                if i>=len(rows[-1]):
                    rows[-1].append(row[i])
                else:
                    rows[-1][i]+=' '+row[i]
        else:
            rows.append(row)
        lastcolor=color
    clean_rows(rows)
    return rows
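For example, the HTML table I want from this file is the one under the "Independent Trustees:" heading. With my function I get all the columns, but I don't know where to split the rows.
For example, here is the HTML for one of the tables: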
<table border="0" width="100%" align="center" cellpadding="0" cellspacing="0" style="font-size: 8pt; font-family: 'Times New Roman', Times; color: #000000; background: transparent"><!-- Table Width Row BEGIN --><tr style="font-size: 1pt" valign="bottom"> <td width="25%"> </td> <!-- colindex=01 type=maindata --> <td width="1%"> </td> <!-- colindex=02 type=gutter --> <td width="6%"> </td> <!-- colindex=02 type=maindata --> <td width="2%"> </td> <!-- colindex=03 type=gutter --> <td width="9%"> </td> <!-- colindex=03 type=maindata --> <td width="1%"> </td> <!-- colindex=04 type=gutter --> <td width="23%"> </td> <!-- colindex=04 type=maindata --> <td width="2%"> </td> <!-- colindex=05 type=gutter --> <td width="6%"> </td> <!-- colindex=05 type=maindata --> <td width="2%"> </td> <!-- colindex=06 type=gutter --> <td width="23%"> </td> <!-- colindex=06 type=maindata --></tr><!-- Table Width Row END --><!-- TableOutputHead --><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Number of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Funds in<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> 
</td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Fund<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Position(s)<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Term of Office<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> </td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Complex<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Other Directorships<br /> </b></td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"> <b>Name and Year of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Held with<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>and Length of<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Principal Occupation(s)<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Overseen<br /> </b></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"> <b>Held by Trustee<br /> </b></td></tr><tr style="font-size: 8pt" valign="bottom" align="center"><td nowrap="nowrap" align="left" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Birth of Trustee</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Funds</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div 
style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>Time Served</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>During the Past Five Years</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>by Trustee</b></div></td><td> </td><td nowrap="nowrap" align="center" valign="bottom"><div style="border-bottom: 1px solid #000000; width: 1%; padding-bottom: 1px"> <b>During the Past Five Years</b></div></td></tr><tr style="line-height: 3pt; font-size: 1pt"><td> </td></tr><!-- TableOutputBody --><tr valign="bottom"><td align="left" valign="top"> David C. Arch (1945)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Chairman and Chief Executive Officer of Blistex Inc., a consumer health care products manufacturer. <br /> Formerly: Member of the Heartland Alliance Advisory Board, a nonprofit organization serving human needs based in Chicago.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 136</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Board member of the Illinois Manufacturers’ Association. Member of the Board of Visitors, Institute for the Humanities, University of Michigan.</td></tr><tr valign="bottom" style="line-height: 6pt"><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><tr valign="bottom"><td align="left" valign="top"> Jerry D. Choate (1938)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Retired. 
From 1995 to 1999, Chairman and Chief Executive Officer of the Allstate Corporation (“Allstate”) and Allstate Insurance Company. From 1994 to 1995, President and Chief Executive Officer of Allstate. Prior to 1994, various management positions at Allstate.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 13</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Director since 1998 and member of the governance and nominating committee, executive committee, compensation and management development committee and equity award committee, of Amgen Inc., a biotechnological company. Director since 1999 and member of the nominating and governance committee and compensation and executive committee, of Valero Energy Corporation, a crude oil refining and marketing company. Previously, from 2006 to 2007, Director and member of the compensation committee and audit committee, of H&R Block, a tax preparation services company.</td></tr><tr valign="bottom" style="line-height: 6pt"><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td></tr><tr valign="bottom"><td align="left" valign="top"> Linda Hutton Heagy<sup style="font-size: 85%; vertical-align: top">1</sup> (1948)</td><td> </td><td nowrap="nowrap" align="left" valign="top"> Trustee</td><td> </td><td nowrap="nowrap" align="center" valign="top"> †</td><td> </td><td align="left" valign="top"> Retired. Prior to June 2008, Managing Partner of Heidrick & Struggles, the second largest global executive search firm, and from 2001-2004, Regional Managing Director of U.S. operations at Heidrick & Struggles. Prior to 1997, Managing Partner of Ray & Berndtson, Inc., an executive recruiting firm. Prior to 1995, Executive Vice President of ABN AMRO, N.A., a bank holding company, with oversight for treasury management operations including all non-credit product pricing. 
Prior to 1990, experience includes Executive Vice President of The Exchange National Bank with oversight of treasury management including capital markets operations, Vice President of Northern Trust Company and a trainee at Price Waterhouse.</td><td> </td><td nowrap="nowrap" align="center" valign="top"> 13</td><td> </td><td align="left" valign="top"> Trustee/Managing General Partner of funds in the Fund Complex. Prior to 2010, Trustee on the University of Chicago Medical Center Board, Vice Chair of the Board of the YMCA of Metropolitan Chicago and a member of the Women’s Board of the University of Chicago.</td></tr></table>
Any help is greatly appreciated.
Answer 0 (score: 0)
If you uncomment
print('{}: {}'.format(len(row), row))
in the code below, you will see output like this:
11: ['', '', '', '', '', '', '', '', '', '', '']
11: ['', '', '', '', '', '', '', '', u'Number of', '', '']
11: ['', '', '', '', '', '', '', '', u'Funds in', '', '']
11: ['', '', '', '', '', '', '', '', u'Fund', '', '']
11: ['', '', u'Position(s)', '', u'Term of Office', '', '', '', u'Complex', '', u'Other Directorships']
11: [u'Name and Year of', '', u'Held with', '', u'and Length of', '', u'Principal Occupation(s)', '', u'Overseen', '', u'Held by Trustee']
11: [u'Birth of Trustee', '', u'Funds', '', u'Time Served', '', u'During the Past Five Years', '', u'by Trustee', '', u'During the Past Five Years']
1: ['']
11: [u'David C. Arch (1945)', '', u'Trustee', '', u'\x86', '', u'Chairman and Chief Executive Officer of Blistex Inc., a consumer\n health care products manufacturer.Formerly: Member of the Heartland Alliance Advisory Board, a\n nonprofit organization serving human needs based in Chicago.', '', u'136', '', u'Trustee/Managing General Partner of funds in the Fund Complex.\n Board member of the Illinois Manufacturers\x92 Association.\n Member of the Board of Visitors, Institute for the Humanities,\n University of Michigan.']
11: ['', '', '', '', '', '', '', '', '', '', '']
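This shows that a row of length 1 separates the header rows from the data rows:
1: ['']
So instead of using bgcolor to identify the rows to merge, you can treat a row of length 1 as the signal that all of the preceding rows need to be merged.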
import bs4 as bs
import urllib2  # Python 2 module; only needed for the commented-out download below

def collapse(table):
    result = []
    rows = []  # rows collected since the last one-cell separator row
    for tr in table('tr'):
        row = []
        for td in tr('th') + tr('td'):
            try:
                span = int(td['colspan'])
            except KeyError:
                span = 1
            datum = ''.join(td.stripped_strings)
            # pad with empty cells so a spanning cell fills all of its columns
            row.extend([datum] + [''] * (span - 1))
        if row:
            # print('{}: {}'.format(len(row), row))
            if len(row) > 1:
                # keep only rows that have at least one non-empty cell
                if any(row):
                    rows.append(row)
            else:
                # a one-cell row: merge everything collected so far
                result.extend(combine(rows))
                rows = []
    if rows:
        # rows after the last separator are kept as individual rows
        result.extend(rows)
    return result

def combine(rows):
    # join the rows column by column into a single row
    return [[' '.join(col) for col in zip(*rows)]]

# url = 'http://www.sec.gov/Archives/edgar/data/5094/000095012313004020/h30303def14a.htm'
# soup = bs.BeautifulSoup(urllib2.urlopen(url))

# used for developing/debugging
with open('/tmp/def14a.htm', 'r') as f:
    soup = bs.BeautifulSoup(f.read())

for table in soup.find_all('table'):
    print(collapse(table))
    print('-' * 80)
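To try the merging rule without downloading the filing, here is a minimal sketch that applies the same idea to rows that have already been extracted as lists of strings. The function name `merge_on_separator` and the three-column `raw_rows` sample are hypothetical: the sample is a trimmed-down version of the debug output shown earlier, not the full 11-column table. Unlike `combine` above, this sketch drops empty cells before joining, so the merged header has no leading spaces; that is a small variation, not part of the original answer.

```python
def merge_on_separator(raw_rows):
    """Merge each run of rows that precedes a one-cell separator row.

    raw_rows is a list of rows, each a list of cell strings
    (hypothetical input, shaped like the debug output above).
    """
    result = []
    pending = []  # rows seen since the last separator
    for row in raw_rows:
        if len(row) > 1:
            # Skip spacer rows whose cells are all empty.
            if any(row):
                pending.append(row)
        elif pending:
            # One-cell row: join the pending rows column by column,
            # ignoring empty cells so no stray spaces are produced.
            merged = [' '.join(filter(None, col)) for col in zip(*pending)]
            result.append(merged)
            pending = []
    # Rows after the last separator stay as individual records.
    result.extend(pending)
    return result


# Trimmed-down, hypothetical sample: name, position, number of funds.
raw_rows = [
    ['', '', 'Number of'],
    ['', '', 'Funds in'],
    ['', '', 'Fund'],
    ['', 'Position(s)', 'Complex'],
    ['Name and Year of', 'Held with', 'Overseen'],
    ['Birth of Trustee', 'Funds', 'by Trustee'],
    [''],
    ['David C. Arch (1945)', 'Trustee', '136'],
    ['', '', ''],
    ['Jerry D. Choate (1938)', 'Trustee', '13'],
]

table = merge_on_separator(raw_rows)
for record in table:
    print(record)
```

Note that this rule treats any one-cell row as a separator, so a table that has a genuine one-cell data row would need a stricter test (for example, also checking that the cell is empty).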