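I am scraping data from the Consolidated Balance Sheets table in the SEC filing uoip_10k. The columns are separated by 1 or 2 blank td cells. Is there a way to identify these blank tds?
Currently, this is what I am doing.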
import urllib.request
from bs4 import BeautifulSoup

def check_if_cell_seperator(cell):
    # A separator is an empty cell whose width attribute is below 2%
    if 'width' in str(cell):
        width = int(cell["width"].strip('%').strip())
        if width < 2 and cell.text.strip() == '':
            return True
        else:
            return False
    else:
        return False

def main(url):
    htmlpage = urllib.request.urlopen(url)
    page = BeautifulSoup(htmlpage, "html.parser")
    all_divtables = page.find_all('table')
    # only taking data from the table at index 38
    for i,table in enumerate(all_divtables[38:39]):
        # iterate over the table's rows (tr), then over the cells of each row
        rows = table.find_all('tr',recursive=False)
        table_data = []
        for tr in rows:
            row_data=[]
            cells = tr.find_all('td')
            for cell in cells:
                if check_if_cell_seperator(cell):
                    continue
                else:
                    cell_data = cell.text
                    row_data.append(cell_data.encode('utf-8'))
            table_data.append([x.decode('utf-8').strip() for x in row_data])
        print(table_data)
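The problem is that this does not work for rows whose cells do not specify a width (for example, the header rows).
Is there any way to identify and remove the tds that exist only to separate columns?
I cannot simply remove every blank entry from the final list, because that would affect the indentation.
Example: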
<tr>
<td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt"> </font></td>
<td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold"> </font></td>
<td colspan="2" valign="bottom" style="BORDER-BOTTOM: black 2px solid">
<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">2015</font></div>
</td>
<td nowrap="" valign="bottom" style="TEXT-ALIGN: left; PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold"> </font></td>
<td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold"> </font></td>
<td colspan="2" valign="bottom" style="BORDER-BOTTOM: black 2px solid">
<div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">2014</font></div>
</td>
<td nowrap="" valign="bottom" style="TEXT-ALIGN: left; PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold"> </font></td>
</tr>
In the example above, tds 2, 4 and 5 are blank tds that exist only to separate columns.
Please help.
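Answer 0 (score: 1)
The code below (tested under Python 3.6) skips the blank separator cells and any row left with no cells. In header rows, which it treats as rows with fewer than 8 cells, it drops every empty cell, since those cells carry no width.
If the code works for you, you can remove the debug prints.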
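Before the full listing, the keep-or-skip decision it makes for each cell can be isolated as a pure function, which makes the rule easy to check without fetching the page. This is only a sketch: `is_separator_width`, `keep_cell` and the `header_max_cells` parameter are hypothetical names introduced here, and the width is assumed to be a percentage string such as '1%', as in the listing.

```python
def is_separator_width(width):
    """Return True when a width attribute such as '1%' is at most 2 percent."""
    if not width:
        return False
    # Assumption: the value ends with '%', as the full listing also assumes.
    return int(width[:-1]) <= 2


def keep_cell(text, width, cells_in_row, header_max_cells=8):
    """Keep non-empty cells, plus empty cells that are neither
    separators nor part of a header row."""
    data = text.strip()
    is_header = cells_in_row < header_max_cells
    return bool(data) or (not is_separator_width(width) and not is_header)


# Header row (7 cells, no widths): only cells with text survive.
print(keep_cell("2015", None, 7))
print(keep_cell(" ", None, 7))
# Data row (12 cells): narrow empty cells are dropped, wide empty cells stay.
print(keep_cell("", "1%", 12))
print(keep_cell("", "9%", 12))
```

Keeping wide empty cells in data rows is what preserves the column positions, so blank values do not shift the numbers that follow them.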
import requests
from bs4 import BeautifulSoup


def main(url):
    def _is_separator_cell(cell):
        width = cell.attrs.get('width', None)
        if width:
            _width = int(cell.attrs.get('width')[:-1])
            return _width <= 2
        else:
            return False

    htmlpage = requests.get(url).content
    page = BeautifulSoup(htmlpage, "html.parser")
    all_divtables = page.find_all('table')
    # only taking data from 38th table
    for i, table in enumerate(all_divtables[38:39]):
        rows = table.find_all('tr', recursive=False)
        table_data = []
        for r, tr in enumerate(rows):
            row_data = []
            print('DBG {}.'.format(r))
            cells = tr.find_all('td')
            is_header = len(cells) < 8
            for c, cell in enumerate(cells):
                data = cell.text.strip()
                separator_cell = _is_separator_cell(cell)
                print('\tDBG {}. [{}] (width: {})'.format(c, data, cell.attrs.get('width')))
                if data or (not separator_cell and not is_header):
                    row_data.append(data)
            if row_data:
                table_data.append(row_data)
    return table_data


table_data = main('https://www.sec.gov/Archives/edgar/data/1097718/000135448815004617/uoip_10k.htm')
print('results:')
for row in table_data:
    print(row)
Results:
['June 30,', 'June 30,']
['2015', '2014']
['Assets']
['Current Assets:']
['Cash', '$', '21,745', '$', '56,827']
['Accounts receivable, net', '19,945', '84,091']
['Inventory', '-', '19,069']
['Prepaid expenses', '66,543', '136,927']
['Marketable securities', '2', '3']
['Other current assets', '10,208', '51,708']
['Total Current Assets', '118,443', '348,625']
['', '', '']
['Property and equipment, net of accumulated depreciation of $1,140,249 and\xa0\xa0$939,408 respectively', '51,462', '451,843']
['Deposits', '5,923', '5,923']
['Other assets', '1,545', '1,545']
['Total Assets', '$', '177,373', '$', '807,936']
['', '', '']
["Liabilities and Stockholders' Deficit", '', '']
['Current Liabilities:', '', '']
['Accounts payable and accrued liabilities', '$', '1,043,088', '$', '840,009']
['Notes payable, current portion', '962,810', '472,017']
['Capital lease payable, current portion', '886,356', '660,458']
['Note payable, related party', '1,029,005', '479,578']
['Deferred revenue', '85,407', '74,824']
['Convertible notes payable, net of discount', '115,632', '197,645']
['Derivative liability - warrants', '83,766', '302,065']
['Derivative liability - embedded conversion option', '346,734', '469,632']
['Total Current Liabilities', '4,552,798', '3,496,228']
['', '', '']
['Capital lease payable, long term portion', '517,686', '1,143,501']
['Total Liabilities', '5,070,484', '4,639,729']
['', '', '']
['Commitments and Contingencies (Note 14)', '', '']
['', '', '']
["Stockholders' Deficit:", '', '']
['Series B convertible preferred stock ($.001 par value; 10,000,000 shares authorized; 626,667 shares issued and outstanding)', '626', '626']
['Series AA convertible preferred stock ($.001 par value; 10,000,000 shares authorized; 0 and 400,000 shares issued and outstanding, respectively)', '-', '400']
['Common stock ($.001 par value; 6,000,000,000 shares authorized; 912,466,204 and 1,742,940 shares issued and\xa0\xa0outstanding, respectively)', '912,466', '1,743']
['Additional paid in capital', '48,984,686', '49,075,659']
['Accumulated deficit', '(54,696,891', ')', '(52,816,224', ')']
['Accumulated other comprehensive loss', '(80,998', ')', '(80,997', ')']
['Treasury stock, at cost, (406 shares)', '(13,000', ')', '(13,000', ')']
["Total Stockholders' Deficit", '(4,893,111', ')', '(3,831,793', ')']
["Total Liabilities and Stockholders' Deficit", '$', '177,373', '$', '807,936']
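In the rows above, currency signs and closing parentheses still arrive as separate cells (for example '$', '21,745' or '(54,696,891', ')'). A possible post-processing step, sketched below, merges those fragments and converts the values to integers. `merge_fragments` and `to_number` are hypothetical helpers, not part of the code above, and reading '-' as zero and parentheses as a negative amount is an assumption based on common accounting notation.

```python
def merge_fragments(row):
    """Drop standalone '$' cells and reattach a standalone ')' to the value before it."""
    merged = []
    for cell in row:
        if cell == '$':
            continue
        if cell == ')' and merged:
            merged[-1] += ')'
        else:
            merged.append(cell)
    return merged


def to_number(text):
    """Convert '21,745', '(80,998)' or '-' to an int; return None for anything else."""
    cleaned = text.replace(',', '')
    negative = cleaned.startswith('(') and cleaned.endswith(')')
    if negative:
        cleaned = cleaned[1:-1]
    if cleaned == '-':
        return 0  # assumption: a dash stands for a zero balance
    if not cleaned.isdigit():
        return None
    return -int(cleaned) if negative else int(cleaned)


row = ['Accumulated deficit', '(54,696,891', ')', '(52,816,224', ')']
label, *values = merge_fragments(row)
print(label, [to_number(v) for v in values])
```

Only cells that are exactly '$' or ')' are touched, so labels that merely contain those characters, such as the depreciation line, pass through unchanged.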