How to identify the blank `td` cells that separate each column of a table

Asked: 2019-06-01 09:09:41

Tags: python web-scraping beautifulsoup scrapy

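I am scraping data from the Consolidated Balance Sheets table in the SEC filing uoip_10k. Each column is separated by one or two blank td cells. Is there a way to identify these blank td cells?

This is what I am doing at the moment.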

import urllib.request

from bs4 import BeautifulSoup


def check_if_cell_seperator(cell):
    # A separator cell is narrow (width below 2%) and contains no text.
    if cell.has_attr('width'):
        width = int(cell["width"].strip('%').strip())
        return width < 2 and cell.text.strip() == ''
    return False


def main(url):
    htmlpage = urllib.request.urlopen(url)
    page = BeautifulSoup(htmlpage, "html.parser")
    all_divtables = page.find_all('table')
    table_data = []
    # only take the table at index 38
    for table in all_divtables[38:39]:
        rows = table.find_all('tr', recursive=False)
        for tr in rows:
            row_data = []
            for cell in tr.find_all('td'):
                if check_if_cell_seperator(cell):
                    continue
                row_data.append(cell.text.strip())
            # append once per row, inside the row loop
            table_data.append(row_data)
    print(table_data)

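The problem is that this fails for rows whose cells do not specify a width (for example, the header rows).

Is there a way to identify and remove the td cells that exist only to separate columns?

I cannot simply drop every blank entry from the final list, because that would affect the indentation.

Example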

    <tr>
    <td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt">&nbsp; </font></td>
    <td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">&nbsp;</font></td>
    <td colspan="2" valign="bottom" style="BORDER-BOTTOM: black 2px solid">
        <div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">2015</font></div>
    </td>
    <td nowrap="" valign="bottom" style="TEXT-ALIGN: left; PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">&nbsp;</font></td>
    <td valign="bottom" style="PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">&nbsp;</font></td>
    <td colspan="2" valign="bottom" style="BORDER-BOTTOM: black 2px solid">
        <div style="TEXT-INDENT: 0pt; DISPLAY: block; MARGIN-LEFT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 8pt; FONT-WEIGHT: bold">2014</font></div>
    </td>
    <td nowrap="" valign="bottom" style="TEXT-ALIGN: left; PADDING-BOTTOM: 2px"><font style="DISPLAY: inline; FONT-FAMILY: times new roman; FONT-SIZE: 10pt; FONT-WEIGHT: bold">&nbsp;</font></td>
    </tr> 

In the example above, the 2nd, 4th and 5th td elements are blank and serve only to separate columns.

Please help.

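1 Answer:

Answer 0 (score: 1)

The code below (tested under Python 3.6) skips the blank separator cells and any row that yields no data. Header rows, whose cells specify no width, are detected by their cell count (fewer than 8 cells), and only their non-empty cells are kept.

If the code works for you, you can remove the debug prints.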

import requests
from bs4 import BeautifulSoup


def main(url):
    def _is_separator_cell(cell):
        # Separator cells carry a small percentage width such as "1%".
        width = cell.attrs.get('width')
        if width:
            return int(width.rstrip('%')) <= 2
        return False

    htmlpage = requests.get(url).content
    page = BeautifulSoup(htmlpage, "html.parser")
    all_divtables = page.find_all('table')
    table_data = []
    # only take the table at index 38
    for table in all_divtables[38:39]:
        rows = table.find_all('tr', recursive=False)
        for r, tr in enumerate(rows):
            row_data = []
            print('DBG {}.'.format(r))
            cells = tr.find_all('td')
            # Header rows have fewer cells and no width attributes.
            is_header = len(cells) < 8
            for c, cell in enumerate(cells):
                data = cell.text.strip()
                separator_cell = _is_separator_cell(cell)
                print('\tDBG {}. [{}] (width: {})'.format(c, data, cell.attrs.get('width')))
                # Keep cells with text; keep blank cells only when they are
                # real columns of a non-header row.
                if data or (not separator_cell and not is_header):
                    row_data.append(data)
            if row_data:
                table_data.append(row_data)
    return table_data


table_data = main('https://www.sec.gov/Archives/edgar/data/1097718/000135448815004617/uoip_10k.htm')
print('results:')
for row in table_data:
    print(row)

Results:

['June 30,', 'June 30,']
['2015', '2014']
['Assets']
['Current Assets:']
['Cash', '$', '21,745', '$', '56,827']
['Accounts receivable, net', '19,945', '84,091']
['Inventory', '-', '19,069']
['Prepaid expenses', '66,543', '136,927']
['Marketable securities', '2', '3']
['Other current assets', '10,208', '51,708']
['Total Current Assets', '118,443', '348,625']
['', '', '']
['Property and equipment, net of accumulated depreciation of $1,140,249 and\xa0\xa0$939,408 respectively', '51,462', '451,843']
['Deposits', '5,923', '5,923']
['Other assets', '1,545', '1,545']
['Total Assets', '$', '177,373', '$', '807,936']
['', '', '']
["Liabilities and Stockholders' Deficit", '', '']
['Current Liabilities:', '', '']
['Accounts payable and accrued liabilities', '$', '1,043,088', '$', '840,009']
['Notes payable, current portion', '962,810', '472,017']
['Capital lease payable, current portion', '886,356', '660,458']
['Note payable, related party', '1,029,005', '479,578']
['Deferred revenue', '85,407', '74,824']
['Convertible notes payable, net of discount', '115,632', '197,645']
['Derivative liability - warrants', '83,766', '302,065']
['Derivative liability - embedded conversion option', '346,734', '469,632']
['Total Current Liabilities', '4,552,798', '3,496,228']
['', '', '']
['Capital lease payable, long term portion', '517,686', '1,143,501']
['Total Liabilities', '5,070,484', '4,639,729']
['', '', '']
['Commitments and Contingencies (Note 14)', '', '']
['', '', '']
["Stockholders' Deficit:", '', '']
['Series B convertible preferred stock ($.001 par value; 10,000,000 shares authorized; 626,667 shares issued and outstanding)', '626', '626']
['Series AA convertible preferred stock ($.001 par value; 10,000,000 shares authorized; 0 and 400,000 shares issued and outstanding, respectively)', '-', '400']
['Common stock ($.001 par value; 6,000,000,000 shares authorized; 912,466,204 and 1,742,940 shares issued and\xa0\xa0outstanding, respectively)', '912,466', '1,743']
['Additional paid in capital', '48,984,686', '49,075,659']
['Accumulated deficit', '(54,696,891', ')', '(52,816,224', ')']
['Accumulated other comprehensive loss', '(80,998', ')', '(80,997', ')']
['Treasury stock, at cost, (406 shares)', '(13,000', ')', '(13,000', ')']
["Total Stockholders' Deficit", '(4,893,111', ')', '(3,831,793', ')']
["Total Liabilities and Stockholders' Deficit", '$', '177,373', '$', '807,936']
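
The keep/skip rule used in the answer can be tried in isolation, without fetching the page. The following is only a minimal sketch: it assumes each cell has been reduced to a hypothetical (text, width) pair, and the percentage widths in the usage example are invented for illustration rather than taken from the filing. The threshold of 8 cells and the "width at most 2" test mirror the answer's code.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a parsed <td>:
# (stripped text, raw width attribute or None when the attribute is absent).
Cell = Tuple[str, Optional[str]]

HEADER_MAX_CELLS = 8  # mirrors `len(cells) < 8` in the answer


def is_separator(width: Optional[str]) -> bool:
    """Treat a cell as a separator when its percentage width is at most 2."""
    if not width:
        return False
    return int(width.rstrip('%')) <= 2


def filter_row(cells: List[Cell]) -> List[str]:
    """Apply the answer's keep rule to one row of cells."""
    is_header = len(cells) < HEADER_MAX_CELLS
    kept = []
    for text, width in cells:
        # Keep any cell with text; keep a blank cell only when it is a
        # non-separator column in a non-header row.
        if text or (not is_separator(width) and not is_header):
            kept.append(text)
    return kept


# Header row shaped like the question's example: 7 cells, no widths.
header = [('', None), ('', None), ('2015', None), ('', None),
          ('', None), ('2014', None), ('', None)]
print(filter_row(header))

# Body row with invented widths: narrow blank cells are dropped.
body = [('Cash', '70%'), ('', '1%'), ('$', '1%'), ('21,745', '9%'), ('', '1%'),
        ('', '1%'), ('$', '1%'), ('56,827', '9%'), ('', '1%')]
print(filter_row(body))
```

Because blank cells in wide columns of body rows are kept, a spacer row still produces one empty string per real column, which is consistent with the `['', '', '']` rows in the results above. The cell-count test for header rows is specific to this table layout and would likely need adjusting for other filings.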