Question

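I'm trying to get the ticker symbols of a few hundred ETFs from etfdailynews.com. I first get the list of category names from https://etfdailynews.com/etfs/, then append each category to that URL to open the page that lists the ETF names and symbols. For example, https://etfdailynews.com/etfs/technology-equities-etfs/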

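On the page, the column titled "Fund Symbol/Name" has the symbol with the name below it. The plan was to read the table and then, assuming there is some \n between the symbol and the name, split on it to get just the symbol. For example, to get the first 10: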

import pandas as pd

sector_table = pd.read_html("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs")
etf_list = list(sector_table[0]["Fund Symbol/Name"].iloc[0:10])

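The problem is that it returns the symbol and name with no whitespace between them. Because symbols are sometimes 3 and sometimes 4 characters long, I can't do a simple slice. Example of the list returned above: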

['SPYSPDR S&P 500', 'IVViShares Core S&P 500 ETF', 'VTIVanguard Stock Market ETF', 'VOOVanguard S&P 500 ETF', 'VIGVanguard Div Appreciation ETF - DNQ', 'IWBiShares Russell 1000 ETF', 'RSPGuggenheim S&P 500 Equal Weight ETF', 'USMViShares Edge MSCI Min Vol USA ETF', 'ITOTiShares Core S&P US Stock Market ETF', 'SCHXSchwab US Large-Cap ETF']

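Maybe there is a way to do what I have in mind with BeautifulSoup, but I'm not familiar with that module, and as far as I know pd.read_html is better suited to working with tables. I may be completely wrong about that.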

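Edit: I should clarify that I intend to open each ETF's URL to extract the tickers. I had planned to append the ETF symbol to the URL. An alternative that lets me simply extract the ETFs' URLs would also be perfect.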

Answer 1

This function parses cells that contain a line break by inserting a semicolon at the <br> tag and then splitting the cell text on it.

(HTML as of 3/18/18, https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/)

<td class="bold"><a class="show" href="/etf/SPY/">SPY<br/> <span class="thirteen unbold">SPDR S&P 500</span></a></td>
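To make the idea concrete before the full function, here is a minimal sketch of the semicolon trick applied to just that one cell. It assumes BeautifulSoup (bs4) is installed; `split_symbol_name` is a hypothetical helper name used only for illustration, and the sketch assumes fund names never contain a semicolon.

```python
from bs4 import BeautifulSoup

# Sample cell copied from the page snapshot shown above
cell_html = (
    '<td class="bold"><a class="show" href="/etf/SPY/">SPY<br/> '
    '<span class="thirteen unbold">SPDR S&amp;P 500</span></a></td>'
)


def split_symbol_name(td):
    """Return (symbol, name) from a cell whose two parts are separated by <br/>."""
    # Give the <br/> tag a text child so the boundary shows up in td.text
    td.br.string = ";"
    symbol, name = td.text.split(";", 1)
    # Strip the whitespace that surrounded the tags in the markup
    return symbol.strip(), name.strip()


td = BeautifulSoup(cell_html, "html.parser").td
symbol, name = split_symbol_name(td)
print(symbol, "|", name)
```

Splitting with a maximum of one split keeps the rest of the text together as the name, even if the markup around the name changes slightly.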

After opening the URL with urllib or requests, pass the HTML table to the function below; it returns a DataFrame.

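import pandas as pd


def parse_etf_html(data_table, debug=False):
    """Parse a BeautifulSoup <table> tag into a DataFrame, splitting <br/>-separated cells."""
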
    header = [th.text for th in data_table.findAll('th')]

    # header modifications
    compound_field = header.pop(0)

    header.insert(0, compound_field.split('/')[0])
    header.insert(1, 'Fund ' + compound_field.split('/')[1])

    compound_field = header.pop(8)
    header.insert(8, compound_field + '_cur')
    header.insert(9, compound_field + '_per')


    # Row 0 is the table header

    extracted_data = list()

    # Starting at row 1, loop each table row 

    for tr in data_table.findAll('tr')[1:]:
        extracted_row = list()

        if debug:
            # simple test to verify if number of items matches expectations.
            row_parsing_log = dict()

        for td in tr.findAll('td'):

            #<td class="bold"><a class="show" href="/etf/SPY/">SPY<br/><span class="thirteen unbold">SPDR S&amp;P 500</span></a></td>,
            if td.find('a') and td.find('br') and td.find('span'):
                td.br.string=";"
                extracted_row.extend(td.text.split(";"))

                if debug:
                    row_parsing_log['symbol_fund_as_expected'] = len(td.text.split(";")) == 2

            # <td class="grade-4">+0.25<br/>(0.09%)</td>
            elif td.find('br') and [td.find('strong'),td.find('small'), td.find('a')]  == [None, None, None]:
                td.br.string=";"
                # percent change is enclosed with ().  remove to avoid confusion 
                extracted_row.append(td.text.split(";")[0])
                extracted_row.append(td.text.split(";")[1].replace("(", "").replace(")", ""))

                if debug:
                    row_parsing_log['day_chg_as_expected'] = len(td.text.split(";")) == 2

            #<td class="text-center grade-1"> <strong>A</strong><br/> <small>Strong Buy</small> </td>
            elif td.find('br') and td.find('strong') and td.find('small'):
                # Appears to be parsed correctly by pandas read html
                extracted_row.append(td.text.replace('\n', ' ').strip())

            else:
                extracted_row.append(td.text)


        record = dict(zip(header, extracted_row))

        if debug:
            record.update(row_parsing_log)

        # append each row
        extracted_data.append(record)


    if debug:
        header.extend(['symbol_fund_as_expected', 'day_chg_as_expected'])

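    # Build the frame in header order; a column missing from every row becomes NaN
    # instead of raising KeyError
    outputDF = pd.DataFrame(extracted_data, columns=header)

    return outputDF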
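
For the step of opening the URL and handing the table over, something like the following should work. This is only a sketch: `fetch_html` and `first_table` are hypothetical helper names, the User-Agent header is an assumption about what the site will accept, and it assumes the ETF table is the first <table> on the page, which I have not verified for every category.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text."""
    # Assumption: some sites reject urllib's default user agent, so send a browser-like one
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def first_table(html):
    """Return the first <table> tag in the document, or None if there is none."""
    return BeautifulSoup(html, "html.parser").find("table")


# Intended call site (needs network access and parse_etf_html from above):
# html = fetch_html("https://etfdailynews.com/etfs/Large-Cap-Blend-ETFs/")
# df = parse_etf_html(first_table(html))
```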

Link to a static notebook: https://github.com/emican86/49350586/blob/master/read_etf_html_tables.ipynb

Link to an Azure notebook (you can clone it and use it as a live demo): https://notebooks.azure.com/emican86/libraries/read-etf-html-tables
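
Regarding the edit in the question: since each symbol cell already wraps its text in a link (href="/etf/SPY/" in the sample cell), you may not need to build the URLs yourself. Below is a rough sketch that collects those links with BeautifulSoup. `etf_links` is a hypothetical helper name, and the IVV row in the sample is made up by analogy with the SPY row, so check it against the live page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://etfdailynews.com"


def etf_links(table_html, base_url=BASE_URL):
    """Map each ticker symbol to the absolute URL of its ETF page."""
    soup = BeautifulSoup(table_html, "html.parser")
    links = {}
    for a in soup.find_all("a", href=True):
        # Assumption: ETF detail pages all live under /etf/
        if not a["href"].startswith("/etf/"):
            continue
        # The first non-empty text node inside <a> is the symbol (before the <br/>)
        symbol = next(a.stripped_strings)
        links[symbol] = urljoin(base_url, a["href"])
    return links


# First row copied from the snapshot above; the second row is a hypothetical analogue
sample = (
    '<table>'
    '<tr><td class="bold"><a class="show" href="/etf/SPY/">SPY<br/> '
    '<span class="thirteen unbold">SPDR S&amp;P 500</span></a></td></tr>'
    '<tr><td class="bold"><a class="show" href="/etf/IVV/">IVV<br/> '
    '<span class="thirteen unbold">iShares Core S&amp;P 500 ETF</span></a></td></tr>'
    '</table>'
)
print(etf_links(sample))
```

The dictionary keys give you the clean symbols as a side effect, so this also sidesteps the 3-versus-4-character slicing problem.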

Difficulty extracting stock tickers - pd.read_html does not preserve whitespace
