BeautifulSoup: unwrapping tables in nested classes

Date: 2017-02-12 03:53:36

Tags: python pandas bs4

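I am trying to use pandas.read_html to parse some tables, but I noticed that the HTML I retrieve has nested classes inside each tr.

link here: The data is actually stored in JSON format, so I parse it to extract the HTML code.

I have shortened the HTML drastically, but I hope it still shows what I mean and what I want to achieve.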

<div class='\"tab_content' id='\"tab-profitability\"' style='\"display:block;\"' tab_override="">
    <table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
        <colgroup>
            <col width='\"23%\"'></col>
            <col span='\"11\"' width='\"7%\"'></col>
        </colgroup>
        <thead>
            <tr>
                <th align='\"left\"' class='\"str' id='\"pr-margins\"' scope='\"col\"' text2="">Margins % of Sales</th>
                <th align='\"right\"' id='\"pr-Y0\"' scope='\"col\"'>2006-12</th>
                <th align='\"right\"' id='\"pr-Y1\"' scope='\"col\"'>2007-12</th>
                <th align='\"right\"' id='\"pr-Y2\"' scope='\"col\"'>2008-12</th>
            </tr>
        </thead>
        <tbody>
            <tr class='\"hr\"'>
                <td colspan='\"12\"'></td>
            </tr>
            <tr>
                <th class='\"row_lbl\"' id='\"i12\"' scope='\"row\"'>Revenue</th>
                <td align='\"right\"' headers='\"pr-Y0' i12="" pr-margins="">100.00</td>
                <td align='\"right\"' headers='\"pr-Y1' i12="" pr-margins="">100.00</td>
                <td align='\"right\"' headers='\"pr-Y2' i12="" pr-margins="">100.00</td>
            </tr>
            <tr class='\"hr\"'>
                <td colspan='\"12\"'></td>
            </tr>
            <tr>
                <th class='\"row_lbl\"' id='\"i13\"' scope='\"row\"'>COGS</th>
                <td align='\"right\"' headers='\"pr-Y0' i13="" pr-margins="">49.55</td>
                <td align='\"right\"' headers='\"pr-Y1' i13="" pr-margins="">55.63</td>
                <td align='\"right\"' headers='\"pr-Y2' i13="" pr-margins="">69.97</td>
            </tr>
            <tr class='\"hr\"'>
                <td colspan='\"12\"'>
                    <div class='\"hspacer2\"'>
                        <table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
                            <colgroup>
                                <col width='\"23%\"'></col>
                                <col span='\"11\"' width='\"7%\"'></col>
                            </colgroup>
                            <thead>
                                <tr>
                                    <th align='\"left\"' class='\"str' id='\"pr-profit\"' scope='\"col\"' text2="">Profitability</th>
                                    <th align='\"right\"' id='\"pr-pro-Y0\"' scope='\"col\"'>2006-12</th>
                                    <th align='\"right\"' id='\"pr-pro-Y1\"' scope='\"col\"'>2007-12</th>
                                    <th align='\"right\"' id='\"pr-pro-Y2\"' scope='\"col\"'>2008-12</th>
                                </tr>
                            </thead>
                            <tbody>
                                <tr class='\"hr\"'>
                                    <td colspan='\"12\"'></td>
                                </tr>
                                <tr>
                                    <th class='\"row_lbl\"' id='\"i21\"' scope='\"row\"'>Tax Rate %</th>
                                    <td align='\"right\"' headers='\"pr-pro-Y0' i21="" pr-profit="">22.17</td>
                                    <td align='\"right\"' headers='\"pr-pro-Y1' i21="" pr-profit="">5.29</td>
                                    <td align='\"right\"' headers='\"pr-pro-Y2' i21="" pr-profit="">11.59</td>
                                </tr>
                                <tr class='\"hr\"'>
                                    <td colspan='\"12\"'></td>
                                </tr>
                                <tr>
                                    <th class='\"row_lbl\"' id='\"i22\"' scope='\"row\"'>Net Margin %</th>
                                    <td align='\"right\"' headers='\"pr-pro-Y0' i22="" pr-profit="">13.06</td>
                                    <td align='\"right\"' headers='\"pr-pro-Y1' i22="" pr-profit="">17.09</td>
                                    <td align='\"right\"' headers='\"pr-pro-Y2' i22="" pr-profit="">10.65</td>
                                </tr>
                                <tr class='\"hr\"'>
                                    <td colspan='\"12\"'>
                                        <div class='\"tab_content' id='\"tab-growth\"' style='\"display:none;\"' tab_override="">
                                            <table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
                                                <colgroup>
                                                    <col width='\"23%\"'></col>
                                                    <col span='\"11\"' width='\"7%\"'></col>
                                                </colgroup>
                                                <thead>
                                                    <tr>
                                                        <th></th>
                                                        <th align='\"right\"' id='\"gr-Y0\"' scope='\"col\"'>2006-12</th>
                                                        <th align='\"right\"' id='\"gr-Y1\"' scope='\"col\"'>2007-12</th>
                                                        <th align='\"right\"' id='\"gr-Y2\"' scope='\"col\"'>2008-12</th>
                                                    </tr>
                                                </thead>
                                                <tbody>
                                                    <tr class='\"hr\"'>
                                                        <td colspan='\"12\"'></td>
                                                    </tr>
                                                    <tr>
                                                        <th align='\"left\"' class='\"str' colspan='\"12\"' id='\"gr-revenue\"' scope='\"row\"' text2="">Revenue %</th>
                                                    </tr>
                                                    <tr class='\"hr\"'>
                                                        <td colspan='\"12\"'></td>
                                                    </tr>
                                                    <tr>
                                                        <th class='\"row_lbl\"' id='\"i28\"' scope='\"row\"'>Year over Year</th>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y0' i28="">—</td>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y1' i28="">48.48</td>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y2' i28="">187.48</td>
                                                    </tr>
                                                    <tr class='\"hr\"'>
                                                        <td colspan='\"12\"'></td>
                                                    </tr>
                                                    <tr>
                                                        <th class='\"row_lbl\"' id='\"i29\"' scope='\"row\"'>3-Year Average</th>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y0' i29="">—</td>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y1' i29="">10.04</td>
                                                        <td align='\"right\"' gr-revenue="" headers='\"gr-Y2' i29="">61.51</td>
                                                    </tr>

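How can I unwrap the HTML and parse it into pandas?

I noticed that each of the last tr elements contains a class named "r_table1". I tried the code below to see whether I could unwrap it, but it does not work.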

import bs4
import pandas
import requests

r = requests.get(r'url_link')
initial_html = bs4.BeautifulSoup(r.text, 'lxml')
# replace each element that has class "r_table1" with its own children
for each_class in initial_html.findAll(attrs={'class': 'r_table1'}):
    each_class.unwrap()

df = pandas.read_html(str(initial_html), flavor='lxml')  # error message: lxml.etree.XMLSyntaxError: Unexpected end tag : col, line 1, column 886

1 answer:

Answer 0 (score: 1)

Try this:

import pandas as pd
import requests
import json

url = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&t=XHKG:02888&region=hkg&culture=en-US&cur=&order=asc'

r = requests.get(url)
# let's create a valid HTML document - add `<html>`, `</html>` tags
body = '{}{}{}'.format('<html>', json.loads(r.text)['componentData'], '</html>')
dfs = pd.read_html(body)

for df in dfs:
    print(df)
    # print line separator so we can visually distinguish different DFs
    print('-'*80)

Output:

In [31]: for df in dfs:
    ...:     print(df)
    ...:     print('-'*80)
    ...:
    Margins % of Sales 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12     TTM
0              Revenue  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
1                 COGS       —       —       —       —       —       —       —       —       —       —       —
2         Gross Margin       —       —       —       —       —       —       —       —       —       —       —
3                 SG&A   14.65   12.90   13.53   13.82   12.51   10.86   10.61   10.47   16.72   24.81   27.64
4                  R&D       —       —       —       —       —       —       —       —       —       —       —
5                Other  -14.65  -12.90  -13.53  -13.82  -12.51  -10.86  -10.61  -10.47  -16.72  -24.81  -27.64
6     Operating Margin   39.77   39.15   36.12   39.07   40.33   40.32   38.95   35.34   26.15  -14.77  -30.88
7  Net Int Inc & Other       —       —       —       —       —       —       —       —       —       —       —
8           EBT Margin  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00
--------------------------------------------------------------------------------
                  Profitability 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12     TTM
0                    Tax Rate %   25.93   25.92   26.87   32.50   27.90   27.19   27.50   30.74   36.13       —       —
1                  Net Margin %   28.51   27.57   26.95   25.64   28.54   28.99   27.68   23.83   16.14  -21.27  -36.92
2      Asset Turnover (Average)    0.03    0.03    0.03    0.03    0.03    0.03    0.03    0.03    0.02    0.02    0.01
3            Return on Assets %    0.95    0.95    0.89    0.78    0.91    0.87    0.79    0.62    0.37   -0.32   -0.48
4  Financial Leverage (Average)   15.79   15.79   19.65   15.97   13.52   14.71   14.03   14.58   15.63   13.29   13.63
5            Return on Equity %   15.86   15.07   15.85   13.66   13.22   12.29   11.36    8.93    5.64   -4.64   -6.69
6  Return on Invested Capital %       —       —       —       —       —       —       —       —       —       —       —
7             Interest Coverage       —       —       —       —       —       —       —       —       —       —       —
--------------------------------------------------------------------------------
            Unnamed: 0 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12  2013-12  2014-12 2015-12 Latest Qtr
0            Revenue %     NaN     NaN     NaN     NaN     NaN     NaN     NaN      NaN      NaN     NaN        NaN
1       Year over Year       —   28.97   22.71    4.25   15.13   10.21    5.54    -2.81    -5.64  -36.31          —
2       3-Year Average       —   24.18   22.61   18.16   13.78    9.77   10.23     4.17    -1.08  -16.41          —
3       5-Year Average       —       —   21.69   19.62   17.21   15.92   11.37     6.29     4.20   -7.44          —
4      10-Year Average       —       —       —       —       —       —       —    13.73    11.64    4.16          —
5   Operating Income %     NaN     NaN     NaN     NaN     NaN     NaN     NaN      NaN      NaN     NaN        NaN
6       Year over Year       —   26.94   13.24   12.76   18.85   10.67   -0.22   -10.30   -30.16       —          —
7       3-Year Average       —   21.47   19.44   17.47   14.92   14.04    9.48    -0.32   -14.50       —          —
8       5-Year Average       —   38.77   24.13   18.01   17.96   16.35   10.88     5.83    -3.84       —          —
9      10-Year Average       —   10.44   14.62   20.17   20.75   20.06   24.04    14.62     6.52       —          —
10        Net Income %     NaN     NaN     NaN     NaN     NaN     NaN     NaN      NaN      NaN     NaN        NaN
11      Year over Year       —   24.71   19.96   -0.82   28.17   11.93    0.78   -16.31   -36.11       —          —
12      3-Year Average       —   21.65   20.54   14.06   15.10   12.47   13.08    -1.90   -18.62       —          —
13      5-Year Average       —   37.31   25.86   16.46   17.36   16.31   11.46     3.72    -5.02       —          —
14     10-Year Average       —   11.40   16.06   19.77   15.66   21.38   23.71    14.25     5.17       —          —
15               EPS %     NaN     NaN     NaN     NaN     NaN     NaN     NaN      NaN      NaN     NaN        NaN
16      Year over Year       —   18.98    9.87  -13.50   21.19    2.69   -0.25   -17.55   -37.67       —          —
17      3-Year Average       —  -75.02   14.12    4.18    4.82    2.49    7.48    -5.48   -19.97       —          —
18      5-Year Average       —   41.12   21.93  -56.93    9.27    7.07    3.36    -2.41    -8.60       —          —
19     10-Year Average       —    7.80   11.45   14.14    9.87   15.55   20.78     9.09   -37.26       —          —
--------------------------------------------------------------------------------
                   Cash Flow Ratios 2006-12 2007-12 2008-12 2009-12  2010-12 2011-12  2012-12 2013-12 2014-12  2015-12     TTM
0  Operating Cash Flow Growth % YOY       —       —       —       —        —       —  -267.00       —       —        —       —
1       Free Cash Flow Growth % YOY       —       —       —       —        —       —  -206.00       —       —        —       —
2            Cap Ex as a % of Sales    3.07    4.57   11.31    1.98     2.44    1.72     0.96    1.19    1.17     1.26    1.52
3            Free Cash Flow/Sales %  102.33  180.91  176.32  -25.15  -112.03  108.82   100.71   53.03  323.44  -286.34  -93.74
4         Free Cash Flow/Net Income    3.59    6.56    6.88   -0.98    -3.93    3.73     3.62    2.22   20.04    13.46    2.59
--------------------------------------------------------------------------------
       Balance Sheet Items (in %) 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 Latest Qtr
0   Cash & Short-Term Investments    2.89    3.09    5.55    4.15    6.34    7.91    9.59    8.09   13.40   10.20      10.01
1             Accounts Receivable       —       —       —       —       —       —       —       —       —       —          —
2                       Inventory       —       —       —       —       —       —       —       —       —       —          —
3            Other Current Assets       —       —       —       —       —       —       —       —       —       —          —
4            Total Current Assets       —       —       —       —       —       —       —       —       —       —          —
5                        Net PP&E    0.81    0.88    0.82    0.94    0.87    0.85    1.04    1.02    1.10    1.13       1.13
6                     Intangibles    2.31    1.94    1.46    1.52    1.35    1.18    1.15    0.90    0.71    0.72       0.73
7          Other Long-Term Assets       —       —       —       —       —       —       —       —       —       —          —
8                    Total Assets  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00     100.00
9                Accounts Payable    0.03    0.06    0.12    0.18    0.19    0.17    0.17    0.16    0.12    0.12       1.06
10                Short-Term Debt       —       —       —       —       —       —       —       —       —       —          —
11                  Taxes Payable    0.03    0.06    0.12    0.18    0.19    0.17    0.17    0.16    0.12    0.12       1.06
12            Accrued Liabilities       —       —       —       —       —       —       —       —       —       —          —
13   Other Short-Term Liabilities       —       —       —       —       —       —       —       —       —       —          —
14      Total Current Liabilities       —       —       —       —       —       —       —       —       —       —          —
15                 Long-Term Debt       —       —       —       —       —       —       —       —       —       —          —
16    Other Long-Term Liabilities       —       —       —       —       —       —       —       —       —       —          —
17              Total Liabilities   93.67   93.67   94.91   93.74   92.60   93.20   92.87   93.14   93.60   92.48          —
18     Total Stockholders' Equity    6.33    6.33    5.09    6.26    7.40    6.80    7.13    6.86    6.40    7.52     100.00
19     Total Liabilities & Equity  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00  100.00     100.00
--------------------------------------------------------------------------------
  Liquidity/Financial Health 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 Latest Qtr
0              Current Ratio       —       —       —       —       —       —       —       —       —       —          —
1                Quick Ratio       —       —       —       —       —       —       —       —       —       —          —
2         Financial Leverage   15.79   15.79   19.65   15.97   13.52   14.71   14.03   14.58   15.63   13.29      13.63
3                Debt/Equity       —       —       —       —       —       —       —       —       —       —          —
--------------------------------------------------------------------------------
               Efficiency 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12   TTM
0  Days Sales Outstanding       —       —       —       —       —       —       —       —       —       —     —
1          Days Inventory       —       —       —       —       —       —       —       —       —       —     —
2         Payables Period       —       —       —       —       —       —       —       —       —       —     —
3   Cash Conversion Cycle       —       —       —       —       —       —       —       —       —       —     —
4    Receivables Turnover       —       —       —       —       —       —       —       —       —       —     —
5      Inventory Turnover       —       —       —       —       —       —       —       —       —       —     —
6   Fixed Assets Turnover    4.19    4.08    3.91    3.43    3.53    3.49    3.01    2.53    2.18    1.36  1.16
7          Asset Turnover    0.03    0.03    0.03    0.03    0.03    0.03    0.03    0.03    0.02    0.02  0.01
--------------------------------------------------------------------------------
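
The oddly quoted attributes in the question's HTML (for example class='\"tab_content') look like what you get when the raw response text is parsed before it has been JSON-decoded. That is an inference from the snippet, not something confirmed by the asker. The sketch below isolates the decoding step with a small made-up payload; the sample HTML and the helper name extract_html are assumptions for illustration, and only the key 'componentData' comes from the code above.

```python
import json

# Hypothetical payload mimicking the shape of the response: a JSON object
# whose "componentData" value is an HTML string (sample content is made up).
raw_text = json.dumps({
    "componentData": '<table class="r_table1"><tr><th id="i12">Revenue</th>'
                     '<td align="right">100.00</td></tr></table>'
})


def extract_html(text, key="componentData"):
    """Decode the JSON text and wrap the embedded HTML in <html> tags."""
    fragment = json.loads(text)[key]
    return "<html>{}</html>".format(fragment)


# The raw text still carries JSON escapes (\"); an HTML parser would read
# them as part of the attribute values.
has_escapes_raw = '\\"' in raw_text

html = extract_html(raw_text)
# After decoding, the quotes are plain and the attributes are well-formed.
has_escapes_decoded = '\\"' in html

print(has_escapes_raw, has_escapes_decoded)
```

The decoded string can then be passed to pd.read_html as in the answer. On recent pandas versions, wrapping it in io.StringIO should avoid the deprecation warning for literal HTML strings.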