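I am trying to use pandas.read_html to parse some tables, but I noticed that the HTML I retrieve has nested classes in every tr.
Link here: the data is actually stored in JSON format, so I parse it to extract the HTML code.
I have drastically shortened the HTML, but I hope it still captures what I mean and what I am trying to achieve.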
<div class='\"tab_content' id='\"tab-profitability\"' style='\"display:block;\"' tab_override="">
<table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
<colgroup>
<col width='\"23%\"'></col>
<col span='\"11\"' width='\"7%\"'></col>
</colgroup>
<thead>
<tr>
<th align='\"left\"' class='\"str' id='\"pr-margins\"' scope='\"col\"' text2="">Margins % of Sales</th>
<th align='\"right\"' id='\"pr-Y0\"' scope='\"col\"'>2006-12</th>
<th align='\"right\"' id='\"pr-Y1\"' scope='\"col\"'>2007-12</th>
<th align='\"right\"' id='\"pr-Y2\"' scope='\"col\"'>2008-12</th>
</tr>
</thead>
<tbody>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i12\"' scope='\"row\"'>Revenue</th>
<td align='\"right\"' headers='\"pr-Y0' i12="" pr-margins="">100.00</td>
<td align='\"right\"' headers='\"pr-Y1' i12="" pr-margins="">100.00</td>
<td align='\"right\"' headers='\"pr-Y2' i12="" pr-margins="">100.00</td>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i13\"' scope='\"row\"'>COGS</th>
<td align='\"right\"' headers='\"pr-Y0' i13="" pr-margins="">49.55</td>
<td align='\"right\"' headers='\"pr-Y1' i13="" pr-margins="">55.63</td>
<td align='\"right\"' headers='\"pr-Y2' i13="" pr-margins="">69.97</td>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'>
<div class='\"hspacer2\"'>
<table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
<colgroup>
<col width='\"23%\"'></col>
<col span='\"11\"' width='\"7%\"'></col>
</colgroup>
<thead>
<tr>
<th align='\"left\"' class='\"str' id='\"pr-profit\"' scope='\"col\"' text2="">Profitability</th>
<th align='\"right\"' id='\"pr-pro-Y0\"' scope='\"col\"'>2006-12</th>
<th align='\"right\"' id='\"pr-pro-Y1\"' scope='\"col\"'>2007-12</th>
<th align='\"right\"' id='\"pr-pro-Y2\"' scope='\"col\"'>2008-12</th>
</tr>
</thead>
<tbody>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i21\"' scope='\"row\"'>Tax Rate %</th>
<td align='\"right\"' headers='\"pr-pro-Y0' i21="" pr-profit="">22.17</td>
<td align='\"right\"' headers='\"pr-pro-Y1' i21="" pr-profit="">5.29</td>
<td align='\"right\"' headers='\"pr-pro-Y2' i21="" pr-profit="">11.59</td>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i22\"' scope='\"row\"'>Net Margin %</th>
<td align='\"right\"' headers='\"pr-pro-Y0' i22="" pr-profit="">13.06</td>
<td align='\"right\"' headers='\"pr-pro-Y1' i22="" pr-profit="">17.09</td>
<td align='\"right\"' headers='\"pr-pro-Y2' i22="" pr-profit="">10.65</td>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'>
<div class='\"tab_content' id='\"tab-growth\"' style='\"display:none;\"' tab_override="">
<table cellpadding='\"0\"' cellspacing='\"0\"' class='\"r_table1' print97="" style='\"border-top:none;\"' text2="">
<colgroup>
<col width='\"23%\"'></col>
<col span='\"11\"' width='\"7%\"'></col>
</colgroup>
<thead>
<tr>
<th></th>
<th align='\"right\"' id='\"gr-Y0\"' scope='\"col\"'>2006-12</th>
<th align='\"right\"' id='\"gr-Y1\"' scope='\"col\"'>2007-12</th>
<th align='\"right\"' id='\"gr-Y2\"' scope='\"col\"'>2008-12</th>
</tr>
</thead>
<tbody>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th align='\"left\"' class='\"str' colspan='\"12\"' id='\"gr-revenue\"' scope='\"row\"' text2="">Revenue %</th>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i28\"' scope='\"row\"'>Year over Year</th>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y0' i28="">—</td>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y1' i28="">48.48</td>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y2' i28="">187.48</td>
</tr>
<tr class='\"hr\"'>
<td colspan='\"12\"'></td>
</tr>
<tr>
<th class='\"row_lbl\"' id='\"i29\"' scope='\"row\"'>3-Year Average</th>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y0' i29="">—</td>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y1' i29="">10.04</td>
<td align='\"right\"' gr-revenue="" headers='\"gr-Y2' i29="">61.51</td>
</tr>
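How can I unwrap the HTML and parse it into pandas?
I noticed that each of the last tr elements has a class: "r_table1" name. I have tried the code below to see whether I could unwrap it, but it does not work.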
import bs4
import pandas
import requests

r = requests.get(r'url_link')
initial_html = bs4.BeautifulSoup(r.text, 'lxml')
# try to strip every element that carries the r_table1 class
for each_class in initial_html.findAll(attrs={'class': 'r_table1'}):
    each_class.unwrap()
df = pandas.read_html(str(initial_html), flavor='lxml')  # error message: lxml.etree.XMLSyntaxError: Unexpected end tag : col, line 1, column 886
Answer 0 (score: 1)
Try this:
import pandas as pd
import requests
import json
url = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&t=XHKG:02888&region=hkg&culture=en-US&cur=&order=asc'
r = requests.get(url)
# let's create a valid HTML document - add `<html>`, `</html>` tags
body = '{}{}{}'.format('<html>', json.loads(r.text)['componentData'], '</html>')
dfs = pd.read_html(body)
for df in dfs:
    print(df)
    # print line separator so we can visually distinguish different DFs
    print('-'*80)
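As a rough illustration of why the `json.loads` step matters (a sketch on a made-up payload, not the actual Morningstar response): in the raw response text the HTML sits inside a JSON string, so every attribute quote is escaped as `\"`. Decoding the JSON first restores plain quotes, which is probably why the markup in the question shows attributes such as `class='\"r_table1'`. The `html_fragment` and `raw` values below are hypothetical; only the `componentData` key is taken from the code above.

```python
import json

# Hypothetical stand-in for r.text: HTML stored as a JSON string value.
html_fragment = '<table class="r_table1 text2"><tr><td align="right">100.00</td></tr></table>'
raw = json.dumps({"componentData": html_fragment})

# In the raw text every attribute quote is backslash-escaped.
print('\\"' in raw)

# json.loads undoes the escaping and returns the original markup.
decoded = json.loads(raw)["componentData"]
print(decoded == html_fragment)
print('\\"' in decoded)
```

Feeding `raw` straight to an HTML parser would keep those backslashes inside the attribute values. With the live request above, the loop prints each parsed table in turn.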
Output:
In [31]: for df in dfs:
    ...:     print(df)
    ...:     print('-'*80)
    ...:
Margins % of Sales 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 TTM
0 Revenue 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
1 COGS — — — — — — — — — — —
2 Gross Margin — — — — — — — — — — —
3 SG&A 14.65 12.90 13.53 13.82 12.51 10.86 10.61 10.47 16.72 24.81 27.64
4 R&D — — — — — — — — — — —
5 Other -14.65 -12.90 -13.53 -13.82 -12.51 -10.86 -10.61 -10.47 -16.72 -24.81 -27.64
6 Operating Margin 39.77 39.15 36.12 39.07 40.33 40.32 38.95 35.34 26.15 -14.77 -30.88
7 Net Int Inc & Other — — — — — — — — — — —
8 EBT Margin 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
--------------------------------------------------------------------------------
Profitability 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 TTM
0 Tax Rate % 25.93 25.92 26.87 32.50 27.90 27.19 27.50 30.74 36.13 — —
1 Net Margin % 28.51 27.57 26.95 25.64 28.54 28.99 27.68 23.83 16.14 -21.27 -36.92
2 Asset Turnover (Average) 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.02 0.02 0.01
3 Return on Assets % 0.95 0.95 0.89 0.78 0.91 0.87 0.79 0.62 0.37 -0.32 -0.48
4 Financial Leverage (Average) 15.79 15.79 19.65 15.97 13.52 14.71 14.03 14.58 15.63 13.29 13.63
5 Return on Equity % 15.86 15.07 15.85 13.66 13.22 12.29 11.36 8.93 5.64 -4.64 -6.69
6 Return on Invested Capital % — — — — — — — — — — —
7 Interest Coverage — — — — — — — — — — —
--------------------------------------------------------------------------------
Unnamed: 0 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 Latest Qtr
0 Revenue % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 Year over Year — 28.97 22.71 4.25 15.13 10.21 5.54 -2.81 -5.64 -36.31 —
2 3-Year Average — 24.18 22.61 18.16 13.78 9.77 10.23 4.17 -1.08 -16.41 —
3 5-Year Average — — 21.69 19.62 17.21 15.92 11.37 6.29 4.20 -7.44 —
4 10-Year Average — — — — — — — 13.73 11.64 4.16 —
5 Operating Income % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 Year over Year — 26.94 13.24 12.76 18.85 10.67 -0.22 -10.30 -30.16 — —
7 3-Year Average — 21.47 19.44 17.47 14.92 14.04 9.48 -0.32 -14.50 — —
8 5-Year Average — 38.77 24.13 18.01 17.96 16.35 10.88 5.83 -3.84 — —
9 10-Year Average — 10.44 14.62 20.17 20.75 20.06 24.04 14.62 6.52 — —
10 Net Income % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 Year over Year — 24.71 19.96 -0.82 28.17 11.93 0.78 -16.31 -36.11 — —
12 3-Year Average — 21.65 20.54 14.06 15.10 12.47 13.08 -1.90 -18.62 — —
13 5-Year Average — 37.31 25.86 16.46 17.36 16.31 11.46 3.72 -5.02 — —
14 10-Year Average — 11.40 16.06 19.77 15.66 21.38 23.71 14.25 5.17 — —
15 EPS % NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
16 Year over Year — 18.98 9.87 -13.50 21.19 2.69 -0.25 -17.55 -37.67 — —
17 3-Year Average — -75.02 14.12 4.18 4.82 2.49 7.48 -5.48 -19.97 — —
18 5-Year Average — 41.12 21.93 -56.93 9.27 7.07 3.36 -2.41 -8.60 — —
19 10-Year Average — 7.80 11.45 14.14 9.87 15.55 20.78 9.09 -37.26 — —
--------------------------------------------------------------------------------
Cash Flow Ratios 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 TTM
0 Operating Cash Flow Growth % YOY — — — — — — -267.00 — — — —
1 Free Cash Flow Growth % YOY — — — — — — -206.00 — — — —
2 Cap Ex as a % of Sales 3.07 4.57 11.31 1.98 2.44 1.72 0.96 1.19 1.17 1.26 1.52
3 Free Cash Flow/Sales % 102.33 180.91 176.32 -25.15 -112.03 108.82 100.71 53.03 323.44 -286.34 -93.74
4 Free Cash Flow/Net Income 3.59 6.56 6.88 -0.98 -3.93 3.73 3.62 2.22 20.04 13.46 2.59
--------------------------------------------------------------------------------
Balance Sheet Items (in %) 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 Latest Qtr
0 Cash & Short-Term Investments 2.89 3.09 5.55 4.15 6.34 7.91 9.59 8.09 13.40 10.20 10.01
1 Accounts Receivable — — — — — — — — — — —
2 Inventory — — — — — — — — — — —
3 Other Current Assets — — — — — — — — — — —
4 Total Current Assets — — — — — — — — — — —
5 Net PP&E 0.81 0.88 0.82 0.94 0.87 0.85 1.04 1.02 1.10 1.13 1.13
6 Intangibles 2.31 1.94 1.46 1.52 1.35 1.18 1.15 0.90 0.71 0.72 0.73
7 Other Long-Term Assets — — — — — — — — — — —
8 Total Assets 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
9 Accounts Payable 0.03 0.06 0.12 0.18 0.19 0.17 0.17 0.16 0.12 0.12 1.06
10 Short-Term Debt — — — — — — — — — — —
11 Taxes Payable 0.03 0.06 0.12 0.18 0.19 0.17 0.17 0.16 0.12 0.12 1.06
12 Accrued Liabilities — — — — — — — — — — —
13 Other Short-Term Liabilities — — — — — — — — — — —
14 Total Current Liabilities — — — — — — — — — — —
15 Long-Term Debt — — — — — — — — — — —
16 Other Long-Term Liabilities — — — — — — — — — — —
17 Total Liabilities 93.67 93.67 94.91 93.74 92.60 93.20 92.87 93.14 93.60 92.48 —
18 Total Stockholders' Equity 6.33 6.33 5.09 6.26 7.40 6.80 7.13 6.86 6.40 7.52 100.00
19 Total Liabilities & Equity 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
--------------------------------------------------------------------------------
Liquidity/Financial Health 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 Latest Qtr
0 Current Ratio — — — — — — — — — — —
1 Quick Ratio — — — — — — — — — — —
2 Financial Leverage 15.79 15.79 19.65 15.97 13.52 14.71 14.03 14.58 15.63 13.29 13.63
3 Debt/Equity — — — — — — — — — — —
--------------------------------------------------------------------------------
Efficiency 2006-12 2007-12 2008-12 2009-12 2010-12 2011-12 2012-12 2013-12 2014-12 2015-12 TTM
0 Days Sales Outstanding — — — — — — — — — — —
1 Days Inventory — — — — — — — — — — —
2 Payables Period — — — — — — — — — — —
3 Cash Conversion Cycle — — — — — — — — — — —
4 Receivables Turnover — — — — — — — — — — —
5 Inventory Turnover — — — — — — — — — — —
6 Fixed Assets Turnover 4.19 4.08 3.91 3.43 3.53 3.49 3.01 2.53 2.18 1.36 1.16
7 Asset Turnover 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.03 0.02 0.02 0.01
--------------------------------------------------------------------------------
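The same approach can be tried offline on a tiny payload. The sketch below is only an illustration under assumptions: the `payload` string and its two tables are invented, loosely modelled on the output above, and it needs pandas plus an HTML parser such as lxml installed. It wraps the markup in `io.StringIO`, since recent pandas versions warn when `read_html` is given a literal HTML string, and it passes `na_values` so the dash placeholders become NaN and the year columns come back numeric.

```python
import io
import json

import pandas as pd

# Hypothetical payload: two small tables stored under "componentData".
DASH = "\u2014"  # the placeholder character shown in the output above
tables = (
    "<table><thead><tr><th>Margins % of Sales</th><th>2006-12</th><th>2007-12</th></tr></thead>"
    "<tbody><tr><th>Revenue</th><td>100.00</td><td>100.00</td></tr>"
    f"<tr><th>COGS</th><td>{DASH}</td><td>55.63</td></tr></tbody></table>"
    "<table><thead><tr><th>Profitability</th><th>2006-12</th><th>2007-12</th></tr></thead>"
    "<tbody><tr><th>Tax Rate %</th><td>22.17</td><td>5.29</td></tr></tbody></table>"
)
payload = json.dumps({"componentData": tables})

# Decode the JSON, wrap the fragment in <html> tags, and parse every table.
body = "<html>{}</html>".format(json.loads(payload)["componentData"])
dfs = pd.read_html(io.StringIO(body), na_values=[DASH])

for df in dfs:
    print(df)
    print("-" * 80)
```

Each `<table>` in the fragment becomes one DataFrame, so `dfs[0]` holds the margins table and `dfs[1]` the profitability table.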