Parsing part of a table's data with BeautifulSoup

Asked: 2016-08-14 09:57:55

Tags: python html beautifulsoup

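I have a personal project that uses BeautifulSoup and Python to scrape data online, and I thought historical stock data would be good practice. I looked at the page source here to work out how to use BeautifulSoup's select() or findAll() to parse part of the data in a table. This is the code I used, but it also returns things other than the table rows I want.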

soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.findAll('td', {'class': 'yfnc_tabledata1'})
print table

My question: how do I parse only the 2 rows that show the 2 days of data in the table?

Here is the table containing the 2 days of historical data:

<table class="yfnc_datamodoutline1" width="100%" cellpadding="0" cellspacing="0" border="0">

<tr valign="top">
<td>

<table border="0" cellpadding="2" cellspacing="1" width="100%">
<tr>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Date</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Open</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">High</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">Low</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="12%">close</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="16%">Volume</th>
<th scope="col" class="yfnc_tablehead1" align="right" width="15%">Adj Close*</th>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>
<tr>
<td class="yfnc_tabledata1" colspan="7" align="center">
* <small>Close price adjusted for dividends and splits.</small>
</td>
</tr>
</table>

</td>
</tr>
</table>

I only need these 2 specific rows from the table above:

<tr>
<td class="yfnc_tabledata1" nowrap align="right">12 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.44</td>
<td class="yfnc_tabledata1" align="right">107.78</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
<td class="yfnc_tabledata1" align="right">18,612,300</td>
<td class="yfnc_tabledata1" align="right">108.18</td>
</tr>
<tr>
<td class="yfnc_tabledata1" nowrap align="right">11 Aug 2016</td>
<td class="yfnc_tabledata1" align="right">108.52</td>
<td class="yfnc_tabledata1" align="right">108.93</td>
<td class="yfnc_tabledata1" align="right">107.85</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
<td class="yfnc_tabledata1" align="right">27,484,500</td>
<td class="yfnc_tabledata1" align="right">107.93</td>
</tr>

1 Answer:

Answer 0: (score: 0)

You can select all the rows of the table nested inside the yfnc_datamodoutline1 table and then slice off the first two:

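from bs4 import BeautifulSoup

# html holds the page source; name the parser explicitly (the question already uses lxml)
soup = BeautifulSoup(html, "lxml")
# "tr + tr" matches every row that directly follows another row, so the header row is skipped
table_rows = soup.select("table.yfnc_datamodoutline1 table tr + tr")
# the footnote row is matched as well, so keep only the first two rows
row1, row2 = table_rows[0:2]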

print(row1)
print(row2)

Which gives you:

<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">12 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.44</td>
<td align="right" class="yfnc_tabledata1">107.78</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
<td align="right" class="yfnc_tabledata1">18,612,300</td>
<td align="right" class="yfnc_tabledata1">108.18</td>
</tr>
<tr>
<td align="right" class="yfnc_tabledata1" nowrap="">11 Aug 2016</td>
<td align="right" class="yfnc_tabledata1">108.52</td>
<td align="right" class="yfnc_tabledata1">108.93</td>
<td align="right" class="yfnc_tabledata1">107.85</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
<td align="right" class="yfnc_tabledata1">27,484,500</td>
<td align="right" class="yfnc_tabledata1">107.93</td>
</tr>

To get the td data, just extract the text from each td:

print([td.text for td in row1.find_all("td")])
print([td.text for td in row2.find_all("td")])

Which gives you:

[u'12 Aug 2016', u'107.78', u'108.44', u'107.78', u'108.18', u'18,612,300', u'108.18']
[u'11 Aug 2016', u'108.52', u'108.93', u'107.85', u'107.93', u'27,484,500', u'107.93']
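The cell text comes back as strings. If you need real dates and numbers, a conversion along these lines may help. This is only a sketch: parse_row and the dictionary keys are names made up for illustration, and the date format "%d %b %Y" is assumed from the two sample rows (it also assumes an English locale for the month abbreviation).

```python
from datetime import datetime


def parse_row(cells):
    """Convert one row of cell text into typed values (hypothetical helper)."""
    date, open_, high, low, close, volume, adj_close = cells
    return {
        "date": datetime.strptime(date, "%d %b %Y").date(),
        "open": float(open_),
        "high": float(high),
        "low": float(low),
        "close": float(close),
        "volume": int(volume.replace(",", "")),  # drop thousands separators
        "adj_close": float(adj_close),
    }


# The first row of text extracted above
row = parse_row(['12 Aug 2016', '107.78', '108.44', '107.78',
                 '108.18', '18,612,300', '108.18'])
print(row["date"], row["volume"])
```

Unpacking into seven names means a row with a different number of cells, such as the footnote row, raises a ValueError instead of being silently misread.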

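table.yfnc_datamodoutline1 table tr + tr selects every row of the inner table except the first one, which is the header row. The footnote row at the bottom is matched too, which is why only the first two results are kept.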
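If the number of data rows is not known in advance, slicing the first two will not generalise. One possible alternative, sketched here on a trimmed-down copy of the table from the question (fewer columns, same nesting and class names), is to drop any row that contains a colspan cell. The html.parser backend is used only to keep the example free of extra dependencies, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the question's table (assumption: same structure, fewer columns).
html = """
<table class="yfnc_datamodoutline1">
<tr><td>
<table>
<tr><th>Date</th><th>Close</th></tr>
<tr><td class="yfnc_tabledata1">12 Aug 2016</td><td class="yfnc_tabledata1">108.18</td></tr>
<tr><td class="yfnc_tabledata1">11 Aug 2016</td><td class="yfnc_tabledata1">107.93</td></tr>
<tr><td class="yfnc_tabledata1" colspan="2">* <small>Close price adjusted for dividends and splits.</small></td></tr>
</table>
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Header row is skipped by "tr + tr"; the footnote row is still matched here.
rows = soup.select("table.yfnc_datamodoutline1 table tr + tr")

# Keep only rows without a colspan cell, which drops the footnote row.
data_rows = [r for r in rows if not r.find("td", colspan=True)]
data = [[td.get_text(strip=True) for td in r.find_all("td")] for r in data_rows]

print(len(rows))
print(data)
```

Here the selector matches three rows and the filter leaves the two data rows, however many days the table happens to list.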