Python: parsing values from a table with BeautifulSoup

Time: 2015-12-01 09:24:54

Tags: python parsing beautifulsoup

I am parsing a table in a saved .html document.


The HTML is as follows:

<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;"><tbody>
                                        <tr><td><ul><li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li><li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li><li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li><li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li><li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">↓</span></li><li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">↓</span></li><li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">↓</span></li><li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">↓</span></li><li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">↑</span></li><li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">↓</span></li><li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">↓</span></li><li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">↑</span></li><li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">↑</span></li><li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">↓</span></li></ul></td></tr>
                                    </tbody></table>

What I have so far is:

list_a = soup.find_all('table')[0].tbody.find_all("tr")

for a in list_a:
    for b in a:
        for c in b:
            for d in c:
                for e in d:
                    print e.renderContents()

It doesn't look very nice, but it produces the following result:

15:00:19
11.750
5392
↑
14:56:55
11.750
17
↑
14:56:52
11.750
479
↑

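However, the table contains too many rows, and I only want the first 10 groups of data. From each group I only need the 3rd and 4th items, collected into 2 lists:

["5392", "17", "479", …]

["↑", "↑", "↑", …] #the "↑" can be changed to something else identical if it's a problem

How can I achieve this? Thanks.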

2 answers:

Answer 0: (score: 2)

Why not look for all the span elements directly, since those are what you actually want? Instead of

list_a = soup.find_all('table')[0].tbody.find_all("tr")

use

list_a = soup.find_all('table')[0].tbody.find_all("tr")[0].find_all("span")

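I don't know whether your table has only one row. If it does, this should work and give you all the spans, and you just skip the ones you don't need. If you have multiple rows, you have to iterate over the rows like this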

list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    a.find_all("span")

which again gets all the span items. I hope this points you in the right direction!
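
The "skip the ones you don't need" step can be done with slice steps once the span texts are in one flat list. A minimal sketch, assuming every row holds exactly four spans (time, price, volume, direction). The `texts` list is hard-coded from the output shown in the question, and `FIELDS_PER_ROW`, `volumes` and `directions` are illustrative names, not part of the original answer:

```python
# Flat sequence of span texts, copied from the output printed in the question
# (first three rows only).
texts = [
    "15:00:19", "11.750", "5392", "↑",
    "14:56:55", "11.750", "17", "↑",
    "14:56:52", "11.750", "479", "↑",
]

# Assumption: each row contributes exactly four spans.
FIELDS_PER_ROW = 4

# Start at the 3rd item (index 2) and take every 4th item after it.
volumes = texts[2::FIELDS_PER_ROW]
# Start at the 4th item (index 3) and take every 4th item after it.
directions = texts[3::FIELDS_PER_ROW]

print(volumes)
print(directions)
```

With BeautifulSoup, `texts` could be built from the span list, for example `[s.get_text() for s in list_a]`. If a row is ever missing a span, the stride no longer lines up, so indexing within each row is the safer choice in that case.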

Answer 1: (score: 1)

The following extracts your two columns using the span tags inside each li element:

html = """
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;">
<tbody>
<tr>
    <td>
    <ul>
    <li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">?</span></li>
    <li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">?</span></li>
    <li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">?</span></li>
    <li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">?</span></li>
    <li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">?</span></li>
    <li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">?</span></li>
    <li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">?</span></li>
    <li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">?</span></li>
    <li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">?</span></li>
    <li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">?</span></li>
    <li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">?</span></li>
    <li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">?</span></li>
    <li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">?</span></li>
    <li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">?</span></li>
    </ul>
    </td>
</tr>
</tbody></table>"""

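from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, "html.parser")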

col_3 = []
col_4 = []

for li in soup.find_all('table')[0].find_all("li"):
    cols = li.find_all("span")
    col_3.append(cols[2].text)
    col_4.append(cols[3].text)

print col_3 
print col_4

This gives you the following output:

[u'5392', u'17', u'479', u'6', u'333', u'21', u'15', u'35', u'11', u'3', u'24', u'291', u'198', u'15']
[u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
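
The question also asked for only the first 10 rows, which the code above does not limit. One way, sketched here in Python 3 and assuming the `bs4` package is installed, is to slice the list of `li` elements before looping. The names `volumes` and `directions` are illustrative, and the arrows are copied from the question's HTML:

```python
from bs4 import BeautifulSoup

html = """
<table id="detailBody"><tbody><tr><td><ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li>
<li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li>
<li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">↓</span></li>
<li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">↓</span></li>
<li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">↓</span></li>
<li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">↓</span></li>
<li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">↑</span></li>
<li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">↓</span></li>
<li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">↓</span></li>
<li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">↑</span></li>
<li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">↑</span></li>
<li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">↓</span></li>
</ul></td></tr></tbody></table>
"""

soup = BeautifulSoup(html, "html.parser")

volumes = []
directions = []

# [:10] keeps only the first 10 <li> rows; a shorter table simply yields fewer.
for li in soup.find("table", id="detailBody").find_all("li")[:10]:
    cols = li.find_all("span")
    volumes.append(cols[2].get_text())     # 3rd span in the row
    directions.append(cols[3].get_text())  # 4th span in the row

print(volumes)
print(directions)
```

Slicing after `find_all` is used here because it is easy to read. For very large tables, passing `limit=10` to `find_all("li")` would stop the search early instead of collecting every row first.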