I am parsing a table in a saved .html document.
The HTML looks like this:
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;"><tbody>
<tr><td><ul><li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li><li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li><li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">↑</span></li><li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li><li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">↓</span></li><li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">↓</span></li><li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">↓</span></li><li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">↓</span></li><li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">↑</span></li><li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">↓</span></li><li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">↓</span></li><li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">↑</span></li><li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">↑</span></li><li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">↓</span></li></ul></td></tr>
</tbody></table>
This is what I have so far:
list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    for b in a:
        for c in b:
            for d in c:
                for e in d:
                    print e.renderContents()
It is not pretty, but it produces the following:
15:00:19
11.750
5392
↑
14:56:55
11.750
17
↑
14:56:52
11.750
479
↑
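However, the table contains too many entries, and I only want the first 10 groups. From each group I only need the 3rd and 4th items, collected into two lists,
i.e.
["5392", "17", "479", ...]
and
["↑", "↑", "↑", ...] # the "↑" can be replaced with some other consistent marker if it causes problems
How can I achieve this? Thanks.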
Answer 0 (score: 2)
Why not look up all the span elements directly, since those are what you actually want? Instead of
list_a = soup.find_all('table')[0].tbody.find_all("tr")
try
list_a = soup.find_all('table')[0].tbody.find_all("tr")[0].find_all("span")
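I don't know whether your table has only one row. If it does, this should work and give you all the spans; you then just skip the ones you don't need. If there are multiple rows, you have to iterate over them like this: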
list_a = soup.find_all('table')[0].tbody.find_all("tr")
for a in list_a:
    spans = a.find_all("span")
to get all the span elements of each row. I hope this points you in the right direction!
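As a rough sketch of the "skip the ones you don't need" step, the spans of each row can be sliced with a stride of four. This assumes Python 3 with BeautifulSoup 4 (`bs4`) installed, and that every `li` holds exactly four spans, as in the question's markup. The shortened `html` sample and the names `third_items` and `fourth_items` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same structure as the table in the question.
html = """<table id="detailBody"><tbody><tr><td><ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">↑</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">↑</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">↓</span></li>
</ul></td></tr></tbody></table>"""

soup = BeautifulSoup(html, "html.parser")

third_items = []   # hypothetical name: 3rd span of every group
fourth_items = []  # hypothetical name: 4th span of every group
for row in soup.find_all("table")[0].tbody.find_all("tr"):
    spans = row.find_all("span")
    # Assumption: each <li> contributes exactly four spans, in document order,
    # so every 4th span starting at index 2 (or 3) belongs to the same column.
    third_items.extend(s.get_text() for s in spans[2::4])
    fourth_items.extend(s.get_text() for s in spans[3::4])

print(third_items)
print(fourth_items)
```

If a group could ever have a different number of spans, the stride trick would misalign the columns, and iterating per `li` would be the safer choice.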
Answer 1 (score: 1)
The following extracts your two columns using the span tags inside each li element:
html = """
<table id="detailBody" width="100%" cellspacing="0" cellpadding="0" border="0" class="tab2" style="display: block;">
<tbody>
<tr>
<td>
<ul>
<li><span>15:00:19</span><span class="red">11.750</span><span class="red">5392</span><span class="fr red">?</span></li>
<li><span>14:56:55</span><span class="red">11.750</span><span class="red">17</span><span class="fr red">?</span></li>
<li><span>14:56:52</span><span class="red">11.750</span><span class="red">479</span><span class="fr red">?</span></li>
<li><span>14:56:49</span><span class="">11.740</span><span class="green">6</span><span class="fr green">?</span></li>
<li><span>14:56:46</span><span class="">11.740</span><span class="green">333</span><span class="fr green">?</span></li>
<li><span>14:56:43</span><span class="">11.740</span><span class="green">21</span><span class="fr green">?</span></li>
<li><span>14:56:40</span><span class="">11.740</span><span class="green">15</span><span class="fr green">?</span></li>
<li><span>14:56:37</span><span class="">11.740</span><span class="green">35</span><span class="fr green">?</span></li>
<li><span>14:56:34</span><span class="red">11.750</span><span class="red">11</span><span class="fr red">?</span></li>
<li><span>14:56:31</span><span class="">11.740</span><span class="green">3</span><span class="fr green">?</span></li>
<li><span>14:56:28</span><span class="">11.740</span><span class="green">24</span><span class="fr green">?</span></li>
<li><span>14:56:22</span><span class="red">11.750</span><span class="red">291</span><span class="fr red">?</span></li>
<li><span>14:56:19</span><span class="">11.740</span><span class="red">198</span><span class="fr red">?</span></li>
<li><span>14:56:16</span><span class="green">11.730</span><span class="green">15</span><span class="fr green">?</span></li>
</ul>
</td>
</tr>
</tbody></table>"""
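from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(html, "html.parser")
col_3 = []
col_4 = []
for li in soup.find_all('table')[0].find_all("li"):
    cols = li.find_all("span")
    col_3.append(cols[2].text)   # 3rd span of the group
    col_4.append(cols[3].text)   # 4th span of the group
# Under Python 2 the lists print with u'' prefixes; Python 3 omits them.
print(col_3)
print(col_4)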
This gives you the following output:
[u'5392', u'17', u'479', u'6', u'333', u'21', u'15', u'35', u'11', u'3', u'24', u'291', u'198', u'15']
[u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?', u'?']
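The output above covers all 14 groups, while the question asks for only the first 10. One possible way to cap the result is the `limit` argument of `find_all` (slicing the returned list with `[:10]` would work as well). The sketch below assumes Python 3 with BeautifulSoup 4. The `rows` data is copied from the question, and the markup is rebuilt in simplified form without the `class` attributes.

```python
from bs4 import BeautifulSoup

# Values taken from the question's table: (time, price, 3rd item, 4th item).
rows = [
    ("15:00:19", "11.750", "5392", "↑"), ("14:56:55", "11.750", "17", "↑"),
    ("14:56:52", "11.750", "479", "↑"), ("14:56:49", "11.740", "6", "↓"),
    ("14:56:46", "11.740", "333", "↓"), ("14:56:43", "11.740", "21", "↓"),
    ("14:56:40", "11.740", "15", "↓"), ("14:56:37", "11.740", "35", "↓"),
    ("14:56:34", "11.750", "11", "↑"), ("14:56:31", "11.740", "3", "↓"),
    ("14:56:28", "11.740", "24", "↓"), ("14:56:22", "11.750", "291", "↑"),
    ("14:56:19", "11.740", "198", "↑"), ("14:56:16", "11.730", "15", "↓"),
]
# Simplified stand-in for the real markup (class attributes omitted).
items = "".join(
    "<li><span>{}</span><span>{}</span><span>{}</span><span>{}</span></li>".format(*r)
    for r in rows
)
html = '<table id="detailBody"><tbody><tr><td><ul>' + items + "</ul></td></tr></tbody></table>"

soup = BeautifulSoup(html, "html.parser")

col_3 = []
col_4 = []
# limit=10 makes find_all stop after the first ten matching <li> elements.
for li in soup.find("table", id="detailBody").find_all("li", limit=10):
    cols = li.find_all("span")
    col_3.append(cols[2].get_text())
    col_4.append(cols[3].get_text())

print(col_3)
print(col_4)
```

Iterating per `li` keeps each group's 3rd and 4th items paired even if some group has extra spans, which is why the cap is applied to the `li` elements rather than to the flat list of spans.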