Retrieving text between HTML tags using BeautifulSoup

Time: 2014-02-25 02:39:04

Tags: python-2.7 beautifulsoup

I have a large HTML document. I have already parsed out this fragment of it, which is saved as output.html:

<td width="513">
<b>abc (4pa11cs031) </b><br/><br/><br/><br/><hr/>
<table><tbody><tr><td><b>Semester:</b></td><td><b>5</b></td><td></td><td> &nbsp;&nbsp;&nbsp;&nbsp;<b> Result:&nbsp;&nbsp;FIRST CLASS </b></td></tr></tbody></table><hr/><br/>
<table><tbody>
<tr><td width="250">Subject</td><td align="center" width="60">External </td><td align="center" width="60">Internal</td><td align="center" width="60">Total</td><td align="center" width="60">Result</td></tr>
<tr><td width="250"><i>Software Engineering (10IS51)</i></td><td align="center" width="60">58</td><td align="center" width="60">24</td><td align="center" width="60">82</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Systems Software (10CS52)</i></td><td align="center" width="60">70</td><td align="center" width="60">24</td><td align="center" width="60">94</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Operating Systems (10CS53)</i></td><td align="center" width="60">58</td><td align="center" width="60">18</td><td align="center" width="60">76</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Database Management Systems (10CS54)</i></td><td align="center" width="60">42</td><td align="center" width="60">25</td><td align="center" width="60">67</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Computer Networks - I (10CS55)</i></td><td align="center" width="60">62</td><td align="center" width="60">23</td><td align="center" width="60">85</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Formal Languages &amp; Automata Theory (10CS56)</i></td><td align="center" width="60">37</td><td align="center" width="60">24</td><td align="center" width="60">61</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Database Applications Laboratory (10CSL57)</i></td><td align="center" width="60">40</td><td align="center" width="60">25</td><td align="center" width="60">65</td><td align="center" width="60"><b>P</b></td></tr>
<tr><td width="250"><i>Systems Software &amp; Operating Systems Lab. (10CSL58)</i></td><td align="center" width="60">40</td><td align="center" width="60">21</td><td align="center" width="60">61</td><td align="center" width="60"><b>P</b></td></tr>
</tbody></table><br/><br/>
<table><tbody><tr><td></td><td></td><td>Total Marks:</td><td> 591 &nbsp;&nbsp;&nbsp; </td></tr></tbody></table> </td>

I am using BeautifulSoup to parse it and retrieve the values in the table.

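from bs4 import BeautifulSoup  # import missing from the post; bs4 is assumed

def getval():
    # Parse the saved fragment.
    with open('output.html') as page_html:
        soup = BeautifulSoup(page_html, 'html.parser')
    # Collect every <b> element (whole elements, not just their text).
    all_tds = soup.findAll("b")
    lol = all_tds[0]
    # '%s' % tag renders the element with its markup, e.g. <b>...</b>.
    record = '%s' % (lol)
    # Note: this overwrites the input file output.html.
    with open('output.html', 'w') as fl:
        fl.write(record)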

Suppose I want everything between the <b> tags. With the code above, I currently get the text together with the tags:

[<b>abc (4pa11cs031) </b>, <b>Semester:</b>, <b>5</b>, <b> Result:  FIRST CLASS </b>, <b>P</b>, <b>P</b>, <b>P</b>, <b>P</b>, <b>P</b>, <b>P</b>, <b>P</b>, <b>P</b>]

How do I get just the text between the tags?

1 answer:

Answer 0 (score: 1)

Use .string. See the Beautiful Soup documentation on .string.
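
To illustrate the suggestion, here is a minimal sketch rather than a definitive solution. It assumes Beautiful Soup 4 (the `bs4` package) with the built-in `html.parser`; the helper name `bold_texts` and the shortened `HTML` sample are made up for this example and do not come from the question. It reads `.string` on each `<b>` element and falls back to `get_text()` when `.string` is `None`, which happens when a tag has no children or more than one.

```python
from bs4 import BeautifulSoup  # assumes the bs4 package (Beautiful Soup 4)

# Shortened stand-in for the fragment in the question (hypothetical sample).
HTML = (
    '<td><b>abc (4pa11cs031) </b><br/>'
    '<table><tbody><tr><td><b>Semester:</b></td><td><b>5</b></td>'
    '<td><b> Result: FIRST CLASS </b></td></tr></tbody></table></td>'
)


def bold_texts(html):
    """Return the text inside each <b> element, without the tags."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for tag in soup.find_all("b"):
        # .string is the tag's single text child; it is None when the tag
        # has no children or several, so fall back to get_text() then.
        text = tag.string if tag.string is not None else tag.get_text()
        texts.append(text.strip())
    return texts


texts = bold_texts(HTML)
print(texts)
```

If only the first value is needed, as in the question, `bold_texts(HTML)[0]` would give the name and ID without the surrounding `<b>` markup. Calling `get_text()` everywhere would also work; `.string` is used here only because it is what the answer recommends.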