How to get the information inside an HTML class using Python and BeautifulSoup

Time: 2017-03-30 21:40:28

Tags: python beautifulsoup


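I am trying to extract the text from each "vote-description" entry in the ul class, but only the text that comes before the link tags, i.e. everything before the first <a> tag.

My code currently pulls everything out of each entry, like this: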

import urllib2
from BeautifulSoup import *

response = urllib2.urlopen("https://www.theyworkforyou.com/mp/10001/diane_abbott/hackney_north_and_stoke_newington/votes")
html = response.read()
soup = BeautifulSoup(html)

desc = soup.findAll('ul',{'class':"vote-descriptions"})
for line in desc:
    print line.findAll('li')

How can I change this code so that it excludes everything except the part before the <a> tags? For example, the first list entry found is:

<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999&ndash;2014</a>
</li>

But what I want to print is:

Generally voted for equal <b>gay rights</b>

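Any help is much appreciated!

1 answer:

Answer 0 (score: 0)

Use .contents (see the BeautifulSoup documentation: .contents and .children). Here li.contents[0] is the text node at the start of the entry and li.b is its first <b> child: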

for line in desc:
    li_list = line.findAll('li')
    for li in li_list:
        print "%s%s" % (li.contents[0],li.b)

Test

from bs4 import BeautifulSoup
html ="""<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999&ndash;2014</a></li>"""
soup = BeautifulSoup(html, "html.parser")
for line in soup.findAll('li'):
    print "%s%s" % (line.contents[0],line.b)

Output

$ python test.py 
Generally voted for equal  <b>gay rights</b>
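
The snippet above relies on every entry being exactly one text node followed by one <b> tag. If an entry has no <b>, or has more nodes before the links, a more general option is to walk li.contents and stop at the first <a>. The following is only a sketch, assuming Python 3 and bs4; the helper name before_first_link and the shortened sample markup (with "#" in place of the real hrefs) are made up for illustration:

```python
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened sample markup (assumption: real entries follow the same shape).
html = (
    '<ul class="vote-descriptions">'
    '<li> Generally voted for equal <b>gay rights</b> '
    '<a class="vote-description__source" href="#">Show votes</a> '
    '<a class="vote-description__evidence" href="#">11 votes for</a>'
    '</li></ul>'
)


def is_link(node):
    """True if the node is an <a> tag (text nodes are not Tag instances)."""
    return isinstance(node, Tag) and node.name == "a"


def before_first_link(li):
    """Hypothetical helper: markup of `li` up to, but excluding, its first <a> child."""
    leading = takewhile(lambda node: not is_link(node), li.contents)
    return "".join(str(node) for node in leading).strip()


soup = BeautifulSoup(html, "html.parser")
results = [
    before_first_link(li)
    for ul in soup.find_all("ul", class_="vote-descriptions")
    for li in ul.find_all("li")
]
for text in results:
    print(text)
```

Because str() is applied to each child, inline tags such as <b> are kept as markup. If only plain text is wanted, replacing str(node) with node.get_text() should drop the tags instead.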