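On this website (the URL appears in the code below), I am trying to extract the text of every entry in each ul with class "vote-descriptions", but only the text that comes before the a tags, i.e. everything before the first link.
My code, which extracts everything from each entry, is as follows: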
import urllib2
from BeautifulSoup import *
response = urllib2.urlopen("https://www.theyworkforyou.com/mp/10001/diane_abbott/hackney_north_and_stoke_newington/votes")
html = response.read()
soup = BeautifulSoup(html)
desc = soup.findAll('ul',{'class':"vote-descriptions"})
for line in desc:
    print line.findAll('li')
How can I change this code so that it excludes everything except the part before the a tags? For example, the first list entry found is:
<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999–2014</a>
</li>
But what I want to print is:
Generally voted for equal <b>gay rights</b>
Any help is much appreciated!
Answer 0 (score: 0)
Use contents
BeautifulSoup Doc - .contents and .children
for line in desc:
    li_list = line.findAll('li')
    for li in li_list:
        # contents[0] is the leading text node; li.b is the first <b> child
        print "%s%s" % (li.contents[0],li.b)
Test
from bs4 import BeautifulSoup
html ="""<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999–2014</a></li>"""
soup = BeautifulSoup(html)
for line in soup.findAll('li'):
    print "%s%s" % (line.contents[0],line.b)
Output
$ python test.py
Generally voted for equal <b>gay rights</b>
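The snippet above assumes each li holds exactly one text node followed by a b tag; an entry without a b would print "None". A more general variant, sketched here for Python 3 with the bs4 package, walks li.contents and stops at the first a child. The helper name text_before_first_link, the shortened sample HTML, the "#" hrefs and the second list entry are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_before_first_link(li):
    """Return the markup of an <li> up to, but not including, its first <a> child."""
    parts = []
    for node in li.contents:
        # Stop as soon as a direct child is an <a> tag.
        if isinstance(node, Tag) and node.name == "a":
            break
        # Text nodes become plain strings, tags such as <b> keep their markup.
        parts.append(str(node))
    return "".join(parts).strip()


# Abbreviated sample data; the second <li> is a hypothetical entry without <b>.
html = (
    '<ul class="vote-descriptions">'
    '<li> Generally voted for equal <b>gay rights</b> '
    '<a class="vote-description__source" href="#">Show votes</a> '
    '<a class="vote-description__evidence" href="#">11 votes for, 1 vote against</a>'
    '</li>'
    '<li> Example entry without bold text '
    '<a class="vote-description__source" href="#">Show votes</a>'
    '</li>'
    '</ul>'
)

soup = BeautifulSoup(html, "html.parser")
results = [
    text_before_first_link(li)
    for ul in soup.find_all("ul", class_="vote-descriptions")
    for li in ul.find_all("li")
]

for entry in results:
    print(entry)
```

Because the loop only inspects direct children of each li, a link nested inside another tag would not stop it; that case is not covered by this sketch.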