Question

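I'm trying to extract the text of every entry in each ul with class "vote-descriptions", but only the text that comes before the a tags, i.e. everything before the first link in each entry.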

My code currently pulls everything out of each entry, like this:

import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the MP's voting record page
response = urllib2.urlopen("https://www.theyworkforyou.com/mp/10001/diane_abbott/hackney_north_and_stoke_newington/votes")
html = response.read()
soup = BeautifulSoup(html)

# Find every <ul class="vote-descriptions"> and print its <li> entries
desc = soup.findAll('ul', {'class': "vote-descriptions"})
for line in desc:
    print line.findAll('li')

How can I change this code so that it drops everything except the part before the 'a' tags? For example, the first list entry found is:

<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999&ndash;2014</a>
</li>

But what I want to print is:

Generally voted for equal <b>gay rights</b>

Any help is much appreciated!

Answer 1

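Use .contents (see the BeautifulSoup documentation, ".contents and .children"). Here li.contents[0] is the text node that precedes the <b> tag, and li.b is the first <b> tag inside the entry: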

for line in desc:
    li_list = line.findAll('li')
    for li in li_list:
        print "%s%s" % (li.contents[0],li.b)

Test

from bs4 import BeautifulSoup
html ="""<li> Generally voted for equal <b>gay rights</b> <a class="vote-description__source" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">Show votes</a> <a class="vote-description__evidence" href="/mp/10001/diane_abbott/hackney_north_and_stoke_newington/divisions?policy=826">11 votes for, 1 vote against, 15 absences, between 1999&ndash;2014</a></li>"""
soup = BeautifulSoup(html)
for line in soup.findAll('li'):
    print "%s%s" % (line.contents[0],line.b)

Output

$ python test.py 
Generally voted for equal  <b>gay rights</b>
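
The snippet above relies on each entry having exactly one text node followed by one <b> tag. If entries vary, one possible generalisation is to walk .contents and stop at the first a tag. The following is only a sketch under some assumptions: it is written for Python 3 and bs4 (not the Python 2 / BeautifulSoup 3 setup in the question), the helper name text_before_first_link is made up for illustration, and the sample HTML is a shortened copy of the entry from the question with placeholder href values.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_before_first_link(li):
    """Return the markup of an <li> up to, but not including, its first <a> child.

    Hypothetical helper, not part of BeautifulSoup.
    """
    parts = []
    for child in li.contents:
        # Stop as soon as the first direct <a> child is reached
        if isinstance(child, Tag) and child.name == "a":
            break
        # str() keeps tags such as <b>...</b> and plain text nodes as they are
        parts.append(str(child))
    return "".join(parts).strip()


# Shortened sample entry; the href values are placeholders
html = (
    '<ul class="vote-descriptions">'
    '<li> Generally voted for equal <b>gay rights</b> '
    '<a class="vote-description__source" href="#">Show votes</a> '
    '<a class="vote-description__evidence" href="#">11 votes for</a>'
    '</li></ul>'
)

soup = BeautifulSoup(html, "html.parser")
results = [
    text_before_first_link(li)
    for ul in soup.find_all("ul", class_="vote-descriptions")
    for li in ul.find_all("li")
]
for entry in results:
    print(entry)
```

Because the loop only inspects direct children of the li, an a tag nested inside another tag would not stop it; that case would need a different check.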
