How do I extract some items from a list using BeautifulSoup?

Time: 2014-05-21 16:31:48

Tags: python beautifulsoup

I used find_all() to extract some data, which gave me a list containing many strings like the ones below.

<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>

What I need is the text from <a class="y">.

How can I do that? Perhaps with a loop?

1 answer:

Answer 0: (score: 2)

Here is how to do it with Beautiful Soup:

>>> from bs4 import BeautifulSoup
>>> html = '''\
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>'''
>>> soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
>>> list_of_y = soup.find_all("a", {'class': 'y'})

This returns a list of items that you can print:

>>> print(list_of_y)
[<a class="y" href="x">to make</a>, <a class="y" href="x">to make</a>, <a class="y" href="x">to make</a>]

Or iterate over:

>>> for y in list_of_y:
...   print(y.text)
to make
to make
to make
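If you want the strings collected in a list rather than printed, the loop above can be written as a list comprehension. This is only a minimal sketch, assuming Beautiful Soup 4 (the `bs4` package) with the standard-library `html.parser` backend; the helper name `extract_link_texts` is hypothetical and the sample markup is shortened to two rows.

```python
from bs4 import BeautifulSoup

# Shortened sample markup, same shape as in the question.
HTML = """\
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
<div class="x"><a class="x" href="x"><i class="x"></i></a> <a class="y" href="x">to make</a><span> something</span></div>
"""


def extract_link_texts(html, css_class="y"):
    """Return the text of every <a> whose class list includes css_class.

    Hypothetical helper, written for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    # class_ is the keyword form of the {'class': ...} dict filter.
    return [a.get_text() for a in soup.find_all("a", class_=css_class)]


texts = extract_link_texts(HTML)
print(texts)  # ['to make', 'to make']
```

The comprehension does the same work as the for loop; it just keeps the results for later use.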

However, I have a slight preference for lxml, which would be:

>>> from lxml import etree
>>> h = etree.HTML(html)
>>> list_of_y = h.xpath('//a[@class="y"]/text()')
>>> print(list_of_y)
['to make', 'to make', 'to make']
>>> for y in list_of_y:
...   print(y)
... 
to make
to make
to make

Or with CSS selectors:

>>> from lxml import etree
>>> from lxml.cssselect import CSSSelector  # needs the separate cssselect package
>>> h = etree.HTML(html)
>>> sel = CSSSelector('a.y')
>>> list_of_y = sel(h)
>>> for y in list_of_y:
...     print(y.text)
... 
to make
to make
to make
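One difference between the XPath and CSS variants is worth knowing: `@class="y"` compares the whole attribute string, while the CSS selector `a.y` matches `y` as one token among several classes. The sketch below shows a token-aware XPath that behaves like the CSS form. The sample markup with `class="y big"` and `class="yy"` is invented for illustration and is not from the question.

```python
from lxml import etree

# Invented sample: one link with two classes, one near-miss, one exact match.
HTML = (
    '<div>'
    '<a class="y big" href="x">to make</a> '
    '<a class="yy" href="x">skip</a> '
    '<a class="y" href="x">to make</a>'
    '</div>'
)

root = etree.HTML(HTML)

# Exact string comparison: only class="y" matches, "y big" is missed.
exact = root.xpath('//a[@class="y"]/text()')

# Token comparison: pad the class list with spaces and look for " y ".
token = root.xpath(
    '//a[contains(concat(" ", normalize-space(@class), " "), " y ")]/text()'
)

print(exact)  # ['to make']
print(token)  # ['to make', 'to make']
```

For markup like the question's, where every link carries a single class, both expressions return the same result.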