Using find_all

Asked: 2016-02-29 21:06:16

Tags: python python-2.7 beautifulsoup

I am trying to use find_all to scrape the contents of any 'span' tag that is inside an 'a' tag and has the itemprop="foo" attribute. I am using bs4. See below.

text = '<a><span itemprop="foo"> TEXT I WANT </span></a> \
<label><span itemprop="foo"> DO NOT WANT </span></label> \
<a><span itemprop="foo"> I WANT THIS TOO </span></a> \
<strong><a> DO NOT WANT </a></strong> \
<label><span itemprop="foo"> DO NOT WANT </span></label>'

soup = BeautifulSoup(text)

My code is as follows:

for stuff in soup.find_all("span", attrs={"itemprop" : "foo"}):
    print stuff.text

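This pulls the text from all 4 span tags, not just the 2 I want. I have tried adding the 'a' tag to that syntax, but I could not get anything to work. What is the right way to do this?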

1 answer:

Answer 0 (score: 0)

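One solution is to find all anchor tags with anchor_tags = soup.find_all('a'), then loop over anchor_tags and, inside each anchor, find the span elements that carry the appropriate itemprop attribute and print their text: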

from bs4 import BeautifulSoup

text = '<a><span itemprop="foo"> TEXT I WANT </span></a> \
<label><span itemprop="foo"> DO NOT WANT </span></label> \
<a><span itemprop="foo"> I WANT THIS TOO </span></a> \
<strong><a> DO NOT WANT </a></strong> \
<label><span itemprop="foo"> DO NOT WANT </span></label>'

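# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup(text, 'html.parser')
anchor_tags = soup.find_all('a')
for a in anchor_tags:
    # Search only inside this anchor, so spans under <label> are skipped.
    for span in a.find_all('span', attrs={'itemprop': 'foo'}):
        print(span.text)  # parenthesized form runs on Python 2.7 and 3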

Output:

 TEXT I WANT 
 I WANT THIS TOO
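
As a possible alternative to the nested loops above, here is a minimal sketch of two other ways to get the same result. It is not part of the original answer. The helper names spans_in_anchors and spans_with_anchor_parent are made up for illustration, and the sketch assumes the span is a direct child of the anchor, as in the sample markup. The first helper uses bs4's select() with a CSS selector; the second keeps the asker's original find_all call and filters on each span's parent.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question, written as adjacent string literals.
html = (
    '<a><span itemprop="foo"> TEXT I WANT </span></a> '
    '<label><span itemprop="foo"> DO NOT WANT </span></label> '
    '<a><span itemprop="foo"> I WANT THIS TOO </span></a> '
    '<strong><a> DO NOT WANT </a></strong> '
    '<label><span itemprop="foo"> DO NOT WANT </span></label>'
)


def spans_in_anchors(markup):
    """Hypothetical helper: select spans that are direct children of <a>."""
    soup = BeautifulSoup(markup, "html.parser")
    # 'a > span[itemprop="foo"]' means: a span with itemprop="foo"
    # whose immediate parent is an <a> tag.
    return [span.get_text() for span in soup.select('a > span[itemprop="foo"]')]


def spans_with_anchor_parent(markup):
    """Hypothetical helper: keep the original find_all, then filter by parent."""
    soup = BeautifulSoup(markup, "html.parser")
    matches = soup.find_all("span", attrs={"itemprop": "foo"})
    return [span.get_text() for span in matches if span.parent.name == "a"]


for text in spans_in_anchors(html):
    print(text)
```

If the span may be nested deeper inside the anchor rather than being a direct child, the selector 'a span[itemprop="foo"]' (descendant combinator instead of '>') would be the closer match to the nested-loop version.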