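I'm trying to use find_all to grab whatever is inside a <span> tag, but only when that span is also inside an 'a' tag and has the itemprop="foo" attribute. I'm using bs4. See below.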
text = '<a><span itemprop="foo"> TEXT I WANT </span></a> \
<label><span itemprop="foo"> DO NOT WANT </span></label> \
<a><span itemprop="foo"> I WANT THIS TOO </span></a> \
<strong><a> DO NOT WANT </a></strong> \
<label><span itemprop="foo"> DO NOT WANT </span></label>'
soup = BeautifulSoup(text)
My code is as follows:
for stuff in soup.find_all("span", attrs={"itemprop": "foo"}):
    print(stuff.text)
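This pulls the text from all 4 span tags instead of just the 2 I want. I've tried adding the 'a' tag to that syntax, but I can't get anything to work. What is the right way to do this?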
Answer 0 (score: 0)
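One solution is to use anchor_tags = soup.find_all('a') to find all the anchor tags, then use a for loop to iterate over anchor_tags and, inside each one, find the span elements that have the appropriate itemprop attribute: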
from bs4 import BeautifulSoup
text = '<a><span itemprop="foo"> TEXT I WANT </span></a> \
<label><span itemprop="foo"> DO NOT WANT </span></label> \
<a><span itemprop="foo"> I WANT THIS TOO </span></a> \
<strong><a> DO NOT WANT </a></strong> \
<label><span itemprop="foo"> DO NOT WANT </span></label>'
soup = BeautifulSoup(text, 'html.parser')  # name the parser explicitly
anchor_tags = soup.find_all('a')
for a in anchor_tags:
    # search only inside this anchor tag
    for span in a.find_all('span', attrs={'itemprop': 'foo'}):
        print(span.text)
Output
TEXT I WANT
I WANT THIS TOO
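
For comparison, here is a sketch of two other ways to get the same result without the nested loop. It assumes a bs4 version with CSS selector support through select, and the variable names html, via_select and via_parent are made up for illustration. The first uses a CSS selector, the second keeps the original find_all call and filters on each span's parent.

```python
from bs4 import BeautifulSoup

# Same sample markup as in the question, written as adjacent string literals.
html = (
    '<a><span itemprop="foo"> TEXT I WANT </span></a> '
    '<label><span itemprop="foo"> DO NOT WANT </span></label> '
    '<a><span itemprop="foo"> I WANT THIS TOO </span></a> '
    '<strong><a> DO NOT WANT </a></strong> '
    '<label><span itemprop="foo"> DO NOT WANT </span></label>'
)
soup = BeautifulSoup(html, "html.parser")

# CSS selector: spans with itemprop="foo" that are direct children of an <a>.
via_select = [
    span.get_text(strip=True)
    for span in soup.select('a > span[itemprop="foo"]')
]

# Same idea on top of the original find_all call: keep spans whose parent is an <a>.
via_parent = [
    span.get_text(strip=True)
    for span in soup.find_all("span", attrs={"itemprop": "foo"})
    if span.parent is not None and span.parent.name == "a"
]

print(via_select)
print(via_parent)
```

Both variants match only spans that are direct children of an anchor, whereas a.find_all in the answer above also finds spans nested deeper inside the anchor. If that deeper matching is wanted, the selector 'a span[itemprop="foo"]' (descendant combinator instead of >) should behave the same way.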