I am trying to use BeautifulSoup to extract the string contained in the span with id="titleDescriptionID".
<div class="itemText">
<div class="wrapper">
<span class="itemPromo">Customer Choice Award Winner</span>
<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details" >
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
<span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
</a>
</div>
Code snippet:
from bs4 import BeautifulSoup as bs

# Read the saved page as bytes, decode it, and drop non-ASCII characters.
with open('egg.data', 'rb') as f:
    content = f.read()
content = content.decode('utf-8', 'replace')
content = ''.join([x for x in content if ord(x) < 128])
soup = bs(content)
for itemText in soup.find_all('div', attrs={'class':'itemText'}):
    wrapper = itemText.div
    wrapper_href = wrapper.a
    for child in wrapper_href.descendants:
        if child['id'] == 'titleDescriptionID':
            print(child, "\n")
Error traceback:
Traceback (most recent call last):
  File "egg.py", line 66, in <module>
    if child['id'] == 'titleDescriptionID':
TypeError: string indices must be integers
Answer 0 (score: 11)
spans = soup.find_all('span', attrs={'id':'titleDescriptionID'})
for span in spans:
    print(span.string)
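In your code, wrapper_href.descendants contains at least 4 elements: the 2 span tags and the 2 strings those span tags contain. It walks the children recursively, so some of the items it yields are plain strings, and child['id'] fails on those.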
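If you do want to keep the loop over descendants, one possible fix is to skip everything that is not a tag before reading its attributes. The sketch below assumes bs4 with the built-in "html.parser" backend and uses a shortened copy of the markup from the question; the names html, link, kinds, and titles are only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened copy of the question's markup (product text abbreviated).
html = """
<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details">
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K</span>
<span class="itemDescription" id="lineDescriptionID" style="display:none">Intel Core i7-3770K</span>
</a>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.a

# .descendants yields Tag objects and NavigableString objects alike.
kinds = [type(child).__name__ for child in link.descendants]

# Only Tag objects carry attributes, so filter on the type first.
# Tag.get() returns None instead of raising when 'id' is absent.
titles = [
    child.string
    for child in link.descendants
    if isinstance(child, Tag) and child.get("id") == "titleDescriptionID"
]

print(kinds)
print(titles)
```

Printing kinds makes the mix of tags and strings visible, which is the cause of the TypeError in the question.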
Answer 1 (score: 2)
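wrapper_href.descendants includes any NavigableString objects, and that is what you are tripping over. NavigableString objects are essentially string objects, and the child['id'] line tries to index one of them with a string key: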
>>> next(wrapper_href.descendants)
u'\n'
Why not just load the tag directly with itemText.find('span', id='titleDescriptionID')?
Demo:
>>> for itemText in soup.find_all('div', attrs={'class':'itemText'}):
...     print(itemText.find('span', id='titleDescriptionID'))
...     print(itemText.find('span', id='titleDescriptionID').text)
...
<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K</span>
Intel Core i7-3770K Ivy Bridge 3.5GHz (3.9GHz Turbo) LGA 1155 77W Quad-Core Desktop Processor Intel HD Graphics 4000 BX80637I73770K
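One caveat with this approach: find() returns None when nothing matches, so calling .text on the result raises AttributeError for any itemText block that lacks the span. A minimal sketch of a guarded version follows, assuming bs4 with the "html.parser" backend. The helper title_texts and the sample markup (including the "#" href and the second, span-less block) are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second itemText block has no title span.
html = """
<div class="itemText"><div class="wrapper">
<a href="#"><span class="itemDescription" id="titleDescriptionID">Intel Core i7-3770K</span></a>
</div></div>
<div class="itemText"><div class="wrapper">
<span class="itemPromo">No description here</span>
</div></div>
"""


def title_texts(markup):
    """Return the text of each titleDescriptionID span, skipping blocks without one."""
    soup = BeautifulSoup(markup, "html.parser")
    texts = []
    for item_text in soup.find_all("div", class_="itemText"):
        span = item_text.find("span", id="titleDescriptionID")
        if span is not None:  # find() gives None when there is no match
            texts.append(span.get_text(strip=True))
    return texts


print(title_texts(html))
```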
Answer 2 (score: 0)
from bs4 import BeautifulSoup
pool = BeautifulSoup(html)  # where html contains the whole html as string
for item in pool.find_all('span', attrs={'id' : 'titleDescriptionID'}):
    print(item.string)
When we search for a tag with BeautifulSoup, we get back a Tag object (bs4.element.Tag), which can be used directly to access its content and attributes such as style and href.
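As a rough illustration of that attribute access, the sketch below reads a few attributes from the matched span and from its parent link. It assumes bs4 with the "html.parser" backend and a shortened copy of the question's markup; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (product text abbreviated).
html = (
    '<a href="http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501" title="View Details">'
    '<span class="itemDescription" id="titleDescriptionID" style="display:inline">Intel Core i7-3770K</span>'
    '</a>'
)

soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", id="titleDescriptionID")

style = span["style"]        # dictionary-style access; raises KeyError if missing
classes = span["class"]      # class is multi-valued, so bs4 returns a list
missing = span.get("title")  # .get() returns None for an absent attribute
href = span.parent["href"]   # the enclosing <a> tag holds the link
text = span.string           # the single string inside the span

print(style, classes, missing, href, text)
```

Using .get() rather than square brackets is a reasonable default when an attribute may not be present on every tag.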