I have the following "website" (this is part of the HTML):
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
I want to extract sometext and somelink. To do that, I wrote the following Python code:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        if not("video" in (link['href'])):
            print "Name: "+link.text
            #sibling_page=urllib2.urlopen("major_link"+link['href'])
            print " Link extracted: "+link['href']
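However, this code prints nothing. Can you tell me where my mistake is?
Answer 0 (score: 1)
Your div has no href attribute. You have to look one level deeper, at the <a> element.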
from bs4 import BeautifulSoup
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
for links in soup.find_all("div", "moduleBody"):
    for link in links.find_all("div", "feature"):
        # Search inside the current feature div for <a> elements
        for a in link.find_all("a"):
            if "video" not in a['href']:
                print("Name: " + a.text)
                print("Link extracted: " + a['href'])
This prints:
Name: sometext
Link extracted: somelink
Name: sometext
Link extracted: somelink
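The link is found twice because your HTML is broken: the first <div class="feature"> is never closed, so the second one ends up nested inside it and both contain the link. BeautifulSoup repairs the markup as follows: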
<div class="moduleBody">
 <div class="feature">
  <div class="feature">
   <h2>
    <a href="somelink">
     sometext
    </a>
   </h2>
   <div class="relatedInfo">
    <span class="relatedTopics">
     <span class="timestamp">
      22 Mar 2014
     </span>
    </span>
   </div>
  </div>
 </div>
</div>
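Because both nested feature divs contain the same link, one way to report each link only once is to iterate over the <a> elements themselves instead of over the divs. The following is a minimal sketch under that assumption, not the only way to do it; `extract_links` is a hypothetical helper name, and the built-in "html.parser" backend is chosen here only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# The question's markup, including the unclosed first <div class="feature">.
html = """
<div class="moduleBody">
<div class="feature">
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
"""


def extract_links(markup):
    """Return (text, href) pairs for non-video links, one per <a> element."""
    soup = BeautifulSoup(markup, "html.parser")
    results = []
    for module in soup.find_all("div", "moduleBody"):
        # Iterate over the anchors directly, so nested divs cannot
        # cause the same link to be visited more than once.
        for a in module.find_all("a", href=True):
            if "video" not in a["href"]:
                results.append((a.get_text(strip=True), a["href"]))
    return results


links = extract_links(html)
for name, href in links:
    print("Name: " + name)
    print("Link extracted: " + href)
```

Passing `href=True` to `find_all` skips anchors that have no href attribute, so the subscript access cannot raise a KeyError.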
Answer 1 (score: 0)
In your second for loop, the link variable holds a reference to <div class="feature">, which has no href attribute.
This depends a lot on your structure, but if each <div class="feature"> always starts with an <h2> tag that contains only an <a> tag, then you can do the following:
for links in soup.find_all('div','moduleBody'):
    for link in links.find_all('div','feature'):
        # Walk down from the feature div to the <a> inside its <h2>
        anchor_tag = link.h2.a
        if 'video' not in anchor_tag['href']:
            print('Name: %s' % anchor_tag.text)
            print('Link extracted: %s' % anchor_tag['href'])
By the way, your HTML is malformed; you should close the first <div class="feature"> tag:
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
<div class="relatedInfo">
<span class="relatedTopics">
<span class="timestamp">22 Mar 2014</span>
</div>
</div>
</div>
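Chained access such as link.h2.a raises an AttributeError when a <div class="feature"> has no <h2> (for example, an empty first div once it is properly closed), because the missing tag is returned as None. The sketch below shows one defensive variant; `first_heading_links` is a hypothetical helper name, and the markup is a trimmed version of the corrected HTML, used here only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed version of the corrected markup: the first feature div is empty.
html = """
<div class="moduleBody">
<div class="feature"></div>
<div class="feature">
<h2>
<a href="somelink">sometext</a>
</h2>
</div>
</div>
"""


def first_heading_links(markup):
    """Return (text, href) pairs from the <a> inside each feature's <h2>."""
    soup = BeautifulSoup(markup, "html.parser")
    found = []
    for module in soup.find_all("div", "moduleBody"):
        for feature in module.find_all("div", "feature"):
            h2 = feature.h2  # None when the div contains no <h2>
            anchor = h2.a if h2 is not None else None
            if anchor is None:
                continue  # skip feature divs without a heading link
            href = anchor.get("href")  # None instead of KeyError if absent
            if href and "video" not in href:
                found.append((anchor.get_text(strip=True), href))
    return found


heading_links = first_heading_links(html)
for name, href in heading_links:
    print("Name: %s" % name)
    print("Link extracted: %s" % href)
```

Using `anchor.get("href")` rather than `anchor["href"]` keeps the loop going when an anchor has no href attribute.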