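I'm trying to extract links from only certain sections of a Blogspot blog, but the output shows that the code extracts every link on the page.
Here is the code: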
import urlparse
import urllib
from bs4 import BeautifulSoup

url = "http://ellywonderland.blogspot.com/"
urls = [url]
visited = [url]

while len(urls) > 0:
    try:
        htmltext = urllib.urlopen(urls[0]).read()
    except:
        print urls[0]
    soup = BeautifulSoup(htmltext)
    urls.pop(0)
    print len(urls)
    for tags in soup.find_all(attrs={'class': "post-title entry-title"}):
        for tag in soup.findAll('a', href=True):
            tag['href'] = urlparse.urljoin(url, tag['href'])
            if url in tag['href'] and tag['href'] not in visited:
                urls.append(tag['href'])
                visited.append(tag['href'])
print visited
Here is the HTML of the section I want to extract:
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
Thanks.
Answer 0 (score: 2)
If you don't strictly need to use BeautifulSoup, I think something like this would be easier:
import feedparser
url = feedparser.parse('http://ellywonderland.blogspot.com/feeds/posts/default?alt=rss')
for x in url.entries:
    print str(x.link)
Output:
http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html
http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html
http://ellywonderland.blogspot.com/2010/12/tissue-paper-flower-crepe-paper.html
http://ellywonderland.blogspot.com/2010/12/menguap-menurut-islam.html
http://ellywonderland.blogspot.com/2010/12/weddings-idea.html
http://ellywonderland.blogspot.com/2010/12/kawin.html
http://ellywonderland.blogspot.com/2010/11/vitamin-c-collagen.html
http://ellywonderland.blogspot.com/2010/11/port-dickson.html
http://ellywonderland.blogspot.com/2010/11/ellys-world.html
feedparser can parse the RSS feed of a Blogspot blog and return the data you want, in this case the href of each post title.
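To show what the feed-based approach boils down to, here is a minimal sketch that reads the `<link>` of each `<item>` in an RSS document using only the Python 3 standard library instead of feedparser. The `SAMPLE_RSS` string and the `rss_item_links` helper are made up for illustration; the two URLs are taken from the output above, and a real Blogspot feed will contain many more elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed RSS 2.0 document; a real feed has more fields.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <item>
      <link>http://ellywonderland.blogspot.com/2011/03/my-vintage-pre-wedding.html</link>
    </item>
    <item>
      <link>http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html</link>
    </item>
  </channel>
</rss>"""


def rss_item_links(rss_text):
    """Return the text of the <link> child of every <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [item.findtext("link") for item in root.iter("item")]


links = rss_item_links(SAMPLE_RSS)
for link in links:
    print(link)
```

Because the feed only lists posts, there is no need to filter out sidebar or navigation links as there is when scraping the HTML page.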
Answer 1 (score: 0)
You need to call .get on the tag object:
print Objecta.get('href')
Example from http://www.crummy.com/software/BeautifulSoup/bs4/doc/:
for link in soup.find_all('a'):
    print(link.get('href'))
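To restrict this to the post titles, one option is to search for `<a>` tags inside each matching heading rather than in the whole document. In the question's code the inner loop calls `soup.findAll`, which scans the entire page on every pass. The following is a minimal sketch, assuming Python 3 and bs4 (unlike the Python 2 snippets above); `SAMPLE_HTML`, its sidebar link, and the `post_title_links` helper are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://ellywonderland.blogspot.com/"

# Sample markup modeled on the snippet in the question. The sidebar link is
# a hypothetical example of a link that should be ignored.
SAMPLE_HTML = """
<h3 class="post-title entry-title" itemprop="name">
<a href="http://ellywonderland.blogspot.com/2011/02/pre-wedding-vintage.html">Pre-wedding * Vintage*</a>
</h3>
<div class="sidebar"><a href="/p/about.html">About</a></div>
"""


def post_title_links(html, base_url):
    """Return absolute hrefs of <a> tags nested inside post-title headings."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h3", class_="post-title"):
        # Search inside the heading only, not the whole document.
        for anchor in heading.find_all("a", href=True):
            links.append(urljoin(base_url, anchor["href"]))
    return links


post_links = post_title_links(SAMPLE_HTML, BASE_URL)
print(post_links)
```

In the crawler from the question, the equivalent change would be to call `tags.findAll('a', href=True)` in the inner loop so that only links under the matched headings are queued.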