Extracting links from a URL using BeautifulSoup

Asked: 2013-12-09 11:34:27

Tags: python beautifulsoup

I'm trying to use BeautifulSoup to get the link from the following HTML:
<div class="alignright single">
<a href="http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/" rel="next">Hadith on Clothing: Women should lower their garments to cover their feet</a> &raquo;    </div>
</div>

My code is as follows:

from bs4 import BeautifulSoup                                                                                                                                 
import urllib2                                                                                                
url1 = "http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/"

content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1) 

nextlink = soup.findAll("div", {"class" : "alignright single"})
a = nextlink.find('a')
print a.get('href')

I get the following error; please help:

a = nextlink.find('a')
AttributeError: 'ResultSet' object has no attribute 'find'

2 Answers:

Answer 0 (score: 3)

If you only want one match, use .find():

nextlink = soup.find("div", {"class" : "alignright single"})
or loop over all matches:

for nextlink in soup.findAll("div", {"class" : "alignright single"}):
    a = nextlink.find('a')
    print a.get('href')

The latter part can also be written as:

a = nextlink.find('a', href=True)
print a['href']

The href=True part makes it match only elements that have an href attribute, which means you don't need a.get() because the attribute is guaranteed to be there (alternatively, if no <a href="..."> link is found, a will be None).
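
To make that None case explicit, a minimal sketch of the guard (reusing the nextlink name from the snippets above) could look like:

a = nextlink.find('a', href=True)
if a is not None:
    # href is guaranteed to exist here because of href=True
    print a['href']
else:
    print 'no <a href="..."> link found'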

For the URL given in your question there is only one such link, so .find() is probably the most convenient. You could even use:

nextlink = soup.find('a', rel='next', href=True)
if nextlink is not None:
    print nextlink['href']

There is no need to find the surrounding div; the rel="next" attribute looks specific enough for your particular needs.

As an extra tip: use the response headers to tell BeautifulSoup what encoding to use for the page; the urllib2 response object can tell you what character set (if any) the server thinks the HTML page is encoded in:

response = urllib2.urlopen(url1)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
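
If the server sends no charset parameter, getparam() returns None; passing from_encoding=None simply lets BeautifulSoup fall back to its own encoding detection, so a sketch with that fallback spelled out would be:

response = urllib2.urlopen(url1)
charset = response.info().getparam('charset')  # None when the header is absent
# from_encoding=None makes BeautifulSoup auto-detect the encoding instead
soup = BeautifulSoup(response.read(), from_encoding=charset)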

A quick demo of all the parts:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> response = urllib2.urlopen('http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-the-lower-garment-should-be-hallway-between-the-shins/')
>>> soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
>>> soup.find('a', rel='next', href=True)['href']
u'http://www.dailyhadithonline.com/2013/07/21/hadith-on-clothing-women-should-lower-their-garments-to-cover-their-feet/'

Answer 1 (score: 1)

You need to unwrap the list that findAll() returns; try this instead:

nextlink = soup.findAll("div", {"class" : "alignright single"})[0]

Or, since there is only one match, the find method should also work:

nextlink = soup.find("div", {"class" : "alignright single"})