Extracting links from a website with BeautifulSoup

Date: 2013-12-07 08:35:17

Tags: python beautifulsoup

I am using BeautifulSoup in Python and am struggling to get the link (i.e. /d/Hinchinbrook+25691+Masjid-Bilal) out of the "Result" below. Can anyone help?

Result:

<div class="subtitleLink"><a href="/d/Hinchinbrook+25691+Masjid-Bilal"><b>Masjid Bilal</b></a></div>

Code:

import urllib2
from bs4 import BeautifulSoup

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
results = soup.findAll("div", {"class": "subtitleLink"})
for result in results:
    print result
    br = result.find('a')
    pos = br.get_text()  # returns the link text, not the href
    print pos

2 answers:

Answer 0 (score: 2)

import urllib2
from bs4 import BeautifulSoup

url1 = "http://www.salatomatic.com/c/Sydney+168"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1)
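# Print the href of every <a> tag on the page (None if a tag has no href)
for link in soup.findAll('a'):
    print link.get('href')

This should work if you want every link on the page. If it doesn't, let me know.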
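If only the result links are wanted rather than every link on the page, one option is to narrow the search with a CSS selector. The following is a minimal sketch, assuming Python 3 and the bs4 package; it parses an inline sample instead of fetching the site, and the second div (class "other", href "/about") is a made-up entry added only to show the filtering. The names base_url, sample_html and links are illustrative.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Page the relative links are resolved against (URL taken from the question).
base_url = "http://www.salatomatic.com/c/Sydney+168"

# First div is the snippet from the question; the second is a hypothetical
# non-result link that the selector should skip.
sample_html = """
<div class="subtitleLink"><a href="/d/Hinchinbrook+25691+Masjid-Bilal"><b>Masjid Bilal</b></a></div>
<div class="other"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select only <a> tags that have an href and sit inside div.subtitleLink,
# then turn each relative href into an absolute URL.
links = [urljoin(base_url, a["href"]) for a in soup.select("div.subtitleLink a[href]")]

print(links)
```

Resolving with urljoin is optional; drop it if the relative paths are all that is needed.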

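Answer 1 (score: 2)

The get_text method returns only the text content of a tag. To get the link, access it as an attribute of the tag instead. In this particular case, change br.get_text() to br['href'] to get the result you want.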

...
>>> br = result.find('a')
>>> pos = br['href']
>>> print pos
/d/Hinchinbrook+25691+Masjid-Bilal
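
To see the difference without fetching the page, here is a minimal sketch that parses the snippet from the question as an inline string. It assumes Python 3 and the bs4 package; the names html, link, text, href and maybe_href are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# The snippet from the question, parsed inline instead of fetched from the site.
html = (
    '<div class="subtitleLink">'
    '<a href="/d/Hinchinbrook+25691+Masjid-Bilal"><b>Masjid Bilal</b></a>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
link = soup.find("div", {"class": "subtitleLink"}).find("a")

text = link.get_text()         # visible text inside the tag
href = link["href"]            # attribute value; raises KeyError if absent
maybe_href = link.get("href")  # same value, but None if the attribute is absent

print(text)
print(href)
```

Using link.get("href") instead of link["href"] may be the safer choice when some anchors on a page have no href attribute.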