How to get full links from BeautifulSoup, not just internal ones

Asked: 2015-04-05 14:35:31

Tags: python web-scraping beautifulsoup web-crawler

I'm new to Python. I'm building a crawler for the company I work for. While crawling its website, I found that internal links are not in the format I expected: they are relative paths rather than full URLs. How can I get the whole link instead of just the path? If I'm not being clear, run my code below:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []  # note: never filled in, so it prints as an empty list below

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    # pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(web_page, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))
    print(soup)


get_first_page('http://www.fashionroom.com.br')  # the function returns None, so printing its result would just print None
print(web_page_string)

1 Answer:

Answer 0 (score: 0)

Thanks, everyone. The fix I ended up with was adding an if to the script. If anyone spots a potential problem I might run into in the future, please let me know:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:          # <a> tags without an href would crash the slice below
            continue
        if href[0:4] == 'http':   # already an absolute link
            print(href)
        else:                     # prefix the seed to make the link absolute
            print(seed + '/' + href)
    print(final_page_string)


get_first_page('http://www.fashionroom.com.br')
print(web_page_string)
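One potential problem with the seed + '/' + href approach: a root-relative link like /sapatos becomes http://site//sapatos, and relative links are never resolved against the page that contains them. The standard library's urljoin handles all of these cases. A minimal sketch, shown with Python 3's urllib.parse since urllib2 is Python 2 only; the example paths are made up for illustration:

```python
from urllib.parse import urljoin

seed = 'http://www.fashionroom.com.br'

# urljoin resolves each kind of href correctly against the base URL
print(urljoin(seed, 'http://example.com/page'))  # absolute URL passes through unchanged
print(urljoin(seed, '/sapatos'))                 # root-relative path joined without a double slash
print(urljoin(seed, 'sapatos.html'))             # relative path resolved against the base
```

In the crawler above, replacing the if/else with print(urljoin(seed, href)) would cover absolute, root-relative, and relative links in one call.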