How to get full links from BeautifulSoup, not just internal ones

Asked: 2015-04-05 14:35:31

Tags: python web-scraping beautifulsoup web-crawler

I'm new to Python. I'm building a crawler for the company I work for. While crawling its website, I found that internal links are not in the format I expected: they are relative paths rather than full URLs. How can I get the whole link instead of just the path? If I'm not being clear, run my code below:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []  # note: never filled in, so it prints as an empty list below

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    # pass an explicit parser to avoid BeautifulSoup's "no parser specified" warning
    soup = BeautifulSoup(web_page, 'html.parser')
    for link in soup.find_all('a'):
        print(link.get('href'))
    print(soup)


get_first_page('http://www.fashionroom.com.br')  # the function returns None, so printing its result would just print None
print(web_page_string)

1 Answer:

Answer 0 (score: 0)

Thanks, everyone. The fix I ended up with was adding an if to the script. If anyone spots a potential problem I might run into in the future, please let me know:

import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page, 'html.parser')
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:          # <a> tags without an href would crash the slice below
            continue
        if href[0:4] == 'http':   # already an absolute link
            print(href)
        else:                     # prefix the seed to make the link absolute
            print(seed + '/' + href)
    print(final_page_string)


get_first_page('http://www.fashionroom.com.br')
print(web_page_string)
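One potential problem with the seed + '/' + href approach: a root-relative link like /sapatos becomes http://site//sapatos, and relative links are never resolved against the page that contains them. The standard library's urljoin handles all of these cases. A minimal sketch, shown with Python 3's urllib.parse since urllib2 is Python 2 only; the example paths are made up for illustration:

```python
from urllib.parse import urljoin

seed = 'http://www.fashionroom.com.br'

# urljoin resolves each kind of href correctly against the base URL
print(urljoin(seed, 'http://example.com/page'))  # absolute URL passes through unchanged
print(urljoin(seed, '/sapatos'))                 # root-relative path joined without a double slash
print(urljoin(seed, 'sapatos.html'))             # relative path resolved against the base
```

In the crawler above, replacing the if/else with print(urljoin(seed, href)) would cover absolute, root-relative, and relative links in one call.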