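I'm new to Python. I'm building a crawler for the company I work for. While crawling its website, I found internal links that are not in the format the crawler expects. How can I get the entire link instead of only the directory? If that isn't clear, please run my code below: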
import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    # Download the page and parse it
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    # Print the href of every <a> tag; internal links come out as bare paths
    for link in soup.find_all('a'):
        print (link.get('href'))
    print soup

print get_first_page('http://www.fashionroom.com.br')
print web_page_string
Answer 0: (score: 0)
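Thanks, everyone, for the answers. I tried adding an if to the script. If anyone spots a potential problem I might run into later, please let me know.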
import urllib2
from bs4 import BeautifulSoup

web_page_string = []

def get_first_page(seed):
    # Download the page and parse it
    response = urllib2.urlopen(seed)
    web_page = response.read()
    soup = BeautifulSoup(web_page)
    final_page_string = soup.get_text()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:
            # <a> tags without an href would raise a TypeError when sliced
            continue
        if href[0:4] == 'http':
            # Already an absolute URL
            print href
        else:
            # Relative link: prefix it with the seed URL
            print seed + '/' + href
    print final_page_string

print get_first_page('http://www.fashionroom.com.br')
print web_page_string
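One potential problem with building the URL as seed+'/'+href: an href that already starts with a slash ends up with a double slash, and hrefs such as "../page" or "#top" are not handled. The standard library's urljoin resolves relative links against a base URL and may be a safer choice. Below is a minimal sketch, not a definitive fix. It is written for Python 3, where the import is urllib.parse.urljoin (in Python 2 the same function lives in urlparse). The helper name resolve_links and the sample paths are made up for illustration and are not taken from the real site.

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin


def resolve_links(base_url, hrefs):
    """Return absolute URLs for the given href values, skipping missing ones.

    resolve_links is a hypothetical helper, not part of BeautifulSoup.
    """
    absolute = []
    for href in hrefs:
        if not href:
            # link.get('href') returns None for <a> tags without an href
            continue
        # urljoin leaves absolute URLs untouched and resolves relative ones
        absolute.append(urljoin(base_url, href))
    return absolute


if __name__ == "__main__":
    # Sample hrefs are invented for illustration only
    sample_hrefs = ["/colecao", "contato.html", "http://example.com/x", None]
    for url in resolve_links("http://www.fashionroom.com.br", sample_hrefs):
        print(url)
```

In the crawler, the list passed in would be something like [link.get('href') for link in soup.find_all('a')], with seed as the base URL.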