The same question was raised 2.5 years ago in Downloading a web page and all of its resource files in Python, but never got an answer, and "please see the related topic" isn't really asking the same thing.
I want to download everything on a page so that it can be viewed from the files alone.
The command
wget --page-requisites --domains=DOMAIN --no-parent --html-extension --convert-links --restrict-file-names=windows
does exactly what I need. But we want to be able to tie it in with other things that have to be portable, so it needs to be in Python.
I've been looking at Beautiful Soup, scrapy, and the various spiders posted around the place, but they all seem to deal with getting data/links in clever but specific ways. Using them for what I want looks like a lot of work just to handle finding all of the resources, when I'm sure there must be a simple way.
Many thanks
Answer 0 (score: 3)
You should use the right tool for the job at hand.
If you want to spider a site and save the pages to disk, Python probably isn't the best choice for that. Open source projects gain features when someone needs that feature, and because wget does this job so well, nobody has bothered to write a Python library to replace it.
Considering that wget runs on practically any platform that has a Python interpreter, is there a reason you can't use wget?
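If driving wget from Python is acceptable, a minimal sketch might look like the following. It assumes wget is installed and on the PATH; the domain and URL are placeholders for the questioner's own values.

import subprocess

# Build the same wget invocation as the command shown in the question.
# --domains and the final URL are placeholders.
cmd = [
    "wget",
    "--page-requisites",
    "--domains=example.com",
    "--no-parent",
    "--html-extension",
    "--convert-links",
    "--restrict-file-names=windows",
    "http://example.com/page.html",
]
exit_code = subprocess.call(cmd)  # 0 means wget finished successfully
if exit_code != 0:
    print("wget exited with code %d" % exit_code)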
Answer 1 (score: 2)
My colleague wrote this code, much of it pieced together from other sources I believe. It may have some quirks specific to our system, but it should help anyone wanting to do the same thing.
"""
Downloads all links from a specified location and saves to machine.
Downloaded links will only be of a lower level then links specified.
To use: python downloader.py link
"""
import sys,re,os,urllib2,urllib,urlparse
tocrawl = set([sys.argv[1]])
# linkregex = re.compile('<a\s*href=[\'|"](.*?)[\'"].*?')
linkregex = re.compile('href=[\'|"](.*?)[\'"].*?')
linksrc = re.compile('src=[\'|"](.*?)[\'"].*?')
def main():
    link_list = []##create a list of all found links so there are no duplicates
    restrict = sys.argv[1]##used to restrict found links to only have lower level
    link_list.append(restrict)
    parent_folder = restrict.rfind('/', 0, len(restrict)-1)
    ##a.com/b/c/d/ make /d/ as parent folder
    while 1:
        try:
            crawling = tocrawl.pop()
            #print crawling
        except KeyError:
            break
        url = urlparse.urlparse(crawling)##splits url into sections
        try:
            response = urllib2.urlopen(crawling)##try to open the url
        except:
            continue
        msg = response.read()##save source of url
        links = linkregex.findall(msg)##search for all href in source
        links = links + linksrc.findall(msg)##search for all src in source
        for link in (links.pop(0) for _ in xrange(len(links))):
            if link.startswith('/'):
                ##if /xxx a.com/b/c/ -> a.com/b/c/xxx
                link = 'http://' + url[1] + link
            elif ~link.find('#'):
                continue
            elif link.startswith('../'):
                if link.find('../../'):##only use links that are max 1 level above reference
                    ##if ../xxx.html a.com/b/c/d.html -> a.com/b/xxx.html
                    parent_pos = url[2].rfind('/')
                    parent_pos = url[2].rfind('/', 0, parent_pos-2) + 1
                    parent_url = url[2][:parent_pos]
                    new_link = link.find('/')+1
                    link = link[new_link:]
                    link = 'http://' + url[1] + parent_url + link
                else:
                    continue
            elif not link.startswith('http'):
                if url[2].find('.html'):
                    ##if xxx.html a.com/b/c/d.html -> a.com/b/c/xxx.html
                    a = url[2].rfind('/')+1
                    parent = url[2][:a]
                    link = 'http://' + url[1] + parent + link
                else:
                    ##if xxx.html a.com/b/c/ -> a.com/b/c/xxx.html
                    link = 'http://' + url[1] + url[2] + link
            if link not in link_list:
                link_list.append(link)##add link to list of already found links
                if (~link.find(restrict)):
                    ##only grab links which are below input site
                    print link ##print downloaded link
                    tocrawl.add(link)##add link to pending view links
                    file_name = link[parent_folder+1:]##folder structure for files to be saved
                    filename = file_name.rfind('/')
                    folder = file_name[:filename]##creates folder names
                    folder = os.path.abspath(folder)##creates folder path
                    if not os.path.exists(folder):
                        os.makedirs(folder)##make folder if it does not exist
                    try:
                        urllib.urlretrieve(link, file_name)##download the link
                    except:
                        print "could not download %s"%link
                else:
                    continue
if __name__ == "__main__":
    main()
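Assuming the script above is saved as downloader.py, it is a Python 2 program (it relies on urllib2, urlparse, xrange and print statements) and, per its docstring, is run with the start URL as its only argument; the URL below is just a placeholder:

python downloader.py http://example.com/b/c/

Links and resources found under that path are then saved into a matching folder structure beneath the current working directory.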
Thanks for the reply