Question

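I want to save all the images on a website. wget is horrible, at least for http://www.leveldesigninspirationmachine.tumblr.com, because in the images folder it only drops html files, with no file extension.

I found a python script; the usage is like this: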

[python] ImageDownloader.py URL MaxRecursionDepth DownloadLocationPath MinImageFileSize

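I finally got the script running after some BeautifulSoup problems. However, I can't find the files anywhere. I also tried "/" as the output directory, hoping the images would land in the root of my hard drive, but no luck. Can someone either help me simplify the script so it outputs to the directory I cd'd into in the terminal, or give me a command that should work? I have no python experience, and I don't really want to learn python for a 2-year-old script that may not even work the way I want.

Also, how can I pass an array of websites? Many scrapers only give me the first few results of the page. Tumblr loads more posts on scroll, but that has no effect here, so I would like to add /page1 etc.

Thanks in advance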

# imageDownloader.py
# Finds and downloads all images from any given URL recursively.
# FB - 201009094
import urllib2
from os.path import basename
import urlparse
#from BeautifulSoup import BeautifulSoup # for HTML parsing
import bs4
from bs4 import BeautifulSoup

global urlList
urlList = []

# recursively download images starting from the root URL
def downloadImages(url, level, minFileSize): # the root URL is level 0
    # do not go to other websites
    global website
    netloc = urlparse.urlsplit(url).netloc.split('.')
    if netloc[-2] + netloc[-1] != website:
        return

    global urlList
    if url in urlList: # prevent using the same URL again
        return

    try:
        urlContent = urllib2.urlopen(url).read()
        urlList.append(url)
        print url
    except:
        return

    soup = BeautifulSoup(''.join(urlContent))
    # find and download all images
    imgTags = soup.findAll('img')
    for imgTag in imgTags:
        imgUrl = imgTag['src']
        # download only the proper image files
        if imgUrl.lower().endswith('.jpeg') or \
            imgUrl.lower().endswith('.jpg') or \
            imgUrl.lower().endswith('.gif') or \
            imgUrl.lower().endswith('.png') or \
            imgUrl.lower().endswith('.bmp'):
            try:
                imgData = urllib2.urlopen(imgUrl).read()
                if len(imgData) >= minFileSize:
                    print "    " + imgUrl
                    fileName = basename(urlsplit(imgUrl)[2])
                    output = open(fileName,'wb')
                    output.write(imgData)
                    output.close()
            except:
                pass
    print
    print

    # if there are links on the webpage then recursively repeat
    if level > 0:
        linkTags = soup.findAll('a')
        if len(linkTags) > 0:
            for linkTag in linkTags:
                try:
                    linkUrl = linkTag['href']
                    downloadImages(linkUrl, level - 1, minFileSize)
                except:
                    pass

# main
rootUrl = 'http://www.leveldesigninspirationmachine.tumblr.com'
netloc = urlparse.urlsplit(rootUrl).netloc.split('.')
global website
website = netloc[-2] + netloc[-1]
downloadImages(rootUrl, 1, 50000)

Answer 1

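As Frxstream commented, the program creates the files in the current directory (i.e., wherever you run it from). After running the program, run ls -l (or dir) to find the files it created.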
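To make the "current directory" point concrete, here is a minimal sketch, assuming Python 3 (the script in the question is Python 2). The helper save_bytes and its out_dir parameter are hypothetical names and not part of the original script. With out_dir omitted it behaves like the script's open(fileName, 'wb') and writes wherever the program was started.

```python
import os
import tempfile


def save_bytes(data, file_name, out_dir=None):
    """Write data to out_dir/file_name and return the absolute path.

    out_dir is a hypothetical parameter for illustration. When it is
    omitted, the file goes to the current working directory, which is
    what a bare open(file_name, 'wb') does.
    """
    if out_dir is None:
        out_dir = os.getcwd()
    # Create the target directory if it does not exist yet.
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.abspath(os.path.join(out_dir, file_name))
    with open(path, "wb") as output:
        output.write(data)
    return path


if __name__ == "__main__":
    # Demo with a throwaway directory; the bytes are a placeholder, not an image.
    demo_dir = tempfile.mkdtemp()
    print(save_bytes(b"placeholder bytes", "example.png", demo_dir))
```

Printing the returned path after each write is a cheap way to see exactly where a file ended up.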

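If it seems that it did not create any files, then most likely it really didn't, probably because your except: pass is hiding an exception. To see what went wrong, replace try: ... except: pass with just ..., and rerun the program. (If you can't understand and fix the error, ask a separate StackOverflow question.)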
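A variation on this suggestion, if you would rather keep the program running after a failure, is to print the traceback instead of silently passing. This is only a sketch, assuming Python 3. The names run_and_report and failing_step are hypothetical, and failing_step just simulates an error rather than reproducing the one in your script.

```python
import traceback


def run_and_report(step, *args):
    """Call step(*args); on failure print the traceback instead of hiding it.

    Returns the step's result, or None if the step raised an exception.
    """
    try:
        return step(*args)
    except Exception:
        # Unlike "except: pass", this shows what actually went wrong.
        traceback.print_exc()
        return None


def failing_step(img_url):
    # Hypothetical stand-in for a download step that goes wrong.
    raise RuntimeError("simulated failure for " + img_url)


if __name__ == "__main__":
    run_and_report(failing_step, "http://example.com/a.png")
```

The traceback goes to standard error, so it shows up in the terminal next to the URLs the script prints.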

Answer 2

It's hard to tell without seeing the errors (+1 for turning off your try/except blocks so you can see the exceptions), but I do see one typo here:

fileName = basename(urlsplit(imgUrl)[2])

You don't have "from urlparse import urlsplit"; you have "import urlparse", so you need to refer to it as urlparse.urlsplit(), as you do elsewhere. It should look like this:

fileName = basename(urlparse.urlsplit(imgUrl)[2])
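For reference, the same filename extraction can be sketched in Python 3, where urlsplit lives in urllib.parse rather than in the Python 2 urlparse module used by the script. The function name file_name_from_url and the example URL are made up for illustration.

```python
from os.path import basename
from urllib.parse import urlsplit  # Python 3 location of urlsplit


def file_name_from_url(img_url):
    """Return the last path component of img_url, ignoring any query string."""
    # .path is the same field the script reads with urlsplit(imgUrl)[2].
    return basename(urlsplit(img_url).path)


if __name__ == "__main__":
    print(file_name_from_url("http://example.com/images/photo.jpg?size=large"))
    # prints "photo.jpg"
```

Because the name is imported directly here, calling urlsplit() without a module prefix is valid, which is exactly what the original script was missing.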

Setting the output location of a python script
