Question

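I'm trying to create a script that downloads all the image files from a web page and saves them to a directory. Here is my code, but I can't get it to download and save the files. Can anyone see why not? I know there is another way to do this with BeautifulSoup, but I'm trying to understand regular expressions and what can be done with them. Can anyone help?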

import traceback
import sys, re
from time import sleep
from urllib import urlretrieve

images = re.findall(r'([-\w]+\.(?:jpg))', webpage.read())
try:

    filename='./dogg/file.html'
    urlretrieve('http://dogpicturesite.com/', filename)
    webpage=open(filename, 'r')
    print "Downloading Images....."
    time.sleep(5)
    print "Images Downloaded."
    print images

except:
    print "Failed to Download Images"
    raw_input('Press Enter to exit...')
    sys.exit()

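With this script I can list the .jpg files on the web page. What I want to do now is download them, but from here I'm not sure how. I thought the script above would be the easier one to use, but would it be easier to edit the script below instead?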

import sys, urllib, re
def imagefiles(webpage):
    print ' imagefiles()'
    images = re.findall(r'([-\w]+\.(?:jpg))', webpage)

    for image in images:
        print image

def main():
    sys.argv.append('http://dogpicturesite.com/')
    if len(sys.argv) != 2:
        print '[-] Image Files'
        return
    page = webpage.webpage(sys.argv[1])
    imagefiles(webpage)

Answer 1

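I see three problems here:

You never define webpage, but you try to use it here:
```
images = re.findall(r'([-\w]+\.(?:jpg))', webpage)
```
You need to define webpage before this line.

You import urlretrieve directly:

from urllib import urlretrieve

So you need to remove the urllib. part of this line:

urllib.urlretrieve('http://dogpicturesite.com/', 'C:/images')

You never import re or time, but you use them in your code.

Note, however, that all of these errors (each of which raises a NameError) are hidden and silenced by the try/except block.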
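One way to make such silenced errors visible while debugging is to report the traceback from inside the handler; the question already imports traceback. The following is only a minimal sketch in Python 3 syntax (the thread's code is Python 2), and `broken_step` and `run_step` are hypothetical names used purely for illustration:

```python
import traceback


def broken_step():
    # 'webpage' is deliberately undefined here, mirroring the bug in the question.
    return webpage.read()  # noqa: F821


def run_step(step):
    """Run step() and return the traceback text if it fails, else None."""
    try:
        step()
    except Exception:
        # Keep the full traceback instead of swallowing the error.
        return traceback.format_exc()
    return None


report = run_step(broken_step)
print(report)
```

The printed report names the NameError and the offending line, which a plain "Failed to Download Images" message never reveals.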

Answer 2

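You used the statement

from urllib import urlretrieve

but then refer to urllib.urlretrieve.

The line

 urllib.urlretrieve('http://dogpicturesite.com/', 'C:/images')

triggers a NameError, but because you are using a bare except clause

except:

it hides this error. When I remove the bare except:, I see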

Traceback (most recent call last):
  File "dog.py", line 8, in <module>
    urllib.urlretrieve('http://dogpicturesite.com/', 'C:/images')
NameError: name 'urllib' is not defined

If the line were

     urlretrieve('http://dogpicturesite.com/', 'C:/images')

it would not trigger a NameError.
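The reason is that a from-import binds only the imported name, not the module it came from. A small sketch of that behaviour, assuming Python 3, where the function lives in urllib.request rather than urllib; the `namespace` dictionary is just an illustrative stand-in for a script's global names:

```python
# Run the import in an empty namespace so we can inspect what it defines.
namespace = {}
exec("from urllib.request import urlretrieve", namespace)

has_function = "urlretrieve" in namespace  # the imported name is bound
has_module = "urllib" in namespace         # the module name is not
print(has_function, has_module)  # prints: True False
```

So after a from-import the call has to be written as urlretrieve(...), or the import has to be changed to a plain module import.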

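A good rule in Python is to catch only the exceptions you expect, e.g.

except IOError:

because an IOError can occur while a file is being written. A NameError, however, should only occur because of a programming mistake, and you don't want to hide or handle it in the same way.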
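As a rough illustration of that rule, the sketch below catches only I/O failures and lets everything else propagate. It assumes Python 3, where IOError is an alias of OSError; `save_text` is a hypothetical helper, not something from the original script:

```python
import os
import tempfile


def save_text(path, text):
    """Write text to path; return True on success, False on an I/O failure."""
    try:
        with open(path, "w") as handle:
            handle.write(text)
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        print("Could not write %s: %s" % (path, exc))
        return False
    return True


directory = tempfile.mkdtemp()
# Opening a directory for writing fails with an OSError subclass.
dir_result = save_text(directory, "hello")
# Writing to a file inside the directory succeeds.
file_result = save_text(os.path.join(directory, "page.html"), "hello")
print(dir_result, file_result)
```

A typo inside save_text would still raise NameError and stop the program, which is what you want for a programming mistake.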

Next, urllib.urlretrieve does not take a directory as an argument; it needs a filename. Otherwise, it will tell you

IOError: [Errno 21] Is a directory: './dogg'
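One possible way to turn a URL plus a target directory into the filename that urlretrieve expects is sketched below. It assumes Python 3; `local_path_for`, the `default_name` fallback, and the `/pics/` path in the second URL are all hypothetical and chosen only for illustration:

```python
import os
import posixpath
from urllib.parse import urlsplit


def local_path_for(url, directory, default_name="index.html"):
    """Return a file path inside directory for the resource at url."""
    # Use the last path component of the URL; fall back for URLs ending in '/'.
    name = posixpath.basename(urlsplit(url).path) or default_name
    return os.path.join(directory, name)


page_path = local_path_for("http://dogpicturesite.com/", "dogg")
image_path = local_path_for(
    "http://dogpicturesite.com/pics/cute-puppy-184x184.jpg", "dogg"
)
print(page_path)
print(image_path)
```

The returned path can then be passed as the second argument to urlretrieve, provided the directory already exists.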

Next, now that we know urlretrieve saves to a file, we have to open that file. Change the first part to

filename='./dogg/file.html'
urlretrieve('http://dogpicturesite.com/', filename)
webpage=open(filename, 'r')

Once that runs, it brings us to the next hidden exception: the re module has not been imported, so

images = re.findall(r'([-\w]+\.(?:jpg))', webpage)

triggers a NameError.

Add

import re

to the top.

Then, since webpage is now a file object, that line should become

images = re.findall(r'([-\w]+\.(?:jpg))', webpage.read())

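But we never imported time either, so we get another NameError. Add

from time import sleep

to the top and change the time.sleep(5) line to sleep(5).

Now the program runs without errors.

However! Note that it doesn't actually download any images, because it does nothing with the images variable. At least add a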

print images

so you can see how the regular expression works. I got

jal@squiddle:~$ python dog.py 
['instrument-dog-184x184.jpg', 'instrument-dog.jpg', 'wallpaper-christmas-chihuahua-135x80.jpg', 'more-135x80.jpg', 'instrument-dog-184x184.jpg', 'more-184x184.jpg', 'eye-covered-184x184.jpg', 'cute-puppy-184x184.jpg', 'hello-dog-184x184.jpg', 'bathing-dog-184x184.jpg', 'screaming-dog-184x184.jpg', 'patches-and-dylan-184x184.jpg', 'cast-dog-184x184.jpg', 'screaming-puppy-184x184.jpg', 'miserable-dog-184x184.jpg', 'sun-dog-184x184.jpg', 'sleeping-dog-184x184.jpg', '291638_10150913381017747_226545279_o-184x184.jpg', 'swimming-dogs-184x184.jpg', 'chores-dog-184x184.jpg', 'IMG_20120701_0354361-184x184.jpg', 'close-up-dog1-184x184.jpg', 'let-the-dog-in-184x184.jpg', 'baths-184x184.jpg']
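The pattern above captures only bare file names, so each one still has to be turned into a full URL before it can be downloaded. The sketch below is one possible approach, not the original poster's code: it assumes Python 3, swaps in a different regex that captures whole src attribute values ending in .jpg, and resolves them against the page URL with urljoin. The names `SRC_JPG`, `find_image_urls`, `download_images`, and the sample HTML are all hypothetical.

```python
import os
import posixpath
import re
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

# Assumed pattern: the value of a src attribute that ends in .jpg.
SRC_JPG = re.compile(r'src=["\']([^"\']+\.jpg)["\']', re.IGNORECASE)


def find_image_urls(page_url, html):
    """Return absolute .jpg URLs found in html, in order, without duplicates."""
    urls = []
    for src in SRC_JPG.findall(html):
        absolute = urljoin(page_url, src)  # resolve relative paths
        if absolute not in urls:
            urls.append(absolute)
    return urls


def download_images(urls, directory, retrieve=urlretrieve):
    """Save each URL into directory and return the list of local paths."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path)
        target = os.path.join(directory, name)
        retrieve(url, target)  # urlretrieve needs a file name, not a directory
        saved.append(target)
    return saved


sample_html = '<img src="/pics/instrument-dog.jpg"><img src="cute-puppy-184x184.jpg">'
image_urls = find_image_urls("http://dogpicturesite.com/", sample_html)
print(image_urls)
# To actually download: download_images(image_urls, "./dogg")
```

The `retrieve` parameter exists only so the download step can be exercised without network access; in normal use the default urlretrieve does the work.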

Answer 3

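Have you looked at pyparsing? It would certainly pull out all the image links for you in short order and return them for you to download.

If you go through the examples listed here, you can adapt them to your taste. Also check out this link: Replace SRC of all IMG elements using Parser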
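For comparison, the same parser-based idea can be sketched with the standard library's html.parser instead of pyparsing (a deliberate substitution, since pyparsing is a third-party package). This assumes Python 3; `ImageSrcCollector` and the sample markup are hypothetical:

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


collector = ImageSrcCollector()
collector.feed(
    '<p><img src="hello-dog-184x184.jpg" alt="dog">'
    '<IMG SRC="baths-184x184.jpg"></p>'
)
print(collector.sources)
```

Unlike the regex, this picks up the full src value regardless of attribute order or tag case.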
