Question

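I'm trying to parse HTML for all of its img tags, download every image the src attributes point to, and then add those files to a zip file. I would prefer to do all of this in memory, since I can guarantee there won't be that many images.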

Assume the images variable has already been populated by parsing the HTML. What I need help with is getting the images into the zip file.

from zipfile import ZipFile
from StringIO import StringIO
from urllib2 import urlopen

s = StringIO()
zip_file = ZipFile(s, 'w')
try:
    for image in images:
        internet_image = urlopen(image)
        zip_file.writestr('some-image.jpg', internet_image.fp.read())
        # it is not obvious why I have to use writestr() instead of write()
finally:
    zip_file.close()

Answer 1

I'm not quite sure what you're asking here, since you appear to have most of it sorted out already.

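Have you looked into HTMLParser to do the actual HTML parsing? I wouldn't try hand-rolling a parser yourself; it's a major task with numerous edge cases. Don't even think about regexps for anything but the most trivial cases.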
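
As a rough illustration of the parser approach, here is a minimal sketch using the standard library parser. It assumes Python 3, where the module lives at html.parser; the names ImgSrcCollector and get_img_srcs are made up for this example and are not part of any library.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing <img/> tags also end up here.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def get_img_srcs(html):
    """Return the src values of all img tags in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs
```

The returned values are whatever the page wrote in src, so relative URLs would still need to be resolved against the page URL before downloading.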

For each <img/> tag you can use httplib to actually fetch the image. It may be worth fetching the images in multiple threads to speed up building the zip file.
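
The threaded fetch could look something like the sketch below. It assumes Python 3 and swaps in urllib.request.urlopen and concurrent.futures in place of raw httplib calls; fetch, fetch_all, and the max_workers default are illustrative choices, not anything prescribed by the question.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch every URL on a small thread pool.

    pool.map yields results in the same order as the input URLs, so the
    returned list lines up with `urls` index by index.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Passing fetch_one as a parameter keeps the download step replaceable, which also makes it possible to try the function without touching the network. Any exception raised by a download propagates out of fetch_all.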

Answer 2

The easiest way I can think of to do this would be with the BeautifulSoup library.

Something along the lines of:

from BeautifulSoup import BeautifulSoup
from collections import defaultdict

def getImgSrces(html):
    srcs = []
    soup = BeautifulSoup(html)

    for tag in soup('img'):
        attrs = defaultdict(str)
        for attr in tag.attrs:
            attrs[ attr[0] ] = attr[1]
        attrs = dict(attrs)

        if 'src' in attrs.keys():
            srcs.append( attrs['src'] )

    return srcs

This should give you a list of the URLs taken from your img tags.

Answer 3

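To answer your specific question about how to create the ZIP archive (others here have already discussed parsing out the URLs), I tested your code. You are remarkably close to a finished product already.

Here is how I would augment what you have to create the Zip archive (in this example I write the archive to disk so that I can verify it was written correctly).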

from zipfile import ZipFile, ZipInfo, ZIP_DEFLATED
import zlib
from cStringIO import StringIO
from urllib2 import urlopen
from urlparse import urlparse
from os import path

images = ['http://sstatic.net/so/img/logo.png', 
          'http://sstatic.net/so/Img/footer-cc-wiki-peak-internet.png']

buf = StringIO()
# By default, zip archives are not compressed... adding ZIP_DEFLATED
# to achieve that. If you don't want that, or don't have zlib on your
# system, delete the compression kwarg
zip_file = ZipFile(buf, mode='w', compression=ZIP_DEFLATED)

for image in images:
    internet_image = urlopen(image)
    fname = path.basename(urlparse(image).path) 
    zip_file.writestr(fname, internet_image.read())

zip_file.close()

output = open('images.zip', 'wb')
output.write(buf.getvalue())
output.close()
buf.close()
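
On the comment in the question about writestr() versus write(): write() expects the path of a file that already exists on disk, whereas writestr() takes the archive member name plus the data itself, which is what you want when everything stays in memory. For anyone on Python 3, the same idea might be sketched as follows. The function name build_zip, the (url, data) input shape, and the renaming rule for duplicate names are assumptions made for this example.

```python
import io
import posixpath
from urllib.parse import urlparse
from zipfile import ZIP_DEFLATED, ZipFile


def build_zip(named_blobs):
    """Return the bytes of a zip archive built from (url, data) pairs."""
    buf = io.BytesIO()
    with ZipFile(buf, mode="w", compression=ZIP_DEFLATED) as zip_file:
        seen = set()
        for index, (url, data) in enumerate(named_blobs):
            # Use the last path segment of the URL as the member name,
            # with a fallback when the path has no file name.
            name = posixpath.basename(urlparse(url).path) or "image-%d" % index
            if name in seen:
                # Simple de-duplication: prefix the position in the list.
                name = "%d-%s" % (index, name)
            seen.add(name)
            zip_file.writestr(name, data)
    # The ZipFile is closed at this point, but the BytesIO buffer it was
    # given stays open, so the finished archive can still be read back.
    return buf.getvalue()
```

The returned bytes can be written to a file opened in 'wb' mode or sent straight over the wire, so nothing has to touch the disk.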

Parse an HTML file and add the images found to a zip file
