Question

With the following code, every image is saved twice. How can I skip images that have already been saved?

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


i = 1
soup = make_soup("https://www./")
for img in soup.findAll('img'):
    temp = img.get('src')
    image = temp
    if str(image):
        filename = str(i)
        i = i + 1
        imagefile = open(filename + '.png', 'wb')
        imagefile.write(urllib.request.urlopen(image).read())
        imagefile.close()

Answer 1

You can use a set to filter out the duplicates, like this:

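# A set keeps one copy of each src; <img> tags without a src are skipped.
unique_srcs = list(set(img.get('src') for img in soup.findAll('img') if img.get('src')))
for img_src in unique_srcs:
    filename = str(i)
    i = i + 1
    # The with block closes the file even if the download fails.
    with open(filename + '.png', 'wb') as imagefile:
        imagefile.write(urllib.request.urlopen(img_src).read())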

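Keep in mind that this may change the order of the files, because a set is unordered. If you cannot afford to lose the order, you can get the same result by iterating over the soup.findAll() list, checking whether each element's src is already in a separate unique_srcs list, appending it only if it is not, and then iterating over that list the way I did above.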

Edit: to preserve the order, use this code instead of the set-based one-liner that builds unique_srcs.

unique_srcs = []
for img in soup.findAll('img'):
    if img.get('src') not in unique_srcs:
        unique_srcs.append(img.get('src'))

Your first element will then be unique_srcs[0] (the logo, in your case).
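
As a possible alternative to the membership check against a list, here is a minimal sketch that uses dict.fromkeys, which drops duplicates while keeping first-seen order (guaranteed for dicts since Python 3.7). The helper name unique_in_order and the sample src values are made up for illustration; in the real script the input would be the img.get('src') values from soup.findAll('img').

```python
def unique_in_order(srcs):
    """Return the distinct, non-empty src values in first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    # The "if src" filter drops None and empty strings.
    return list(dict.fromkeys(src for src in srcs if src))


# Hypothetical src values, as img.get('src') might return them;
# None stands for an <img> tag that has no src attribute.
srcs = ["logo.png", "a.png", "logo.png", None, "b.png", "a.png"]
unique_srcs = unique_in_order(srcs)
print(unique_srcs)  # ['logo.png', 'a.png', 'b.png']
```

With many images this avoids rescanning the list for every tag, and skipping the None entries means urlopen is never called on a missing src.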
