Question

My error:

File "C:/Users/hp dv4/PycharmProjects/project/imagescrap.py", line 22, in <module>
    imagefile.write(urllib.request.urlopen(img_src).read())
ValueError: unknown url type: '/img/logo_with_text.png'

I get this error when scraping the site below, although the same code works fine on some other websites.

import urllib.request
from bs4 import BeautifulSoup


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


i = 1
soup = make_soup("http://ioe.edu.np/")

unique_srcs = []
for img in soup.findAll('img'):
    if img.get('src') not in unique_srcs:
        unique_srcs.append(img.get('src'))
for img_src in unique_srcs:
    filename = str(i)
    i = i + 1
    imagefile = open(filename + '.png', 'wb')
    imagefile.write(urllib.request.urlopen(img_src).read())
    imagefile.close()

Answer 1

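As the error message says:

 unknown url type: '/img/logo_with_text.png'

The src is a relative path, so urlopen has no scheme or host to work with. Prepend http://ioe.edu.np/ to img_src and it should work.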
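Plain string concatenation can misfire when a src is already absolute or when slashes double up. A minimal sketch using the standard library's urljoin instead; the helper name absolute_src and the example.com URL are illustrative assumptions, not part of the original code:

```python
from urllib.parse import urljoin

base_url = "http://ioe.edu.np/"  # page the <img> tags were scraped from


def absolute_src(base, src):
    """Resolve an <img> src against the page URL (hypothetical helper)."""
    return urljoin(base, src)


# A root-relative src gets the scheme and host of the page.
print(absolute_src(base_url, "/img/logo_with_text.png"))
# A src that is already absolute is returned unchanged.
print(absolute_src(base_url, "http://example.com/a.png"))
```

The result of absolute_src can then be passed to urllib.request.urlopen in place of the bare img_src.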

Answer 2

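The code above will run into one more problem.

You are saving every file with a .png extension regardless of its real format, which may make the files unreadable.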

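import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def make_soup(url):
    # Fetch the page and parse it with Python's built-in HTML parser.
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata


base_url = "http://ioe.edu.np/"
soup = make_soup(base_url)

# Collect each distinct, non-empty src once, preserving page order.
unique_srcs = []
for img in soup.find_all('img'):
    src = img.get('src')
    if src and src not in unique_srcs:
        unique_srcs.append(src)

for i, img_src in enumerate(unique_srcs):
    print(img_src)
    filename = str(i)
    # Keep the file's own extension instead of forcing .png.
    extension = img_src.split('.')[-1]
    with open(filename + '.' + extension, 'wb') as f:
        # urljoin resolves relative srcs such as '/img/logo_with_text.png'
        # against base_url and leaves absolute URLs untouched.
        f.write(urllib.request.urlopen(urljoin(base_url, img_src)).read())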
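Taking the extension with split('.')[-1] as above can pick up a query string, or return the whole path when the src has no dot at all. A hedged sketch of a slightly more defensive approach; the helper name extension_of and the ".png" fallback are assumptions made for illustration:

```python
import os
from urllib.parse import urlparse


def extension_of(src, default=".png"):
    """Return the file extension of a src, dot included (hypothetical helper).

    Only the path component is inspected, so a query string such as
    '?v=2' is ignored. Falls back to `default` when no extension is found.
    """
    path = urlparse(src).path
    ext = os.path.splitext(path)[1]
    return ext or default


for src in ["/img/logo_with_text.png", "/img/banner.jpg?v=2", "/img/avatar"]:
    print(src, extension_of(src))
```

Because the returned extension already includes the dot, the output filename would be built as filename + extension_of(img_src).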

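Some idiomatic Python suggestions:

Use enumerate instead of managing a counter by hand.
Use the with open(...) construct so the file is closed for you.

One more thing you could improve:

Use a set instead of a list to track the srcs, so that you never download the same file twice.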
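The set suggestion could look roughly like this. The sample src values and the helper name unique_srcs_from are made up for illustration; in the real script the list would come from the img tags. Sorting is an added assumption so that the numbered filenames come out the same on every run, since sets have no guaranteed order:

```python
# Hypothetical sample of src values, including a repeat and a missing src.
srcs = [
    "/img/logo_with_text.png",
    "/img/banner.jpg",
    "/img/logo_with_text.png",
    None,
]


def unique_srcs_from(srcs):
    """Return the distinct, non-empty src values in sorted order."""
    return sorted({src for src in srcs if src})


for i, img_src in enumerate(unique_srcs_from(srcs)):
    print(i, img_src)
```

Membership checks on a set are constant time on average, whereas the list version rescans the whole list for every image.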

Unknown url type: image scraping
