Question

I'm trying to get images from a website, but I'm getting an error.

Here is the code:

url = 'http://www.techradar.com/news/internet/web/12-best-places-to-get-free-images-for-your-site-624818'
image = urlopen(url).read()
patFinderImage = re.compile('.jpg')
imgUrl = re.findall('<img src="(.*)" />', url)
outfile = open('abc.htm', 'wb')
outfile.write(imgUrl)
outfile.close

Error:

Traceback (most recent call last):
  File "C:\Users\joh\workspace\new2\newnewurl.py", line 14, in <module>
    outfile.write(imgUrl)
TypeError: 'list' does not support the buffer interface

Answer 1

re.findall returns a list of the strings it found. So imgUrl is a list.

You can't write a list of strings to a file, only a single string. Hence the error message.
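To see this in isolation, here is a minimal sketch. The sample markup and the in-memory io.BytesIO object (standing in for a file opened with 'wb') are assumptions made for illustration, not part of your script, and the pattern uses [^"]* rather than your (.*) so that each match stops at the closing quote:

```python
import io
import re

# Made-up sample markup; the real page will look different.
sample_html = '<img src="a.jpg" /> <img src="b.jpg" />'

# findall returns a list with one string per match.
matches = re.findall(r'<img src="([^"]*)" />', sample_html)
print(matches)

# BytesIO behaves like a file opened in binary mode.
binary_file = io.BytesIO()
try:
    binary_file.write(matches)  # a list is not bytes, so this raises TypeError
    write_failed = False
except TypeError as exc:
    write_failed = True
    print('write failed:', exc)
```

The exact wording of the TypeError depends on the Python version, but the cause is the same: write was handed a list.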

If you want to write out the string representation of the list (which is easy, but unlikely to be useful), you can do this:

outfile.write(str(imgUrl))

If you want to write just the first URL (which is a string), you can do this:

outfile.write(imgUrl[0])

If you want to write all of the URLs, one per line:

for url in imgUrl:
    outfile.write(url + '\n')

Or, since it's HTML and whitespace doesn't matter, you can write them all run together:

outfile.write(''.join(imgUrl))

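Then you have a second problem. For some reason, you've opened the file in binary mode. I don't know why you did that, but if you do, you can only write bytes to the file, not strings. You don't have a list of bytes, though; you have a list of strings. So you need to encode those strings into bytes. For example: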

for url in imgUrl:
    outfile.write(url.encode('utf-8') + b'\n')

Or, better, just don't open the file in binary mode:

outfile = open('abc.htm', 'w')

If you want to specify an explicit encoding, you can still do that without using binary mode:

outfile = open('abc.htm', 'w', encoding='utf-8')
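Putting the two options side by side, here is a small sketch. The helper names write_urls_text and write_urls_binary, the example.com URLs, and the temporary file names are all made up for illustration:

```python
import os
import tempfile


def write_urls_text(path, urls):
    """Text mode: write str objects; open() handles the encoding."""
    with open(path, 'w', encoding='utf-8') as outfile:
        for url in urls:
            outfile.write(url + '\n')


def write_urls_binary(path, urls):
    """Binary mode: every str must be encoded to bytes first."""
    with open(path, 'wb') as outfile:
        for url in urls:
            outfile.write(url.encode('utf-8') + b'\n')


# Made-up sample data standing in for the regex results.
sample_urls = ['http://example.com/a.jpg', 'http://example.com/b.jpg']

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, 'urls_text.htm')
    binary_path = os.path.join(tmp_dir, 'urls_binary.htm')
    write_urls_text(text_path, sample_urls)
    write_urls_binary(binary_path, sample_urls)

    # Read both files back to check what was written.
    with open(text_path, encoding='utf-8') as infile:
        text_lines = infile.read().splitlines()
    with open(binary_path, 'rb') as infile:
        binary_content = infile.read()

print(text_lines)
print(binary_content)
```

The with statement closes each file when its block ends. That matters here because your original outfile.close, without parentheses, only refers to the method and never calls it.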

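You may also have a third problem. From your comments, it seems that imgUrl[0] is giving you an IndexError. That means the list is empty, which means your regex isn't actually finding any URLs to write. In that case, you obviously can't write them out successfully (unless you're expecting an empty file).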

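The reason (or at least one reason) the regex finds nothing is that you aren't actually searching the downloaded HTML (which you stored in image) but the URL of that HTML (which you stored in url):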

imgUrl = re.findall('<img src="(.*)" />', url)

...and, obviously, there are no matches for your regex in the string 'http://www.techradar.com/news/internet/web/12-best-places-to-get-free-images-for-your-site-624818'.
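A rough sketch of searching the downloaded HTML instead might look like this. The helper names fetch_html and find_image_urls, the sample markup, the UTF-8 decoding, and the looser pattern are all my assumptions, not something from your code. In particular, I'm guessing that the real page's img tags carry extra attributes, in which case your original pattern could still come up empty even against the HTML:

```python
import re
from urllib.request import urlopen

# Assumed pattern: tolerates other attributes before src and stops at the
# closing quote instead of greedily matching to the end of the line.
IMG_SRC_PATTERN = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)


def fetch_html(page_url):
    """Download a page and decode it to str (UTF-8 is an assumption)."""
    with urlopen(page_url) as response:
        return response.read().decode('utf-8', errors='replace')


def find_image_urls(html):
    """Return the src value of every img tag the pattern matches."""
    return IMG_SRC_PATTERN.findall(html)


# Made-up markup so the example runs without network access.
sample_html = '<p><img src="a.jpg" /> text <img class="x" src="b.png" alt="B"></p>'
url = 'http://www.techradar.com/news/internet/web/12-best-places-to-get-free-images-for-your-site-624818'

found_in_html = find_image_urls(sample_html)  # searching HTML text
found_in_url = find_image_urls(url)           # searching the URL string itself

print(found_in_html)
print(found_in_url)
```

In the real script you would call find_image_urls(fetch_html(url)). The decode step matters because urlopen(url).read() returns bytes, and a str pattern can't be used to search bytes.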
