Question

Here is my code:

import re,urllib
from urllib import request, parse

def gh(url):
    html=urllib.request.urlopen(url).read().decode('utf-8')
    return html

def gi(x):
    r=r'src="(.+?\.jpg)"'
    imgre=re.findall(r, x)
    y=0
    for iu in imgre:
        urllib.request.urlretrieve(iu, '%s.jpg' %y)
        y=y+1

va=gh('http://tieba.baidu.com/p/3497570603')
print(gi(va))

When I run it, this happens:

UnicodeEncodeError: 'ascii' codec can't encode character '\u65e5' in position 873: ordinal not in range(128)

I decoded the website's content with 'utf-8', so it is now a string. Where does the 'ascii' codec problem come from?

Answer 1

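The problem is that the HTML content of http://tieba.baidu.com/p/3497570603 contains references to .png images. Because . also matches the " character, the non-greedy regular expression starts at the src=" of a .png image and keeps going until it reaches the first .jpg", so it matches long strings of text such as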

http://static.tieba.baidu.com/tb/editor/images/client/image_emoticon28.png" ><br><br><br><br>
...
title="蓝钻"><img src="http://imgsrc.baidu.com/forum/pic/item/bede9735e5dde711c981db20a0efce1b9f1661d5.jpg

Calling urlretrieve() with a URL made up of such a long string, which contains non-ASCII characters, raises a UnicodeEncodeError when the URL argument is converted to ASCII.
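The failure can be reproduced without any network access. The sketch below only imitates the ASCII conversion step; it does not call urlretrieve() itself. The string bad_url is a made-up stand-in for one of the over-matched strings, and example.com is a placeholder host.

```python
# Hypothetical over-matched "URL": it runs from a .png reference,
# through a title attribute with Chinese text, up to the next .jpg.
bad_url = 'http://example.com/a.png" title="蓝钻"><img src="http://example.com/b.jpg'

try:
    # Encoding to ASCII fails on the first non-ASCII character.
    bad_url.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__)
print(error)
```

The reported position points at the first non-ASCII character in the string, which is a quick hint that the "URL" contains more than a URL.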

A better regular expression avoids matching too much text by not allowing the match to cross a " character:

 r=r'src="([^"]+?\.jpg)"'
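To see the difference between the two patterns, here is a small comparison on a made-up HTML fragment. The fragment and the example.com URLs are assumptions for illustration, not the real page content.

```python
import re

# Hypothetical fragment: a .png emoticon, some markup with Chinese text,
# then the .jpg image that should actually be downloaded.
sample_html = (
    '<img src="http://example.com/emoticon.png" ><br>'
    '<a title="蓝钻"><img src="http://example.com/photo.jpg">'
)

# Original pattern: "." also matches quotes, so the match can start at the
# .png reference and run across tags until the first '.jpg"'.
loose = re.findall(r'src="(.+?\.jpg)"', sample_html)

# Suggested pattern: [^"] keeps the match inside one attribute value.
strict = re.findall(r'src="([^"]+?\.jpg)"', sample_html)

print(loose)
print(strict)
```

With the original pattern the single match spans both tags, while the stricter pattern returns only the .jpg URL.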

Debugging

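In the spirit of teaching someone to fish rather than simply giving them a fish for one day, I suggest using print statements to debug this kind of problem. I was able to diagnose this particular problem by replacing the line urllib.request.urlretrieve(iu, '%s.jpg' %y) with print(iu).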
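One possible way to apply this is to list the matches instead of downloading them. In the sketch below, list_image_urls and the sample string are hypothetical names introduced for illustration; the pattern is the one from the question.

```python
import re

def list_image_urls(html, pattern=r'src="(.+?\.jpg)"'):
    """Return the matched strings without downloading anything."""
    return re.findall(pattern, html)

# Hypothetical page fragment with a .png reference before the .jpg.
sample = '<img src="http://example.com/a.png" alt="日"><img src="http://example.com/b.jpg">'

for index, url in enumerate(list_image_urls(sample)):
    # repr() makes stray quotes and non-ASCII characters easy to spot,
    # and an unexpectedly large length hints at an over-long match.
    print(index, len(url), repr(url))
```

A match that contains quotes, tags, or non-ASCII text is clearly not a URL, which points to the regular expression rather than to the decoding step.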

How do I solve the ascii problem when I try to download pictures?
