Question

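I am using Python 2.7.3 on Ubuntu 12 x64.

I have about 200,000 files in a folder on my filesystem. Some of the filenames contain HTML-encoded and escaped characters, because the files were originally downloaded from a website. Here are some examples: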

Jamaica%2008%20114.jpg
thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg

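I wrote a simple Python script that walks the folder and renames every file whose name contains encoded characters. The new filename is obtained simply by decoding the string that makes up the old filename.

The script works for most files, but for some of them Python chokes and spits out the following error: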

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
Traceback (most recent call last):
  File "./download.py", line 53, in downloadGalleries
    numDownloaded = downloadGallery(opener, galleryLink)
  File "./download.py", line 75, in downloadGallery
    filePathPrefix = getFilePath(content)
  File "./download.py", line 90, in getFilePath
    return cleanupString(match.group(1).strip()) + '/' + cleanupString(match.group(2).strip())
  File "/home/abc/XYZ/common.py", line 22, in cleanupString
    return HTMLParser.HTMLParser().unescape(string)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)

Here is my cleanupString function:

def cleanupString(string):
    string = urllib2.unquote(string)

    return HTMLParser.HTMLParser().unescape(string)

And here is a snippet of the code that calls cleanupString (this is not the same code as in the traceback above, but it produces the same error):

rootFolder = sys.argv[1]
pattern = r'.*\.jpg\s*$|.*\.jpeg\s*$'
reobj = re.compile(pattern, re.IGNORECASE)
imgs = []

for root, dirs, files in os.walk(rootFolder):
    for filename in files:
        foundFile = os.path.join(root, filename)

        if reobj.match(foundFile):
            imgs.append(foundFile)

for img in imgs :
    print 'Checking file: ' + img
    newImg = cleanupString(img) #Code blows up here for some files

Can anyone suggest a way to fix this error? I have already tried adding

# -*- coding: utf-8 -*-

to the top of the script, but it had no effect.

Thanks.

Answer 1

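Your filenames are byte strings containing UTF-8 bytes that represent Unicode characters. The HTML parser normally works with unicode data rather than byte strings; in particular, when it encounters an ampersand escape, Python automatically tries to decode the value for you, and by default it uses ASCII to do so. That fails for UTF-8 data, because it contains bytes outside the ASCII range.

You need to explicitly decode your string to a unicode object: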
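To make the failure concrete, here is a minimal sketch of the same decode done explicitly. It is written for Python 3, where bytes and text never mix implicitly, so the ASCII decode that Python 2 performs behind the scenes has to be spelled out. The byte string is an illustrative value modelled on one of the filenames in the question, not output from the asker's script.

```python
# Python 3 sketch: reproduce the decode that Python 2 attempts implicitly.
# raw is an assumed example: the UTF-8 bytes that percent-decoding would yield.
raw = b"thai_trip_\xe8\xb0\x83\xe6\x95\xb4.jpg"

try:
    raw.decode("ascii")          # what the implicit conversion amounts to
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc            # byte 0xe8 is outside the ASCII range

text = raw.decode("utf-8")       # the explicit, correct decode

print(ascii_error)
print(text)
```

The error is raised at the first byte above 0x7f, which is why only filenames containing non-ASCII characters trigger it.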

def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')

    return HTMLParser.HTMLParser().unescape(string)
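For readers on Python 3, a rough equivalent of the function above might look like the sketch below. It swaps in `urllib.parse.unquote_to_bytes` and `html.unescape`, which are the Python 3 standard library counterparts and not what the question uses; the name `cleanup_string` and the sample values are illustrative assumptions.

```python
import html
from urllib.parse import unquote_to_bytes


def cleanup_string(name):
    """Percent-decode to bytes, decode those bytes as UTF-8, then resolve HTML entities."""
    raw = unquote_to_bytes(name)   # "%E8%B0%83" becomes b"\xe8\xb0\x83"
    text = raw.decode("utf-8")     # explicit decode, never an implicit ASCII one
    return html.unescape(text)     # "&amp;" becomes "&"


# Sample filename taken from the question.
sample = "thai_trip_%E8%B0%83%E6%95%B4%E5%A4%A7%E5%B0%8F%20RAY_5313.jpg"
print(cleanup_string(sample))
```

Keeping the percent-decoding and the UTF-8 decoding as separate steps makes it obvious where an encoding assumption enters; if the names were not UTF-8, only the `decode` call would need to change.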

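Your next problem is that you now have unicode filenames, but your filesystem needs them in some particular encoding in order to work with them. You can check what that encoding is with sys.getfilesystemencoding(); use it to re-encode your filenames: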

def cleanupString(string):
    string = urllib2.unquote(string).decode('utf8')

    return HTMLParser.HTMLParser().unescape(string).encode(sys.getfilesystemencoding())
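As a small illustration of the filesystem-encoding step, the Python 3 sketch below uses `os.fsdecode` and `os.fsencode`, which apply `sys.getfilesystemencoding()` for you. The byte string is an assumed example of a name as it might be stored on a UTF-8 filesystem; the actual encoding depends on the platform and locale.

```python
import os
import sys

# Assumed example: a filename as raw bytes on a UTF-8 filesystem.
raw_name = b"thai_trip_\xe8\xb0\x83\xe6\x95\xb4.jpg"

# Report which encoding Python uses for filenames on this system.
print(sys.getfilesystemencoding())

# fsdecode/fsencode convert between bytes and text using that encoding
# together with its error handler, so this round trip preserves the bytes.
as_text = os.fsdecode(raw_name)
as_bytes = os.fsencode(as_text)

print(as_bytes == raw_name)
```

In Python 2 there is no such helper, which is why the answer calls `.encode(sys.getfilesystemencoding())` directly.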

You can read up on how Python handles Unicode in the Unicode HOWTO.

Answer 2

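It looks like you are running into this issue. I would try reversing the order in which you call unescape and unquote, since unquote is what adds non-ASCII characters to your filename, though that may not fix the problem.

What is the actual filename it is choking on?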
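One thing to keep in mind before swapping the calls: the two orders are not interchangeable in general. The Python 3 sketch below uses a made-up filename containing a percent-encoded "&amp;" to show that they can give different results. It says nothing about whether the swap avoids the original Python 2 error.

```python
import html
from urllib.parse import unquote

# Hypothetical filename: "%26" is "&" and "%3B" is ";", so it hides an "&amp;".
name = "a%26amp%3Bb.jpg"

# Percent-decode first: the entity becomes visible and is then resolved.
unquote_then_unescape = html.unescape(unquote(name))

# Unescape first: there is no literal "&" yet, so the entity survives.
unescape_then_unquote = unquote(html.unescape(name))

print(unquote_then_unescape)
print(unescape_then_unquote)
```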

UnicodeDecodeError when processing filenames