Question

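I have a PHP script that builds a list of the files in a directory. However, PHP only sees file names made of English characters and completely ignores file names in other languages, such as Russian or Asian languages.

After a lot of effort, the only solution I found that works is a python script that renames the files to UTF8, so that the PHP script can process them afterwards.

(After PHP has finished processing the files, I rename them to English; I do not keep them in UTF8.)

I used the following python script, which works fine: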

import sys
import os
import glob
import ntpath
from random import randint

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      infile_utf8 = infile.encode('utf8')
      os.rename(infile, infile_utf8)

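The problem is that it also converts file names that are already UTF8. I need a way to skip the conversion if the file name is already in UTF8.

I am trying this python script: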

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        infile.decode('UTF-8', 'strict')
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

However, if the file name is already in utf8, I get a fatal error:

UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)

I also tried another way, which did not work either:

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        tmpstr = str(infile)
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

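I get exactly the same error as before.

Any ideas?

I am a newbie with Python, and debugging even a simple script is a huge effort for me, so please write an explicit answer (i.e. code). I am not able to test general ideas that may or may not work. Thanks.

Examples of file names: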

 hello.txt
 你好.txt
 안녕하세요.html
 chào.doc

Answer 1

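I think you are mixing up your terminology and making some wrong assumptions. AFAIK, PHP can open file names of any encoding type; PHP is very much agnostic about encoding types.

You have not been clear about what you want to achieve: UTF-8 != English, and the example foreign file names could be encoded in a number of ways, but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like, and what a non-UTF-8 file is?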

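To add to your confusion, under Windows, file names are transparently stored as UTF-16. Therefore, you should not try to encode the file names to UTF-8. Instead, you should use Unicode strings and allow Python to work out the proper conversion. (Don't encode in UTF-16 either!)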
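To illustrate the point that a file name is text and the encoding is only a storage detail, here is a minimal sketch. It is written for Python 3 (the code elsewhere in this thread is Python 2), and the variable names are made up for illustration.

```python
# A file name is text; UTF-8 and UTF-16 are two byte representations of it.
name = "你好.txt"

as_utf8 = name.encode("utf-8")
# Little-endian UTF-16, the form used by the Windows wide-character APIs.
as_utf16 = name.encode("utf-16-le")

# bytes objects print as ASCII-safe reprs, so this works in any console.
print(as_utf8)
print(as_utf16)

# Different bytes, but the same text once decoded with the matching codec.
same_text = as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le") == name
print(same_text)
```

If you pass Unicode strings such as `name` to the `os` functions, Python takes care of the conversion to whatever the operating system expects.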

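Please clarify your question further.

Update:

I now understand the problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-latin characters are troublesome in PHP + Windows. It seems that only files whose names are made up of Windows 1252 character set characters can be seen and opened.

The challenge you have is to convert your file names to be Windows 1252 compatible. As you stated in your question, it would be ideal not to rename files that are already compatible. I have reworked your attempt as: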

import os
from glob import glob
import shutil
import urllib

files = glob(u'*.txt')
for my_file in files:
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        my_file.encode("cp1252" , "strict")
        print "    Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        print "    Can not convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name )
        print "    New name: (%% encoded): %s" % new_name

    shutil.copy2(my_file, os.path.join("fixed", new_name))

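Breakdown:

1. Print the file name. By default, the Windows shell only shows results in the local DOS codepage. For example, my shell can show ü.txt, but €.txt shows as ?.txt. Therefore, you need to be careful about Python throwing an exception because it cannot print properly. This code attempts to print the Unicode version but falls back to printing Unicode code point escapes.
2. Try to encode the string as Windows-1252. If this works, the file name is fine.
3. Otherwise: convert the file name to UTF-8, then percent-encode it. This way, the file name stays unique and you can reverse the process in PHP.
4. Copy the file to the new/verified file name.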

For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt
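The naming rule on its own (steps 2 and 3) can be tried out without touching any files. The following is only a sketch for Python 3, where `urllib.quote` has moved to `urllib.parse.quote`; the helper name `php_safe_name` is hypothetical and not part of the script above.

```python
from urllib.parse import quote, unquote


def php_safe_name(name):
    """Return name unchanged if it fits in cp1252, else percent-encoded UTF-8.

    Hypothetical helper, for illustration only.
    """
    try:
        name.encode("cp1252", "strict")
        return name
    except UnicodeEncodeError:
        # quote() encodes a str as UTF-8 by default before escaping it.
        return quote(name)


for original in ["hello.txt", "chào.doc", "你好.txt", "안녕하세요.html"]:
    safe = php_safe_name(original)
    # ascii() keeps the output printable even in a limited console codepage.
    print(ascii(original), "->", ascii(safe))
    # For these examples, unquote() recovers the original name.
    assert unquote(safe) == original
```

Note that a name which is already cp1252 compatible but contains a literal % would not survive the `unquote` round trip, so a real script would need to decide how to handle that case.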

Answer 2

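For all UTF-8 issues with Python, I warmly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder (http://nedbatchelder.com/text/unipain.html) at PyCon 2012. For me it was a revelation! A lot of the material in this presentation is in fact not Python-specific, but it helps in understanding important things such as the difference between Unicode strings and UTF-8 encoded bytes...

The reason I am recommending this video to you (as I have done for many friends) is that some of your code contains contradictions, such as trying to decode and then encode if the decode fails: those methods cannot apply to the same object! Even though in Python 2 it may be syntactically possible, it makes no sense, and in Python 3 the distinction between bytes and str makes things clearer:

A str object can be encoded into bytes: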

>>> a = 'a'
>>> type(a)
<class 'str'>
>>> a.encode
<built-in method encode of str object at 0x7f1f6b842c00>
>>> a.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

... and a bytes object can be decoded into str:

>>> b = b'b'
>>> type(b)
<class 'bytes'>
>>> b.decode
<built-in method decode of bytes object at 0x7f1f6b79ddc8>
>>> b.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

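Coming back to your problem of working with file names, the tricky question you need to answer is: "what is the encoding of your file names?" The language does not matter, only the encoding!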
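If what you have is raw bytes (and only then), asking "is this UTF-8?" comes down to attempting a strict decode. This is a minimal Python 3 sketch; the function name `is_valid_utf8` and the sample values are made up for illustration.

```python
def is_valid_utf8(raw):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8", "strict")
        return True
    except UnicodeDecodeError:
        return False


utf8_bytes = "你好.txt".encode("utf-8")
cp1252_bytes = "chào.doc".encode("cp1252")  # b'ch\xe0o.doc'
ascii_bytes = b"hello.txt"

print(is_valid_utf8(utf8_bytes))
print(is_valid_utf8(cp1252_bytes))
print(is_valid_utf8(ascii_bytes))  # pure ASCII is also valid UTF-8
```

A successful decode only shows that the bytes are well-formed UTF-8, not that UTF-8 is the encoding that was intended. The check also makes no sense for a str (Unicode) object, which has no encoding at all.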

How can Python check whether a file name is UTF8?
