Question

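I am trying to extract directory names from a .PLM file using Python 2.7 on Windows 10. The .PLM file is a proprietary file format used by Panasonic voice recorders to store the names of recording directories.

(For example: say I have a recording that I want to save in a folder named "HelloÆØÅ". The recorder then creates a folder named "SV_VC001" and a file named "SD_VOICE.PLM" which, among a bunch of other data, stores the string "HelloÆØÅ".)

Now, I am Danish, so I use the characters Æ, Ø and Å, which ASCII does not support, so I have to convert this binary data to unicode.

So far, I know that the directory name is stored starting at byte 56 and is terminated by a byte of all zeros. For example, one recording is stored in a directory named "2-3-15Årstidskredsløbetmichael", which has the hex values: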

322d 332d 3135 20c5 7274 6964 7320 6b72 
6564 736c f862 6574 206d 6963 6861 656c

This is the code I am using so far:

# Finds the filename in the .PLM-file
def  FindFileName(File):
    # Opens the file and points to byte 56, where the file name starts
    f = open(File,'rb')
    f.seek(56)
    Name = ""


    byte = f.read(1)        # Reads the first byte after byte 56
    while byte != "\x00":   # Runs the loop, until a NUL-character is found (00 is NUL in hex)
        Name += str(byte)   # Appends the current byte to the string Name
        byte = f.read(1)    # reads the next byte

    f.close()

    return Name

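This works, as long as the directory name uses only ASCII characters (so no 'æ', 'ø' or 'å').

However, if the string contains unicode characters, they are converted into other characters. For the directory "2-3-15Årstidskredsløbetmichael", this program outputs "2-3-15┼rtidskredsl°bet michael".

Do you have any suggestions? Thank you very much in advance.

Edit

After adding Mark Ransom's suggestion, the code is as follows. I have also clumsily tried to handle the 3 edge cases I found: question marks are changed to spaces, and \xc5 and \xd8 (Å and Ø respectively, in hex) are changed to å and ø.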

def FindFileName(File):
    # Opens the file and points to byte 56, where the file name starts
    f = open(File,'rb')
    f.seek(56)
    Name = ""


    byte = f.read(1)        # Reads the first byte after byte 56
    while byte and (byte != "\x00"):    # Runs the loop until a NUL character is found (00 is NUL in hex) or the file ends

        # Since there are problems with "?" in directory names, we change those to spaces
        if byte == "?":
            Name += " "
        elif byte == "\xc5":
            Name += "å"
        elif byte == "\xd8":
            Name += "ø"
        else:
            Name += byte

        byte = f.read(1)    # reads the next byte (must stay inside the loop)

    f.close()

    return Name.decode('mbcs')

For uppercase Æ, Ø and Å this produces the following error:

WindowsError: [Error 123] The filename, directory name, or volume label syntax is incorrect: u'C:\\Users\\UserName\\Desktop\\TestDir\\Mapper\\13-10*14 ESSOTERISK \xc5NDSSTR\xd8MNIN'

The string should be "13-10*14 ESSOTERISK ÅNDSSTRØMNIN", but Å and Ø (hex c5 and d8) throw the error.

Answer 1

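In Python 2, reading from a binary file returns a string, so there is no need to call str on it. Also, if the file is malformed for some reason and contains no zero byte, read will return an empty string at end of file and the loop will never end. You can check for both conditions with a small modification to the test.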

while byte and (byte != "\x00"):   # Runs the loop, until a NUL-character is found (00 is NUL in hex)
    Name += byte        # Appends the current byte to the string Name
    byte = f.read(1)    # reads the next byte
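
To try the loop without a real recorder file, here is a minimal self-contained sketch. The helper name read_nul_terminated and the in-memory fake file are my own stand-ins, not part of the .PLM format; only the offset 56 and the NUL terminator come from the question.

```python
import io

def read_nul_terminated(stream, offset=56):
    """Return the raw bytes from `offset` up to (not including) the first NUL byte."""
    stream.seek(offset)
    chunks = []
    byte = stream.read(1)
    # Stop at end of file (empty read) or at the NUL terminator.
    while byte and byte != b"\x00":
        chunks.append(byte)
        byte = stream.read(1)
    return b"".join(chunks)

# Hypothetical stand-in for a .PLM file: 56 filler bytes, a name, a NUL, more data.
fake_plm = io.BytesIO(b"\xff" * 56 + b"Hello" + b"\x00" + b"trailing data")
raw_name = read_nul_terminated(fake_plm)

# A malformed file with no NUL byte still terminates, at end of file.
truncated = io.BytesIO(b"\xff" * 56 + b"abc")
raw_truncated = read_nul_terminated(truncated)
```

The function returns undecoded bytes on purpose, so that decoding stays a separate step.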

Once you have the complete byte sequence, you must convert it to a Unicode string. For that you need decode:

Name = Name.decode("utf-8")

As mentioned in the comments, your string is not actually UTF-8 but one of Microsoft's code pages. You can decode using the code page Windows is currently using:

Name = Name.decode("mbcs")

You can also specify the code page explicitly; see the documentation.
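
As a rough sketch of decoding with an explicit code page: the snippet below assumes the recorder writes names in Windows-1252 (cp1252). That is a guess on my part, not something documented for the file format, but in cp1252 byte c5 is Å and byte f8 is ø, which fits the hex dump in the question.

```python
import binascii

# Bytes of the directory name, copied from the hex dump in the question.
hex_dump = (
    "322d 332d 3135 20c5 7274 6964 7320 6b72 "
    "6564 736c f862 6574 206d 6963 6861 656c"
)
raw_name = binascii.unhexlify(hex_dump.replace(" ", ""))

# Assumption: the name is stored as Windows-1252; swap in another codec if not.
name = raw_name.decode("cp1252")

# The two non-ASCII bytes become U+00C5 and U+00F8.
has_a_ring = u"\u00c5" in name
has_o_slash = u"\u00f8" in name
```

Unlike "mbcs", an explicit codec name gives the same result on any machine, whatever its Windows locale is.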

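You may run into trouble when trying to print the string to the console, because the Windows console does not use the same code page as the rest of the system; it may not have the characters you need to print.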
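
To illustrate the mismatch, here is a small sketch that interprets the same two bytes under different code pages. I am assuming the console uses OEM code page 437 (850 gives the same result for these two bytes) and that the data itself is Windows-1252; neither is confirmed by the question, but it would explain the ┼ and ° in the output shown there.

```python
# Part of the directory name from the question, as raw bytes.
raw = b"\xc5rtids kredsl\xf8bet"

# Assumed encoding of the data: 0xC5 -> U+00C5 (A with ring), 0xF8 -> U+00F8 (o with stroke).
as_windows = raw.decode("cp1252")

# Assumed console code page: 0xC5 -> U+253C (box-drawing cross), 0xF8 -> U+00B0 (degree sign).
as_console = raw.decode("cp437")

# Code page 850 maps these two bytes the same way as 437.
same_on_850 = raw.decode("cp850") == as_console
```

So the bytes read from the file can be fine even when the printed text looks wrong; checking repr(name) rather than the printed text avoids the console entirely.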

Python: searching for a unicode string in a binary file (.PLM)
