Question

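I'm working with several binary files and I want to parse the UTF-8 strings that exist in them.

I currently have a function that takes the starting location in a file, then returns the string found: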

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter != None and index != None):
       return file.read(size).explode('0x00000000')[index] #incorrect
   else:
       return file.read(size)

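Some of the strings in the file are separated by 0x00 00 00 00. Is it possible to split them the way PHP's explode does? I'm new to Python, so any pointers on improving the code are welcome.

Sample file:

48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00

This is Hello World123; I have marked the 00 00 00 00 separator by enclosing it in | bars.

So: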

str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'

Similarly:

str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123'
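
For experimenting with the layout above, the sample bytes can be rebuilt in memory. This is a minimal sketch for Python 3; the names SAMPLE_HEX, sample and fobj are illustrative and not part of the original post, and io.BytesIO stands in for a real file opened in binary mode.

```python
import io

# Hex dump from the sample above; the four-NUL delimiter sits in the middle.
SAMPLE_HEX = (
    "48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 "
    "00 00 00 00 "
    "31 00 32 00 33 00"
)

sample = bytes.fromhex(SAMPLE_HEX)  # whitespace between byte pairs is ignored
fobj = io.BytesIO(sample)           # file-like object supporting seek() and read()

print(len(sample))  # 32 bytes, which matches the 0x20 size used in the calls above
```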

Answer 1

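I'm going to assume you are using Python 2 here, but I will write the code so that it works on both Python 2 and Python 3.

You have UTF-16 data, not UTF-8. You can read it as binary data and split on the four NUL bytes with the str.split() method (bytes.split() on Python 3):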

file.read(size).split(b'\x00' * 4)[index]

The resulting data is encoded as UTF-16 little-endian (you may or may not have omitted the UTF-16 BOM at the start); you can decode it with:

result.decode('utf-16-le')

This will fail, however, because the split cuts the text off at its last NUL byte. In UTF-16 little-endian every ASCII character ends in a NUL byte, so the final character of the text plus the delimiter form five NULs in a row; Python splits on the first four NULs it finds and does not skip the NUL byte that belongs to the text.
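
The failure can be reproduced with the question's sample bytes. This is a minimal sketch for Python 3; the variable names are illustrative, and the data is built in memory instead of being read from a file.

```python
# Sample bytes from the question: two UTF-16-LE strings separated by four NULs.
data = u"Hello World".encode("utf-16-le") + b"\x00" * 4 + u"123".encode("utf-16-le")

# Byte-level split: the first match starts at the high byte of "d" (64 00),
# so the first piece has an odd length and cannot be decoded as UTF-16.
first = data.split(b"\x00" * 4)[0]
try:
    first.decode("utf-16-le")
    byte_split_ok = True
except UnicodeDecodeError:
    byte_split_ok = False

# Decoding first and then splitting on two NUL code points keeps characters whole.
parts = data.decode("utf-16-le").split(u"\x00" * 2)

print(len(first), byte_split_ok, parts)
```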

A better idea is to decode to Unicode first, then split on two consecutive Unicode NUL code points:

file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index]

Putting this together as a function would be:

def str_extract(file, start, size, delimiter=None, index=None):
    file.seek(start)
    if delimiter is not None and index is not None:
        delimiter = delimiter.decode('utf-16-le')  # or pass in Unicode
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    else:
        return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)

If the file has a BOM at the start, consider opening the file as UTF-16 to begin with:

import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....

and remove the explicit decoding.

Python 2 demo:
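
A minimal sketch of what such a demo could look like, assuming the question's sample bytes are held in an in-memory io.BytesIO instead of a real file. The names sample, first and second are illustrative, and the code is written to run on Python 2.7 and Python 3 alike; it is not the original session output.

```python
import io


def str_extract(file, start, size, delimiter=None, index=None):
    # Same logic as the answer's str_extract: decode first, then split.
    file.seek(start)
    if delimiter is not None and index is not None:
        delimiter = delimiter.decode('utf-16-le')  # four NUL bytes -> two NUL code points
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    return file.read(size).decode('utf-16-le')


# Hypothetical stand-in for the binary file: the question's 32 sample bytes.
sample = (u'Hello World'.encode('utf-16-le')
          + b'\x00' * 4
          + u'123'.encode('utf-16-le'))
fobj = io.BytesIO(sample)

first = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
second = str_extract(fobj, 0, 0x20, b'\x00' * 4, 1)
print(repr(first))
print(repr(second))
```

Because the delimiter is decoded with the same codec as the data, the caller can keep passing it as raw bytes.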

Answer 2

First, you need to open the file in binary mode.

Then split the str (or bytes, depending on your Python version) on the delimiter, which is four zero bytes, b'\0\0\0\0':

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter is not None and index is not None):
       return file.read(size).split(delimiter)[index]
   else:
       return file.read(size)

In addition, you need to handle the encoding, because str_extract only returns binary data and your test data is in UTF-16 little-endian, as Martijn Pieters pointed out:

>>> str_extract(file, 0x00, 0x20, b'\0\0\0\0', 0).decode('utf-16-le')
u'Hello World'
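
One caveat, noted in Answer 1: when the text before the delimiter ends in a character whose high byte is NUL, a byte-level split can cut that character in half. A minimal sketch of one way around this while still splitting bytes, assuming a hypothetical helper split_aligned (not part of either answer) that only accepts delimiter matches starting at an even byte offset:

```python
def split_aligned(data, delimiter=b"\x00" * 4, unit=2):
    """Split bytes on delimiter, ignoring matches that start mid-character."""
    parts = []
    start = 0
    pos = data.find(delimiter, start)
    while pos != -1:
        if pos % unit == 0:
            # Match begins on a 2-byte code unit boundary: accept it.
            parts.append(data[start:pos])
            start = pos + len(delimiter)
            pos = data.find(delimiter, start)
        else:
            # Match begins inside a code unit: look for the next candidate.
            pos = data.find(delimiter, pos + 1)
    parts.append(data[start:])
    return parts


data = u"Hello World".encode("utf-16-le") + b"\x00" * 4 + u"123".encode("utf-16-le")
pieces = [part.decode("utf-16-le") for part in split_aligned(data)]
print(pieces)
```

The helper assumes the data starts on a code unit boundary; decoding first and splitting the resulting text, as in Answer 1, avoids the alignment question altogether.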

Also: use is not None to test that a variable is not None.
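
A short illustration of why the identity test is the safer choice, assuming Python 3 and a hypothetical class AlwaysEqual (not from the question) whose __eq__ claims equality with everything:

```python
class AlwaysEqual:
    """Hypothetical class whose equality check always succeeds."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# != goes through the object's own comparison logic and can be fooled.
ne_result = (obj != None)  # noqa: E711, deliberate for the demonstration

# "is not" compares identity and cannot be overridden by the class.
is_not_result = (obj is not None)

print(ne_result, is_not_result)
```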

Python: splitting bytes using a hex delimiter

2 answers: