Question

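I am writing a Python script that reads Unicode characters from a file and inserts them into a database. I can insert only 30 bytes of each string. How do I calculate the size of a string in bytes before inserting it into the database?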

Answer 1

If you need the number of bytes in the file (the file size), just call
bytes_count = os.path.getsize(filename).

If you want to know how many bytes a Unicode character may require, that depends on the character encoding:

>>> print(u"\N{EURO SIGN}")
€
>>> u"\N{EURO SIGN}".encode('utf-8') # 3 bytes
'\xe2\x82\xac'
>>> u"\N{EURO SIGN}".encode('cp1252') # 1 byte
'\x80'
>>> u"\N{EURO SIGN}".encode('utf-16le') # 2 bytes
'\xac '
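Applied to the 30-byte limit from the question, the same idea can be sketched as follows. This is a minimal sketch that assumes Python 3 and a UTF-8 database column; the helper names byte_size and truncate_to_bytes are hypothetical, not part of any library.

```python
def byte_size(text, encoding="utf-8"):
    """Return how many bytes `text` occupies once encoded."""
    return len(text.encode(encoding))


def truncate_to_bytes(text, max_bytes=30, encoding="utf-8"):
    """Shorten `text` so its encoded form fits in `max_bytes`.

    Assumes UTF-8: slicing the bytes may cut a multi-byte character in half,
    and errors="ignore" drops that incomplete trailing sequence.
    """
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    return encoded[:max_bytes].decode(encoding, errors="ignore")


if __name__ == "__main__":
    sample = "a" + "\N{EURO SIGN}" * 10  # 1 + 10 * 3 = 31 bytes in UTF-8
    print(byte_size(sample))
    print(byte_size(truncate_to_bytes(sample)))
```

Truncating on the encoded bytes rather than on the characters keeps the result within the limit regardless of how many bytes each character takes. It can still split a user-perceived character made of several code points.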

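To find out how many Unicode characters a file contains, you do not have to read the whole file into memory at once (which matters if it is a large file):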

with open(filename, encoding=character_encoding) as file:
    unicode_character_count = sum(len(line) for line in file)

If you are using Python 2, add from io import open at the top.

The exact count for the same human-readable text may depend on Unicode normalization (different environments may use different settings):

>>> import unicodedata
>>> print(u"\u212b")
Å
>>> unicodedata.normalize("NFD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFC", u"\u212b") # 1 Unicode codepoint
u'\xc5'
>>> unicodedata.normalize("NFKD", u"\u212b") # 2 Unicode codepoints
u'A\u030a'
>>> unicodedata.normalize("NFKC", u"\u212b") # 1 Unicode codepoint
u'\xc5'

As the example shows, a single character (Å) can be represented by more than one Unicode code point.
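Because the number of code points changes with the normalization form, the encoded byte count changes too, which matters for a byte-limited database column. A minimal sketch, assuming Python 3 and UTF-8; normalized_size is a hypothetical helper name:

```python
import unicodedata


def normalized_size(text, form, encoding="utf-8"):
    """Return (code point count, byte count) of `text` after normalization."""
    normalized = unicodedata.normalize(form, text)
    return len(normalized), len(normalized.encode(encoding))


if __name__ == "__main__":
    angstrom = "\u212b"  # ANGSTROM SIGN
    for form in ("NFC", "NFD"):
        print(form, normalized_size(angstrom, form))
```

Normalizing to one form before measuring (and before inserting) keeps the byte count consistent between environments.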

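To find out how many user-perceived characters a file contains, you can use the \X regular expression (it matches extended grapheme clusters):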

import regex # $ pip install regex

with open(filename, encoding=character_encoding) as file:
    character_count = sum(len(regex.findall(r'\X', line)) for line in file)

Example:

>>> import regex
>>> char = u'A\u030a'
>>> print(char)
Å
>>> len(char)
2
>>> regex.findall(r'\X', char)
['Å']
>>> len(regex.findall(r'\X', char))
1

Answer 2

Suppose you are reading the Unicode characters from a file into a variable named byteString. Then you can do the following:

unicode_string = byteString.decode("utf-8")  # bytes -> text
print(len(unicode_string))  # number of Unicode code points, not bytes
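Note that the length of the decoded string and the length of the original bytes generally differ, and the question's 30-byte limit concerns the latter. A minimal sketch, assuming Python 3 and UTF-8 input; count_chars_and_bytes is a hypothetical helper name:

```python
def count_chars_and_bytes(byte_string, encoding="utf-8"):
    """Return (number of code points, number of bytes) for encoded input."""
    text = byte_string.decode(encoding)
    return len(text), len(byte_string)


if __name__ == "__main__":
    # "café €5" written with escapes: é takes 2 bytes and € takes 3 in UTF-8.
    data = "caf\u00e9 \u20ac5".encode("utf-8")
    print(count_chars_and_bytes(data))
```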
