Question

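I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, their utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4; and, someday in the future, utf8 might support them as well.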

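But my server is not ready to upgrade to MySQL 5.5, so I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: how do I filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.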

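In other words, I want a behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want a behavior similar to encode(), but I don't want to actually encode the string. I want to still have a unicode string after filtering.

I don't want to escape the characters before storing them in MySQL, because that would mean I would need to unescape every string I get back from the database, which is very annoying and unfeasible.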

See also:

"Incorrect string value" warning when saving some unicode characters to MySQL (in the Django ticket system)
A related question (on Stack Overflow)

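[EDIT] Added tests of the proposed solutions

So far I've got good answers. Thanks, people! Now, in order to choose one of them, I ran a quick test to find the simplest and fastest one.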

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

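Results:

filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds in the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds in the join() call and 1.900 CPU seconds evaluating the generator expression)
I did not test the itertools solution because... well... that solution, although interesting, was quite big and complex.

Conclusion

The RegEx solution was, by far, the fastest one.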
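
For readers on Python 3, the comparison above can be approximated with the sketch below. It is not the original script: it assumes Python 3 (where `str` is always a sequence of code points), swaps cProfile for `timeit`, and uses a made-up `sample` string, so the absolute numbers will not match the ones reported above.

```python
import re
import timeit

# Same character class as the test script: anything that is not in
# U+0000-U+D7FF or U+E000-U+FFFF (lone surrogates and non-BMP characters).
RE_PATTERN = re.compile('[^\u0000-\uD7FF\uE000-\uFFFF]')

def filter_using_re(text):
    return RE_PATTERN.sub('\uFFFD', text)

def filter_using_python(text):
    return ''.join(
        c if c < '\ud800' or '\ue000' <= c <= '\uffff' else '\ufffd'
        for c in text
    )

# Hypothetical sample data: mostly BMP text with a few 4-byte characters.
sample = 'abc \u20ac \U0001d41f\U0001d428\U0001d428 ' * 256

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(sample), number=10)
    print('%s: %.4f s' % (func.__name__, seconds))
```

Both functions should return the same string for any input; only their running time differs.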

Answer 1

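Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF have encodings of 3 bytes (or fewer) in UTF-8. The \uD800-\uDFFF range is reserved for surrogates, which UTF-16 uses to encode characters above \uFFFF. I do not know Python, but you should be able to set up a regular expression that matches anything outside those ranges.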

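# narrow Python 2 builds: a surrogate code unit followed by one more unit
pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
# wide Python 2 builds: any code point above U+FFFF
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)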

Edit: adding the Python code from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
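
The byte counts claimed above can be checked directly. The following is a small sketch, assuming Python 3; the helper name `utf8_length` is made up for this illustration.

```python
def utf8_length(code_point):
    """Number of bytes the given code point needs in UTF-8."""
    return len(chr(code_point).encode('utf-8'))

# Code points on each side of the 1/2/3/4-byte boundaries.
boundaries = (0x7F, 0x80, 0x7FF, 0x800, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF)

for code_point in boundaries:
    print('U+%04X -> %d byte(s)' % (code_point, utf8_length(code_point)))
```

Everything up to U+FFFF fits in at most 3 bytes, and everything from U+10000 upward needs 4. The surrogate range U+D800-U+DFFF is left out of the loop because a strict UTF-8 encoder refuses to encode it.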

Answer 2

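You can skip the decoding and encoding steps and directly inspect the first byte of each character in the encoded (8-bit) string. According to UTF-8: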

#1-byte characters have the following format: 0xxxxxxx
#2-byte characters have the following format: 110xxxxx 10xxxxxx
#3-byte characters have the following format: 1110xxxx 10xxxxxx 10xxxxxx
#4-byte characters have the following format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Based on that, you only need to check the value of the first byte of each character to filter out 4-byte characters:

def filter_4byte_chars(s):
    i = 0
    j = len(s)
    # you need to convert
    # the immutable string
    # to a mutable list first
    s = list(s)
    while i < j:
        # get the value of this byte
        k = ord(s[i])
        # this is a 1-byte character, skip to the next byte
        if k <= 127:
            i += 1
        # this is a 2-byte character, skip ahead by 2 bytes
        elif k < 224:
            i += 2
        # this is a 3-byte character, skip ahead by 3 bytes
        elif k < 240:
            i += 3
        # this is a 4-byte character, remove it and update
        # the length of the string we need to check
        else:
            s[i:i+4] = []
            j -= 4
    return ''.join(s)

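Skipping the decoding and encoding steps saves some time, and for smaller strings made up mostly of 1-byte characters this could even be faster than the regular expression filtering.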
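
The function above removes 4-byte characters from a Python 2 byte string. A variant that replaces them with U+FFFD instead, as the question asks, might look like the sketch below. It assumes Python 3 and input that is already valid UTF-8 held in a `bytes` object; the name `replace_4byte_sequences` is made up for this illustration.

```python
REPLACEMENT = '\ufffd'.encode('utf-8')  # b'\xef\xbf\xbd'

def replace_4byte_sequences(data):
    """Replace every 4-byte UTF-8 sequence in `data` (bytes) with U+FFFD.

    Assumes `data` is well-formed UTF-8; malformed input is not detected.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        lead = data[i]          # indexing bytes yields an int in Python 3
        if lead < 0x80:         # 0xxxxxxx: 1-byte character
            width = 1
        elif lead < 0xE0:       # 110xxxxx: 2-byte character
            width = 2
        elif lead < 0xF0:       # 1110xxxx: 3-byte character
            width = 3
        else:                   # 11110xxx: 4-byte character
            width = 4
        if width == 4:
            out += REPLACEMENT
        else:
            out += data[i:i + width]
        i += width
    return bytes(out)
```

Building a new `bytearray` avoids the repeated slice deletion on a list, which can be costly when many characters have to be dropped.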

Answer 3

Encode as UTF-16, then re-encode each 16-bit unit as UTF-8.

>>> import struct
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'

Note that you can't encode after joining, since the surrogate pairs may be decoded back into single characters before re-encoding.
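
The arithmetic behind this trick can be spelled out explicitly. The sketch below assumes Python 3, where the `'surrogatepass'` error handler is needed to encode a lone surrogate; the function names are made up for this illustration.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def encode_surrogates_as_utf8(text):
    """Encode `text` so that every character takes at most 3 bytes.

    Characters above U+FFFF become two 3-byte sequences, one per surrogate.
    """
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if code_point > 0xFFFF:
            for unit in to_surrogate_pair(code_point):
                out += chr(unit).encode('utf-8', 'surrogatepass')
        else:
            out += ch.encode('utf-8', 'surrogatepass')
    return bytes(out)

print(encode_surrogates_as_utf8('\U0001d41f\U0001d428\U0001d428'))
```

The result is not well-formed UTF-8 in the strict sense, so a strict decoder will reject it.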

EDIT:

MySQL (at least 5.1.47) has no problem dealing with surrogate pairs:

mysql> create table utf8test (t character(128)) collate utf8_general_ci;
Query OK, 0 rows affected (0.12 sec)

  ...

>>> cxn = MySQLdb.connect(..., charset='utf8')
>>> csr = cxn.cursor()
>>> t = u'\U0001d41f\U0001d428\U0001d428'
>>> e = t.encode('utf-16le')
>>> v = ''.join(unichr(x).encode('utf-8') for x in struct.unpack('<' + 'H' * (len(e) // 2), e))
>>> v
'\xed\xa0\xb5\xed\xb0\x9f\xed\xa0\xb5\xed\xb0\xa8\xed\xa0\xb5\xed\xb0\xa8'
>>> csr.execute('insert into utf8test (t) values (%s)', (v,))
1L
>>> csr.execute('select * from utf8test')
1L
>>> r = csr.fetchone()
>>> r
(u'\ud835\udc1f\ud835\udc28\ud835\udc28',)
>>> print r[0]
𝐟𝐨𝐨

Answer 4

Just for the fun of it, an itertools monstrosity :)

import functools as ft, itertools as it, operator as op

def max3bytes(unicode_string):

    # sequence of pairs of (char_in_string, u'\N{REPLACEMENT CHARACTER}')
    pairs= it.izip(unicode_string, it.repeat(u'\ufffd'))

    # is 65535 less than the argument, i.e. is the character outside the BMP?
    selector= ft.partial(op.lt, 65535)

    # using the character ordinals, return 0 or 1 based on `selector`
    indexer= it.imap(selector, it.imap(ord, unicode_string))

    # now pick the correct item for all pairs
    return u''.join(it.imap(tuple.__getitem__, pairs, indexer))
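
On Python 3 the same idea needs no `izip` or `imap`, since the built-in `zip` and `map` are already lazy. A possible port, assuming Python 3 and keeping the structure of the function above:

```python
import functools as ft
import itertools as it
import operator as op

def max3bytes(text):
    # lazy sequence of (char_in_string, replacement character) pairs
    pairs = zip(text, it.repeat('\ufffd'))

    # selector(x) evaluates 0xFFFF < x, i.e. "is x outside the BMP?"
    selector = ft.partial(op.lt, 0xFFFF)

    # False (0) keeps the original character, True (1) picks the replacement
    indexer = map(selector, map(ord, text))

    # index each pair with its 0/1 selector result
    return ''.join(map(tuple.__getitem__, pairs, indexer))
```

For example, `max3bytes('a\U0001d41fb')` returns `'a\ufffdb'`. This works because `bool` is a subclass of `int`, so `True` and `False` are valid tuple indexes.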

Answer 5

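According to the MySQL 5.1 documentation: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP." This indicates that there might be a problem with surrogate pairs.

Note that the Unicode standard 5.2, chapter 3, actually forbids encoding a surrogate pair as two 3-byte UTF-8 sequences instead of one 4-byte UTF-8 sequence. See for example page 93: "Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points D800..DFFF is ill-formed." However, this proscription is, as far as I know, largely unknown or ignored.

It may well be a good idea to check what MySQL does with surrogate pairs. If they are not to be retained, this code will provide a simple-enough check: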

all(uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' for uc in unicode_string)

and this code will replace any "nasties" with u'\ufffd':

u''.join(
    uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
    for uc in unicode_string
    )

Answer 6

I guess it's not the fastest, but it is quite straightforward ("pythonic" :)):

def max3bytes(unicode_string):
    return u''.join(uc if uc <= u'\uffff' else u'\ufffd' for uc in unicode_string)

Note: this code does not take into account the fact that Unicode has surrogate characters in the range U+D800-U+DFFF.
