Am I going about this the right way? Either way, I'm parsing a lot of HTML, but I don't always know what encoding it's meant to be (a surprising number of pages lie about it). The code below shows what I've been doing so far, but I'm sure there's a better way. Your suggestions would be much appreciated.
import logging
import codecs
from utils.error import Error
class UnicodingError(Error):
    pass
# these encodings should be in most likely order to save time
encodings = [ "ascii", "utf_8", "big5", "big5hkscs", "cp037", "cp424", "cp437", "cp500", "cp737", "cp775", "cp850", "cp852", "cp855",
"cp856", "cp857", "cp860", "cp861", "cp862", "cp863", "cp864", "cp865", "cp866", "cp869", "cp874", "cp875", "cp932", "cp949",
"cp950", "cp1006", "cp1026", "cp1140", "cp1250", "cp1251", "cp1252", "cp1253", "cp1254", "cp1255", "cp1256", "cp1257", "cp1258",
"euc_jp", "euc_jis_2004", "euc_jisx0213", "euc_kr", "gb2312", "gbk", "gb18030", "hz", "iso2022_jp", "iso2022_jp_1", "iso2022_jp_2",
"iso2022_jp_2004", "iso2022_jp_3", "iso2022_jp_ext", "iso2022_kr", "latin_1", "iso8859_2", "iso8859_3", "iso8859_4", "iso8859_5",
"iso8859_6", "iso8859_7", "iso8859_8", "iso8859_9", "iso8859_10", "iso8859_13", "iso8859_14", "iso8859_15", "johab", "koi8_r", "koi8_u",
"mac_cyrillic", "mac_greek", "mac_iceland", "mac_latin2", "mac_roman", "mac_turkish", "ptcp154", "shift_jis", "shift_jis_2004",
"shift_jisx0213", "utf_32", "utf_32_be", "utf_32_le", "utf_16", "utf_16_be", "utf_16_le", "utf_7", "utf_8_sig" ]
def to_unicode(string):
    '''Decode a byte string by trying each encoding in turn.'''
    for enc in encodings:
        try:
            logging.debug("unicoder is trying " + enc + " encoding")
            decoded = string.decode(enc)
            logging.info("unicoder is using " + enc + " encoding")
            return decoded
        except UnicodeDecodeError:
            if enc == encodings[-1]:
                raise UnicodingError("still don't recognise encoding after trying to guess.")
Answer 0 (score: 9)
There are two general-purpose libraries for detecting unknown encodings: chardet and UnicodeDammit. chardet is supposed to be a port of the way that Firefox does it.
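Neither library appears in the question's snippet; a minimal sketch of the chardet route (assuming the third-party chardet package is installed, with a placeholder file name) might look like this:

import chardet

# Sketch: let chardet guess the encoding of raw bytes, then decode with its guess.
with open("page.html", "rb") as f:          # placeholder path; any undecoded bytes work
    raw = f.read()

guess = chardet.detect(raw)                 # e.g. {'encoding': 'utf-8', 'confidence': 0.99}
if guess["encoding"]:
    text = raw.decode(guess["encoding"])
    print("decoded as %s (confidence %.2f)" % (guess["encoding"], guess["confidence"]))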
You can use the following regex to detect utf8 in byte strings:
import re
utf8_detector = re.compile(r"""^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$""", re.X)
In practice, if you're dealing with English, I've found the following works 99.9% of the time:
Answer 1 (score: 2)
I've run into the same problem and found that, without metadata about the content, there is no way to be certain of its encoding. That's why I ended up with the same approach you're trying here.
My only additional advice on what you've done is, rather than ordering the list of possible encodings from most to least likely, order it by specificity. I've found that certain character sets are subsets of others, so if you check utf_8 as your second choice, you'll never find the subsets of utf_8 (I think one of the Korean character sets uses the same number space as utf).
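To illustrate why the order matters (my own example, not from the answer): some codecs will decode almost any byte sequence without complaint, so whichever of them comes first in the list will "win" even when it is wrong.

data = b"\xb0\xa1"                     # the Korean syllable U+AC00 encoded as euc_kr

korean = data.decode("euc_kr")         # the intended text
mojibake = data.decode("latin_1")      # also decodes without error, but wrongly
print(repr(korean))                    # one code point: U+AC00
print(repr(mojibake))                  # two code points: U+00B0, U+00A1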
Answer 2 (score: 1)
Since you're using Python, you can try UnicodeDammit. It's part of Beautiful Soup, which you may also find useful.
Like its name suggests, UnicodeDammit will try to do whatever it takes to get proper unicode out of the crap you might find in the world.