Python library for translating multi-byte characters into 7-bit ASCII

Date: 2016-07-07 15:38:32

Tags: python encoding python-requests

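Is there a Python library that converts multi-byte non-ASCII characters into a reasonable 7-bit displayable ASCII form? The goal is to avoid hard-coding the charmap mentioned in the answer to "Translating multi-byte characters into 7-bit ASCII in Python".

Edit: I am currently using Python 2.7.11 or later rather than Python 3, but answers that give Python 3 solutions will still be considered useful.

The reason is that when I do the translation by hand, I miss some characters.

My script is: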

#!/usr/bin/env python2
# -*- mode: python; -*-

import os
import re
import requests

url = "https://system76.com/laptops/kudu"

#
# Load the text from request as a true unicode string:
#
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text  # ok, data is a true unicode string

# translate offending characters in unicode:

charmap = {
    0x2014: u'-',   # em dash
    0x201D: u'"',   # comma quotation mark, double
    # etc.
}
data = data.translate(charmap)
tdata = data.encode('ascii')

The error I get is:

./simple_wget
Traceback (most recent call last):
  File "./simple_wget.py", line 25, in <module>
    tdata = data.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 10166: ordinal not in range(128)

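Updating the charmap for every newly discovered character would be a never-ending battle. Is there a Python library that provides this charmap, so that I do not have to hard-code it this way?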

3 answers:

Answer 0: (score: 3)

You could consider the unicodedata Python package. One method you may find interesting is normalize (see also the usage example given at peterbe.com):

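import unicodedata

foo = u'abcdéfg'
# normalize() takes the normalization form as its first argument;
# NFKD splits 'é' into 'e' plus a combining accent, which 'ignore' then drops.
unicodedata.normalize('NFKD', foo).encode('ascii', 'ignore')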
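
A minimal sketch of what the normalize approach does and does not cover, written for Python 3 (an assumption, since the question targets Python 2.7). The helper name nfkd_to_ascii and the sample strings are illustrative only. NFKD decomposes an accented letter into a base letter plus a combining mark, so the 'ignore' handler keeps the base letter; characters that have no decomposition, such as dashes and curly quotes, are simply dropped.

```python
import unicodedata


def nfkd_to_ascii(text):
    """Decompose with NFKD, then drop whatever still is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented = nfkd_to_ascii("abcd\u00e9fg")    # e with acute accent
dashed = nfkd_to_ascii("2010\u20132016")    # en dash between the years
quoted = nfkd_to_ascii("\u2018kudu\u2019")  # curly single quotes

print(repr(accented))  # 'abcdefg'
print(repr(dashed))    # '20102016'  (the en dash is lost)
print(repr(quoted))    # 'kudu'      (the quotes are lost)
```

So this approach is fine for stripping accents, but it deletes punctuation instead of translating it.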

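Answer 1: (score: 1)

str.encode() has an optional 'errors' parameter that can replace unencodable characters instead of raising an error. Is that what you are looking for?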

https://docs.python.org/3/howto/unicode.html#converting-to-bytes
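
As a rough illustration of that parameter, the sketch below runs one sample string through four of the standard error handlers in Python 3. The sample text and the variable names are made up for the example. Note that none of these handlers produces a readable substitute such as '-' for a dash; they drop, mask, or escape the character.

```python
text = "caf\u00e9 \u2014 done"  # contains e-acute and an em dash

ignored = text.encode("ascii", errors="ignore")              # drop the character
replaced = text.encode("ascii", errors="replace")            # substitute '?'
xml_refs = text.encode("ascii", errors="xmlcharrefreplace")  # numeric XML reference
escaped = text.encode("ascii", errors="backslashreplace")    # Python-style escape

print(ignored)   # b'caf  done'
print(replaced)  # b'caf? ? done'
print(xml_refs)  # b'caf&#233; &#8212; done'
print(escaped)   # b'caf\\xe9 \\u2014 done'
```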

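Answer 2: (score: 0)

(Note: this answer applies to Python 2.7.11+.)

The answer at https://stackoverflow.com/a/1701378/257924 points to the Unidecode package, which is what I wanted. While using that package I also found the ultimate source of my confusion, which is explained in detail at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output, specifically in this section: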

  

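Frustration #3: Inconsistent treatment of output

Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good, but there is at least one notable caveat. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte str... but it will throw an exception if the bytes are non-ASCII: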
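
The quoted passage describes Python 2 behaviour. A rough Python 3 analogue, offered only as a sketch, is a text stream whose encoding is ASCII: writing non-ASCII text to it fails, while a stream with an explicit UTF-8 encoding succeeds. The helper write_text and the sample string are hypothetical names chosen for this illustration.

```python
import io

text = "caf\u00e9 \u2013 ok"  # e-acute and an en dash


def write_text(value, encoding):
    """Write value through a text layer with the given encoding; return the bytes produced."""
    raw = io.BytesIO()
    writer = io.TextIOWrapper(raw, encoding=encoding)
    writer.write(value)
    writer.flush()
    result = raw.getvalue()
    writer.close()
    return result


try:
    write_text(text, "ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # The ASCII codec cannot represent the non-ASCII characters.
    ascii_failed = True

utf8_bytes = write_text(text, "utf-8")

print(ascii_failed)  # True
print(utf8_bytes)    # b'caf\xc3\xa9 \xe2\x80\x93 ok'
```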

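Below is my demo script that uses Unidecode. The characters listed in the names variable are the ones I need translated into something readable, rather than deleted, for the kind of web pages I am analyzing.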

#!/usr/bin/env python2
# -*- mode: python; coding: utf-8 -*-
# The above coding is needed to avoid this error: SyntaxError: Non-ASCII character '\xe2' in file ./unicodedata_normalize_test.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

import os
import re
import unicodedata
from unidecode import unidecode

names = [
    'HYPHEN-MINUS',
    'EM DASH',
    'EN DASH',
    'MINUS SIGN',
    'APOSTROPHE',
    'LEFT SINGLE QUOTATION MARK',
    'RIGHT SINGLE QUOTATION MARK',
    'LATIN SMALL LETTER A WITH ACUTE',
]

for name in names:
    character = unicodedata.lookup(name)
    unidecoded = unidecode(character)
    print
    print 'name      ',name
    print 'character ',character
    print 'unidecoded',unidecoded

Sample output from the above script:

censored@censored:~$ unidecode_test

name       HYPHEN-MINUS
character  -
unidecoded -

name       EM DASH
character  —
unidecoded --

name       EN DASH
character  –
unidecoded -

name       MINUS SIGN
character  −
unidecoded -

name       APOSTROPHE
character  '
unidecoded '

name       LEFT SINGLE QUOTATION MARK
character  ‘
unidecoded '

name       RIGHT SINGLE QUOTATION MARK
character  ’
unidecoded '

name       LATIN SMALL LETTER A WITH ACUTE
character  á
unidecoded a

The following more detailed script loads several web pages that contain many unicode characters. See the comments in the script:

#!/usr/bin/env python2
# -*- mode: python; coding: utf-8 -*-

import os
import re
import subprocess
import requests
from unidecode import unidecode

urls = [
    'https://system76.com/laptops/kudu',
    'https://stackoverflow.com/a/38249916/257924',
    'https://www.peterbe.com/plog/unicode-to-ascii',
    'https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472',
    # Uncomment the following to show that this script works without throwing exceptions, but at the expense of a huge amount of diff output:
    ###'https://en.wikipedia.org/wiki/List_of_Unicode_characters',
]

# The following variable settings represent what just works without throwing exceptions.
# Setting re_encode to False and not_encode to True results in the write function throwing an exception of
#
#    Traceback (most recent call last):
#      File "./simple_wget.py", line 52, in <module>
#        file_fp.write(data[ext])
#    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 33511: ordinal not in range(128)
#
# This is the crux of my confusion and is explained by https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output
# So this is why we set re_encode to True and not_encode to False below:
force_utf_8 = False
re_encode = True
not_encode = False
do_unidecode = True

for url in urls:
    #
    # Load the text from request as a true unicode string:
    #
    r = requests.get(url)
    print "\n\n\n"
    print "url:",url
    print "current encoding:",r.encoding

    data = {}

    if force_utf_8:
        # The next two lines do not work. They cause the write to fail:
        r.encoding = "UTF-8"
        data['old'] = r.text  # ok, data is a true unicode string

    if re_encode:
        data['old'] = r.text.encode(r.encoding)

    if not_encode:
        data['old'] = r.text

    if do_unidecode:
        # translate offending characters in unicode:
        data['new'] = unidecode(r.text)

    html_base = re.sub(r'[^a-zA-Z0-9_-]+', '__', url)
    diff_cmd = "diff "
    for ext in [ 'old', 'new' ]:
        if ext in data:
            print "ext:",ext
            html_file = "{}.{}.html".format(html_base, ext)
            with open(html_file, 'w') as file_fp:
                file_fp.write(data[ext])
                print "Wrote",html_file
            diff_cmd = diff_cmd + " " + html_file

    if 'old' in data and 'new' in data:
        print 'Executing:',diff_cmd
        subprocess.call(diff_cmd, shell=True)

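A gist shows the output of the above script: it runs the Linux diff command on the "old" and "new" HTML files so the translations can be seen. There will be mistranslations for languages such as German, but that is acceptable for my purpose, which is to get a lossy translation of single-quote, double-quote, and dash characters.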
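
If only quotes and dashes matter, one possible alternative (not what this answer uses, and not part of Unidecode) is to derive the replacements from the Unicode database instead of a hand-written charmap. The sketch below is Python 3 and standard library only; the function name ascii_punctuation and its mapping rules are assumptions made for illustration. It maps dash punctuation (category Pd) and MINUS SIGN to '-', maps initial and final quote punctuation (categories Pi and Pf) to ' or ", and leaves every other character, including accented letters, untouched.

```python
import unicodedata


def ascii_punctuation(text):
    """Replace non-ASCII dashes and quotes with ASCII look-alikes; keep all other characters."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        category = unicodedata.category(ch)
        name = unicodedata.name(ch, "")
        if category == "Pd" or name == "MINUS SIGN":
            out.append("-")
        elif category in ("Pi", "Pf"):
            # Quote names contain DOUBLE or SINGLE; anything not DOUBLE maps to an apostrophe.
            out.append('"' if "DOUBLE" in name else "'")
        else:
            out.append(ch)
    return "".join(out)


sample = "\u201cGr\u00fc\u00dfe\u201d \u2013 it\u2019s"
converted = ascii_punctuation(sample)
print(converted)  # "Grüße" - it's
```

Unlike Unidecode, the result is not guaranteed to be pure ASCII, and an em dash becomes a single '-' here rather than '--'; whether that trade-off is acceptable depends on the pages being processed.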