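Is there a Python library that converts multi-byte non-ASCII characters into a reasonable form of 7-bit displayable ASCII? The goal is to avoid hard-coding the charmap mentioned in the answer to "Translating multi-byte characters into 7-bit ASCII in Python".
Edit: I am currently on Python 2.7.11 or later rather than Python 3, but answers that give Python 3 solutions will still be considered useful.
The reason: when I do the translation by hand, I inevitably miss some characters.
My script is: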
#!/usr/bin/env python
# -*- mode: python; -*-
import os
import re
import requests
url = "https://system76.com/laptops/kudu"
#
# Load the text from request as a true unicode string:
#
r = requests.get(url)
r.encoding = "UTF-8"
data = r.text # ok, data is a true unicode string
# translate offending characters in unicode:
charmap = {
0x2014: u'-', # em dash
0x201D: u'"', # comma quotation mark, double
# etc.
}
data = data.translate(charmap)
tdata = data.encode('ascii')
The error I get is:
./simple_wget
Traceback (most recent call last):
File "./simple_wget.py", line 25, in <module>
tdata = data.encode('ascii')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 10166: ordinal not in range(128)
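Updating the charmap for every newly discovered character would be a never-ending battle. Is there a Python library that provides this charmap, so that I don't have to hard-code it this way?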
Answer 0 (score: 3)
You could consider the unicodedata Python package. One of the functions you may be interested in is normalize (see also the usage example given at peterbe.com):
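import unicodedata
foo = u'abcdéfg'  # a unicode string (the u prefix matters on Python 2)
# normalize() takes the normalization form as its first argument;
# NFKD splits an accented letter into a base letter plus a combining mark
unicodedata.normalize('NFKD', foo).encode('ascii', 'ignore')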
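As a rough illustration of what this approach does and does not cover, here is a minimal sketch written for Python 3. The function name strip_accents_to_ascii is made up for this example. Accented letters survive as their base letter, but characters with no decomposition, such as the em dash, are simply dropped by the 'ignore' handler, so this alone may not replace the charmap from the question.

```python
import unicodedata

def strip_accents_to_ascii(text):
    """Hypothetical helper: decompose, then drop whatever is not ASCII."""
    # NFKD turns u'\u00e9' into 'e' followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently discards the combining marks and any other non-ASCII
    # character, including dashes and curly quotes.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(strip_accents_to_ascii(u'abcd\u00e9fg'))
print(strip_accents_to_ascii(u'a\u2014b'))
```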
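Answer 1 (score: 1)
str.encode() has an optional 'errors' parameter that can replace unencodable characters instead of raising an error. Is that what you are looking for?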
https://docs.python.org/3/howto/unicode.html#converting-to-bytes
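For comparison, a small sketch (Python 3, with a sample string invented for this example) of what the standard error handlers produce. None of them picks a readable ASCII substitute for a dash or a quote; they drop, mark, or escape the character.

```python
text = u'caf\u00e9 \u2014 menu'

# Drop every character that ASCII cannot represent.
dropped = text.encode('ascii', 'ignore')
# Substitute a question mark for each such character.
marked = text.encode('ascii', 'replace')
# Emit an XML/HTML numeric character reference instead.
as_entities = text.encode('ascii', 'xmlcharrefreplace')
# Emit a Python-style backslash escape instead.
escaped = text.encode('ascii', 'backslashreplace')

for result in (dropped, marked, as_entities, escaped):
    print(result)
```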
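Answer 2 (score: 0)
(Note: this answer relates to Python 2.7.11+.)
The answer at https://stackoverflow.com/a/1701378/257924 points to the Unidecode package, which is what I was looking for. While using that package I also found the ultimate source of my confusion, which is explained in detail at https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output, in particular this section:
Frustration #3: Inconsistent treatment of output
Alright, since the python community is moving to using unicode strings everywhere, we might as well convert everything to unicode strings and use that by default, right? Sounds good most of the time, but there is at least one huge caveat to be aware of. Anytime you output text to the terminal or to a file, the text has to be converted into a byte str. Python will try to implicitly convert from unicode to byte str... but it will throw an exception if the bytes are non-ASCII: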
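The quoted passage describes Python 2 behavior. As a hedged illustration of the same failure mode, the sketch below (which also runs on Python 3) writes a sample string, invented for this example, to an in-memory stream that is restricted to ASCII, and contrasts that with encoding explicitly.

```python
import io

text = u'caf\u00e9 \u2013 menu'

# Encoding explicitly makes the unicode-to-bytes conversion visible.
utf8_bytes = text.encode('utf-8')

# A text stream limited to ASCII rejects the same text, which mirrors
# the UnicodeEncodeError shown in the question.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding='ascii')
try:
    ascii_stream.write(text)
    ascii_stream.flush()
    failed = False
except UnicodeEncodeError:
    failed = True

print(failed)
```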
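Below is my demonstration script that uses it. The characters listed in the names variable are the ones I need translated into something readable, rather than removed, for the kind of web pages I am analyzing.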
#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-
# The above coding is needed to avoid this error: SyntaxError: Non-ASCII character '\xe2' in file ./unicodedata_normalize_test.py on line 9, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
import os
import re
import unicodedata
from unidecode import unidecode
names = [
'HYPHEN-MINUS',
'EM DASH',
'EN DASH',
'MINUS SIGN',
'APOSTROPHE',
'LEFT SINGLE QUOTATION MARK',
'RIGHT SINGLE QUOTATION MARK',
'LATIN SMALL LETTER A WITH ACUTE',
]
for name in names:
    # Look up the character by its official Unicode name, then transliterate it.
    character = unicodedata.lookup(name)
    unidecoded = unidecode(character)
    print
    print 'name ',name
    print 'character ',character
    print 'unidecoded',unidecoded
Sample output of the above script:
censored@censored:~$ unidecode_test
name HYPHEN-MINUS
character -
unidecoded -
name EM DASH
character —
unidecoded --
name EN DASH
character –
unidecoded -
name MINUS SIGN
character −
unidecoded -
name APOSTROPHE
character '
unidecoded '
name LEFT SINGLE QUOTATION MARK
character ‘
unidecoded '
name RIGHT SINGLE QUOTATION MARK
character ’
unidecoded '
name LATIN SMALL LETTER A WITH ACUTE
character á
unidecoded a
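If installing Unidecode is not an option, the Unicode character names used above can drive a much cruder fallback built only on the standard library. The sketch below is for Python 3. The helper names ascii_fallback and to_ascii are hypothetical, and the name-matching rules are my own guesses covering just quotes, dashes, and accented letters, so they are far less complete than Unidecode's tables.

```python
import unicodedata

def ascii_fallback(ch):
    """Hypothetical helper: rough ASCII stand-in for a single character."""
    if ord(ch) < 128:
        return ch
    name = unicodedata.name(ch, '')
    # Assumed rules: decide by keywords in the official Unicode name.
    if 'DOUBLE' in name and 'QUOTATION MARK' in name:
        return '"'
    if 'QUOTATION MARK' in name:
        return "'"
    if 'DASH' in name or 'HYPHEN' in name or name == 'MINUS SIGN':
        return '-'
    # Otherwise keep only the ASCII part of the compatibility decomposition,
    # which strips accents and drops anything with no ASCII base.
    decomposed = unicodedata.normalize('NFKD', ch)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def to_ascii(text):
    """Hypothetical helper: apply ascii_fallback to every character."""
    return ''.join(ascii_fallback(ch) for ch in text)

print(to_ascii(u'\u201cHi\u201d \u2014 caf\u00e9'))
```

Unlike Unidecode, this maps the em dash to a single hyphen and silently drops characters it has no rule for.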
The following more elaborate script loads several web pages that contain many Unicode characters. See the comments in the script:
#!/usr/bin/env python
# -*- mode: python; coding: utf-8 -*-
import os
import re
import subprocess
import requests
from unidecode import unidecode
urls = [
'https://system76.com/laptops/kudu',
'https://stackoverflow.com/a/38249916/257924',
'https://www.peterbe.com/plog/unicode-to-ascii',
'https://stackoverflow.com/questions/227459/ascii-value-of-a-character-in-python?rq=1#comment35813354_227472',
# Uncomment the following to show that this script works without throwing exceptions, but at the expense of a huge amount of diff output:
###'https://en.wikipedia.org/wiki/List_of_Unicode_characters',
]
# The following variable settings represent what just works without throwing exceptions.
# Setting re_encode to False and not_encode to True results in the write function throwing an exception of
#
# Traceback (most recent call last):
# File "./simple_wget.py", line 52, in <module>
# file_fp.write(data[ext])
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 33511: ordinal not in range(128)
#
# This is the crux of my confusion and is explained by https://pythonhosted.org/kitchen/unicode-frustrations.html#frustration-3-inconsistent-treatment-of-output
# So this is why we set re_encode to True and not_encode to False below:
force_utf_8 = False
re_encode = True
not_encode = False
do_unidecode = True
for url in urls:
    #
    # Load the text from request as a true unicode string:
    #
    r = requests.get(url)
    print "\n\n\n"
    print "url:",url
    print "current encoding:",r.encoding
    data = {}
    if force_utf_8:
        # The next two lines do not work. They cause the write to fail:
        r.encoding = "UTF-8"
        data['old'] = r.text # ok, data is a true unicode string
    if re_encode:
        data['old'] = r.text.encode(r.encoding)
    if not_encode:
        data['old'] = r.text
    if do_unidecode:
        # translate offending characters in unicode:
        data['new'] = unidecode(r.text)
    html_base = re.sub(r'[^a-zA-Z0-9_-]+', '__', url)
    diff_cmd = "diff "
    for ext in [ 'old', 'new' ]:
        if ext in data:
            print "ext:",ext
            html_file = "{}.{}.html".format(html_base, ext)
            with open(html_file, 'w') as file_fp:
                file_fp.write(data[ext])
            print "Wrote",html_file
            diff_cmd = diff_cmd + " " + html_file
    if 'old' in data and 'new' in data:
        print 'Executing:',diff_cmd
        subprocess.call(diff_cmd, shell=True)
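A gist shows the output of the above script: it runs the Linux diff command on the "old" and "new" html files so that the translations can be inspected. Languages such as German will be mistranslated, but that is acceptable for my purpose, which is a lossy translation of single- and double-quote characters and dash characters.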
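The same comparison can also be done in-process instead of shelling out to diff. The following is only a sketch, assuming Python 3 and the standard difflib module. The function name show_changes and the sample strings are invented for this example, and the "new" text is typed by hand rather than produced by unidecode.

```python
import difflib

def show_changes(old_text, new_text, label='page'):
    """Hypothetical helper: unified diff between original and transliterated text."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # lineterm='' because splitlines() already removed the newlines.
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile=label + '.old', tofile=label + '.new', lineterm=''))

old = u'price \u2014 $999\nunchanged line'
new = u'price -- $999\nunchanged line'
for line in show_changes(old, new):
    print(line)
```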