I am trying to normalize accented characters in a string in Python 3, like this:
from bs4 import BeautifulSoup
import os

def process_markup():
    # the file is utf-8 encoded
    fn = os.path.join(os.path.dirname(__file__), 'src.txt')
    markup = BeautifulSoup(open(fn), from_encoding="utf-8")
    for player in markup.find_all("div", class_="glossary-player"):
        text = player.span.string
        print(format_filename(text))  # Python console shows mangled characters not in utf-8
        player.span.string.replace_with(format_filename(text))
    dest = open("dest.txt", "w", encoding="utf-8")
    dest.write(str(markup))

def format_filename(s):
    # prepare string
    s = s.strip().lower().replace(" ", "-").strip("'")
    # transliterate accented characters to non-accented versions
    chars_in = "àèìòùáéíóú"
    chars_out = "aeiouaeiou"
    no_accented_chars = str.maketrans(chars_in, chars_out)
    return s.translate(no_accented_chars)

process_markup()
The input file src.txt is utf-8 encoded:
<div class="glossary-player">
<span class="gd"> Fàilte </span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd"> àèìòùáéíóú </span><span class="en"> aeiouaeiou </span>
</div>
The output file dest.txt looks like this:
<div class="glossary-player">
<span class="gd">fã ilte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">ã ã¨ã¬ã²ã¹ã¡ã©ãã³ãº</span><span class="en"> aeiouaeiou </span>
</div>
I would like it to look like this:
<div class="glossary-player">
<span class="gd">failte</span><span class="en"> Welcome </span>
</div>
<div class="glossary-player">
<span class="gd">aeiouaeiou</span><span class="en"> aeiouaeiou </span>
</div>
I know there are solutions such as unidecode, but I just want to know what I am doing wrong here.
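Answer 0 (score: 3)
chars.translate(no_accented_chars) does not modify chars. It returns a new string with the translation applied. If you want to use the translated string, save it to a variable (perhaps the original chars variable):
chars = chars.translate(no_accented_chars)
or pass it directly to the write call:
dest.write(chars.translate(no_accented_chars))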
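A minimal sketch of that point, using only the standard library. The sample word and the variable name unchanged are illustrative assumptions and do not come from the question's code:

```python
# Translation table equivalent to the one built in the question.
no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")

chars = "fàilte"

# Strings are immutable: translate() returns a new string and the result
# of this call is simply thrown away.
chars.translate(no_accented_chars)
unchanged = chars  # still holds the accented text

# Rebinding the name keeps the translated result.
chars = chars.translate(no_accented_chars)
```

After this runs, unchanged still holds the accented word and chars holds the version without accents.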
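Answer 1 (score: 1)
I strongly suspect that your HTML file contains something like
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
which basically forces BeautifulSoup to reinterpret the UTF-8 as ISO-8859-1 (or whatever legacy character set you have; Windows-1252? Shudder).
There are many other places where a charset= attribute can be attached to a chunk of HTML, but this is probably the typical culprit.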
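Whatever causes the wrong charset to be chosen, the effect can be reproduced without BeautifulSoup. The sketch below assumes the wrong charset is cp1252 and copies format_filename from the question. It decodes the same UTF-8 bytes two ways to show why the translation table never matches the mangled text:

```python
def format_filename(s):
    # Same logic as the function in the question.
    s = s.strip().lower().replace(" ", "-").strip("'")
    no_accented_chars = str.maketrans("àèìòùáéíóú", "aeiouaeiou")
    return s.translate(no_accented_chars)

raw = "Fàilte".encode("utf-8")   # 'à' becomes the two bytes C3 A0

wrong = raw.decode("cp1252")     # each byte becomes its own character
right = raw.decode("utf-8")      # the two bytes become one 'à' again

mangled = format_filename(wrong)  # contains 'ã' plus a no-break space
clean = format_filename(right)    # accents are replaced as intended
```

In the wrongly decoded string the letter à no longer exists as a single character, so there is nothing for str.translate to replace. This matches the fã ilte seen in dest.txt.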
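Answer 2 (score: 0)
The problem was, as triplee suggested, that the file was being interpreted in the wrong encoding.
The data in the file is correct (as a hex dump shows), but, possibly because of a missing charset declaration, Python was not reading it as utf-8 but as cp1252.
To fix this, the encoding has to be stated explicitly when opening the file with Python's open() function, so the line:
markup = BeautifulSoup(open(fn), from_encoding="utf-8")
needs to be changed to: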
markup = BeautifulSoup(open(fn, encoding="utf-8"))
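The effect of the encoding argument can be checked in isolation. When it is omitted, open() in text mode falls back to a platform-dependent default, which is often cp1252 on Windows. The sketch below writes UTF-8 bytes to a temporary file (a hypothetical stand-in for src.txt) and reads it back with both encodings, passing cp1252 explicitly so the result does not depend on the machine:

```python
import os
import tempfile

# Write UTF-8 bytes to a throwaway file standing in for src.txt.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("Fàilte".encode("utf-8"))
    path = tmp.name

try:
    # Explicit encoding: the text is decoded as intended.
    with open(path, encoding="utf-8") as f:
        as_utf8 = f.read()
    # Simulates what happens when the default turns out to be cp1252.
    with open(path, encoding="cp1252") as f:
        as_cp1252 = f.read()
finally:
    os.remove(path)
```

Only the read with encoding="utf-8" gives back the original text. The cp1252 read returns the mangled form that ended up in dest.txt.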