I'm trying to create a small script that replaces a set of characters in a file, like this:
# coding=utf-8
import codecs
import os
import sys
args = sys.argv
if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"
    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }
    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")
        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")
        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)
        oldSubtitleFile.close()
        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()
        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)
        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"
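It works on the first run.
So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?
then after running the script on that file I get Ești sigur că vrei să ștergi fișierele?
which is what I want. But if I run it more than once, I get: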
EÈtisigurcÄvreisÄÈtergifiÈierele?
EĂtigigócvreisĂÂÂÂtergifiĂÂèlele?
EÄÂĂÂtisigurcÄÂÂ?vreisÄÂÂÂÂÂÂÂÂÂÂ?ÂÂ?giĂgitergigifi?
EĂÂÂÂÂÂÂÂÂtisigurcĂÂÂÂÂÂÂÂÂÂ?vreisĂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ?ÂÂÂÂÂÂ?
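I don't understand why. How can it find characters (ã, º, etc.) that no longer exist in the file in order to replace them? And why does it replace them with other characters at all?
Answer 0 (score: 3)
Simple: on the first run you read ISO-8859-1 and write UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8 rather than ISO-8859-1. On subsequent runs the search and replace no longer does what you intended.
This test mimics your second iteration, interpreting UTF-8 as ISO-8859-1: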
# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?
The next iteration looks like:
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?
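Heed @Daniel's advice: decode once, replace Unicode with Unicode, then encode once. I've also been told it is better to use io.open() rather than codecs, because it is Python 3 compatible and handles universal newlines properly.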
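The decode-once, replace, encode-once flow can be sketched with io.open() as follows. This is only an illustration under stated assumptions: the function name convert_subtitles, its parameters, and the choice to write to a separate destination path are made up here and do not come from the question.

```python
# -*- coding: utf-8 -*-
import io

# Unicode-to-Unicode replacement table (same pairs as in the question).
REPLACE_PAIRS = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}


def convert_subtitles(src_path, dst_path, src_encoding="iso-8859-1"):
    """Decode once, replace on Unicode text, encode once as UTF-8.

    Hypothetical helper: the name and signature are assumptions.
    """
    # newline="" keeps the original line endings untouched on read and write.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()  # decoded exactly once
    for old, new in REPLACE_PAIRS.items():
        text = text.replace(old, new)  # Unicode replaced with Unicode
    with io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)  # encoded exactly once
```

Writing to a different path than the input leaves the ISO-8859-1 original in place, so an accidental second run does not read UTF-8 output as ISO-8859-1.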
Answer 1 (score: 0)
Don't work with encoded content. Encode only when writing the new file:
import codecs
import os
import sys
args = sys.argv
if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"
    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }
    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()
        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)
        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)
        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)
        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"
Answer 2 (score: 0)
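It is incorrect to use the "ISO-8859-1" character encoding on "utf-8" content: the first time you run the script, it takes a text file (presumably "ISO-8859-1"-encoded) and saves it as "utf-8" while replacing some Unicode characters.
The second time you run the conversion, it takes what is now "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.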
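To avoid the confusion, separate the replacements from the change of character encoding. The replacements on their own are then idempotent.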
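A minimal Python 3 sketch of why the replacement step alone is idempotent, assuming the text has already been decoded correctly. The names TABLE and fix_romanian are hypothetical and used only for illustration.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: apply only the character replacements to decoded text.

# str.maketrans accepts a dict whose keys are single characters.
TABLE = str.maketrans({
    "ã": "ă",
    "Ã": "Ă",
    "º": "ș",
    "ª": "Ș",
    "þ": "ț",
    "Þ": "Ț",
})


def fix_romanian(text):
    """Map the Latin-1 stand-in characters to the intended Romanian letters."""
    return text.translate(TABLE)


once = fix_romanian("Eºti sigur cã vrei sã ºtergi fiºierele?")
twice = fix_romanian(once)

# No key of the table is also one of its values, so a second pass is a no-op.
print(once == twice)
```

The second pass changes nothing only because no encoding step sits between the two calls. Once a decode with the wrong encoding is mixed in, that guarantee is lost.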
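To do the replacements, you can use the fileinput module and unicode.translate():
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
from __future__ import print_function  # so print(..., end="") also works on Python 2
import fileinput  # read files given on the command-line and/or stdin
replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
# translate() expects code points as keys: key => ord(key)
replacements = {ord(key): value for key, value in replacements.items()}
for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line already ends with its own newline, so suppress the one print() adds
    print(line.translate(replacements), end="")
To control the encoding of the output file, you can set PYTHONIOENCODING, for example in bash:
$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8
This command both replaces the characters and transcodes the input from "iso-8859-1" to "utf-8".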
If the input filename.txt is already corrupted (no single character encoding decodes it correctly), you could try the ftfy module, which fixes common encoding errors.
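For the specific kind of damage discussed in this thread (UTF-8 bytes read as ISO-8859-1, possibly several times), the mis-decoding can sometimes be reversed with the standard library alone. The sketch below is not ftfy and is much less general. The name undo_misread and the max_rounds parameter are hypothetical. It also cannot recover text in which characters were replaced between rounds, as happened with the script in the question.

```python
# -*- coding: utf-8 -*-
# Python 3 sketch: undo repeated "UTF-8 read as ISO-8859-1" damage.


def undo_misread(text, max_rounds=5):
    """Reverse the mis-decoding until the text no longer round-trips."""
    for _ in range(max_rounds):
        try:
            # Recover the bytes that were misread, then decode them properly.
            candidate = text.encode("iso-8859-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) damage of this kind
        if candidate == text:
            break  # pure ASCII: nothing left to undo
        text = candidate
    return text


good = "Ești sigur că vrei să ștergi fișierele?"
# Simulate two bad runs: each one reads UTF-8 bytes as ISO-8859-1.
damaged = good.encode("utf-8").decode("iso-8859-1")
damaged = damaged.encode("utf-8").decode("iso-8859-1")
print(undo_misread(damaged) == good)
```

Text that is genuinely Latin-1 but happens to also form valid UTF-8 would be altered by this heuristic, so it is best run only on files known to be damaged in this way.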