Replace unicode characters only once in Python

Time: 2015-03-29 19:34:40

Tags: python unicode

I'm trying to create a small script that replaces a set of characters in a file, as follows:

# coding=utf-8

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": "ă",
        u"Ã": "Ă",
        u"º": "ș",
        u"ª": "Ș",
        u"þ": "ț",
        u"Þ": "Ț",
    }

    if os.path.isfile(subtitleFileName):
        oldSubtitleFile = codecs.open(subtitleFileName, "rb", "ISO-8859-1")

        subtitleContent = oldSubtitleFile.read()
        subtitleContent = codecs.encode(subtitleContent, "utf-8")

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(codecs.encode(key, "utf-8"), value)

        oldSubtitleFile.close()

        newSubtitleFile = open(newSubtitleFileName, "wb")
        newSubtitleFile.write(subtitleContent)
        newSubtitleFile.close()

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"

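It works on the first run.

So if I have a file containing Eºti sigur cã vrei sã ºtergi fiºierele?, then after running the script on that file I get Ești sigur că vrei să ștergi fișierele?, which is what I want. But if I run it several more times, I get: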

  

EÈtisigurcÄvreisÄÈtergifiÈierele?

     

EĂtigigócvreisĂÂÂÂtergifiĂÂèlele?

     

EÄÂĂÂtisigurcÄÂÂ?vreisÄÂÂÂÂÂÂÂÂÂÂ?ÂÂ?giĂgitergigifi?

     

EĂÂÂÂÂÂÂÂÂtisigurcĂÂÂÂÂÂÂÂÂÂ?vreisĂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ?ÂÂÂÂÂÂ?

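I don't understand why. How can it find characters that are not in the file (ã, º, etc.) in order to replace them? And why does it replace them with yet other characters?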

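3 Answers:

Answer 0 (score: 3)

Simple: on the first run you are reading ISO-8859-1 and writing UTF-8. On the second run you do exactly the same thing, even though the input is now UTF-8 rather than ISO-8859-1. The UTF-8 bytes get misread as ISO-8859-1 characters, so on subsequent runs the search and replace no longer matches what you expect.

This test mimics your second iteration, interpreting UTF-8 as ISO-8859-1: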

# -*- coding: utf-8 -*-
print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1")
>> EÈti sigur cÄ vrei sÄ Ètergi fiÈierele?

The next iteration looks like:

print "Ești sigur că vrei să ștergi fișierele?".decode("ISO-8859-1").encode("utf-8").decode("ISO-8859-1")
>> EÃÂti sigur cÃÂ vrei sÃÂ ÃÂtergi fiÃÂierele?
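
The two tests above are Python 2. A rough Python 3 equivalent, offered only as a sketch, makes the growth visible and shows that each round can be undone, because ISO-8859-1 maps every byte to exactly one character. The helper names `misread_once` and `undo_once` are made up for this illustration.

```python
# Python 3 sketch: simulate re-running the converter on its own output.
original = "Ești sigur că vrei să ștergi fișierele?"


def misread_once(text):
    """Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def undo_once(text):
    """Reverse one round of misreading (hypothetical helper)."""
    return text.encode("iso-8859-1").decode("utf-8")


once = misread_once(original)
twice = misread_once(once)

# Each non-ASCII character becomes several characters, so the text grows.
print(len(original), len(once), len(twice))

# Undoing the same number of rounds restores the original text.
assert undo_once(undo_once(twice)) == original
```

This only reverses pure misreading; in the question's script the replacements also ran on each pass, so real files may not round-trip this cleanly.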

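Heed @Daniel's advice: decode once, replace Unicode with Unicode, then encode once. I've also been told that it is better to use io.open() rather than codecs, because it is Python 3 compatible and handles universal newlines.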
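
As a minimal sketch of that advice, assuming Python 3 and a source file that really is ISO-8859-1, the whole conversion could look like the following. The function name `convert_subtitles` and its parameters are illustrative, not part of the original script.

```python
import io

# Same mapping as in the question, with every key and value as text (str).
REPLACE_PAIRS = {
    "ã": "ă", "Ã": "Ă",
    "º": "ș", "ª": "Ș",
    "þ": "ț", "Þ": "Ț",
}


def convert_subtitles(src_path, dst_path,
                      src_encoding="iso-8859-1", dst_encoding="utf-8"):
    """Decode once, replace text with text, encode once."""
    # newline="" keeps the file's own line endings untouched.
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        content = src.read()

    for old, new in REPLACE_PAIRS.items():
        content = content.replace(old, new)

    with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(content)
```

Running this a second time on its own output would still misread the UTF-8 file as ISO-8859-1, so it has to be run on the original file only.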

Answer 1 (score: 0)

Don't work with encoded content. Encode only when writing the new file:

import codecs
import os
import sys

args = sys.argv

if len(args) > 1:
    subtitleFileName = args[1]
    newSubtitleFileName = subtitleFileName + "_new"

    replacePairs = {
        u"ã": u"ă",
        u"Ã": u"Ă",
        u"º": u"ș",
        u"ª": u"Ș",
        u"þ": u"ț",
        u"Þ": u"Ț",
    }

    if os.path.isfile(subtitleFileName):
        with codecs.open(subtitleFileName, "rb", "ISO-8859-1") as oldSubtitleFile:
            subtitleContent = oldSubtitleFile.read()

        for key, value in replacePairs.iteritems():
            subtitleContent = subtitleContent.replace(key, value)

        with codecs.open(newSubtitleFileName, "wb", "utf-8") as newSubtitleFile:
            newSubtitleFile.write(subtitleContent)

        os.remove(subtitleFileName)
        os.rename(newSubtitleFileName, subtitleFileName)

        print "Done!"
    else:
        print "Missing subtitle file!"
else:
    print "Missing arguments!"
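
The script above still assumes its input is ISO-8859-1 every time it runs. One possible guard, sketched here in Python 3 and not part of the answer, is to skip files whose bytes already decode cleanly as UTF-8. The name `looks_like_utf8` is made up, and the check is only a heuristic: pure ASCII files also pass (harmless, since they contain nothing to replace), and a few ISO-8859-1 byte sequences happen to be valid UTF-8.

```python
def looks_like_utf8(raw_bytes):
    """Heuristic: True when the bytes decode cleanly as UTF-8."""
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An ISO-8859-1 encoded line is rejected, its UTF-8 form is accepted.
print(looks_like_utf8("Eºti sigur".encode("iso-8859-1")))
print(looks_like_utf8("Ești sigur".encode("utf-8")))
```

A caller could read the file in binary mode first and run the conversion only when this returns False.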

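Answer 2 (score: 0)

It is incorrect to use the "ISO-8859-1" character encoding on "utf-8" content: the first time you run the script, it takes a text file (presumably "ISO-8859-1"-encoded) and saves it as "utf-8" while replacing some Unicode characters.

Then you run the conversion a second time, and now it takes the "utf-8" content and tries to interpret it as "ISO-8859-1", which is wrong.

To avoid the confusion, separate the replacement from the change of character encoding. That way the replacements are idempotent.

To make the replacements, you could use the fileinput module and unicode.translate():

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""Replace some characters in 'iso-8859-1'-encoded files."""
from __future__ import print_function

import fileinput  # read files given on the command-line and/or stdin

replacements = {
    u"ã": u"ă",
    u"Ã": u"Ă",
    u"º": u"ș",
    u"ª": u"Ș",
    u"þ": u"ț",
    u"Þ": u"Ț",
}
# key => ord(key)
replacements = dict(zip(map(ord, replacements.keys()), replacements.values()))

for line in fileinput.input(openhook=fileinput.hook_encoded("iso-8859-1")):
    # each line keeps its own newline, so suppress the one print would add
    print(line.translate(replacements), end="")

To control the encoding of the output file, you could set PYTHONIOENCODING, for example in bash:

$ PYTHONIOENCODING=utf-8 python replace-chars.py iso-8859-1.txt >replaced.utf-8

This command both replaces the characters and transcodes the input from "iso-8859-1" to "utf-8".

If the input filename.txt is already corrupted (no single character encoding decodes it correctly), you could try the ftfy module to fix common encoding errors.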
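
To illustrate the idempotence point from this answer, here is a small Python 3 sketch using str.maketrans and str.translate on already-decoded text. It is an adaptation, not the answer's own code, and `fix_text` is a hypothetical name.

```python
# Build a translation table once; keys are single characters.
REPLACEMENTS = str.maketrans({
    "ã": "ă", "Ã": "Ă",
    "º": "ș", "ª": "Ș",
    "þ": "ț", "Þ": "Ț",
})


def fix_text(text):
    """Replace the mis-mapped characters in decoded text."""
    return text.translate(REPLACEMENTS)


once = fix_text("Eºti sigur cã vrei sã ºtergi fiºierele?")
print(once)

# None of the replacement characters is itself a key in the table,
# so applying the function again changes nothing.
assert fix_text(once) == once
```

Because the function only ever sees text, running it twice is safe; the corruption in the question came from re-decoding the file with the wrong encoding, not from the replacements themselves.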